facebookresearch / off-belief-learning

Implementation of the Off Belief Learning algorithm.

TorchScript casting error #5

Closed · ravihammond closed 2 years ago

ravihammond commented 2 years ago

Following the discussion from the issue I posted, where I was experiencing silent deadlocks when running the obl1.sh script, @hengyuan-hu suggested that I try a new CUDA version. I decided to throw a hail mary and try the latest version of PyTorch.

Here are the details of my new software setup inside a Docker container:

It compiled successfully, but when I run the script I hit a new TorchScript error:

```
Traceback (most recent call last):
  File "selfplay.py", line 237, in <module>
    belief_model,
  File "/app/pyhanabi/act_group.py", line 45, in __init__
    runner = rela.BatchRunner(agent.clone(dev), dev)
RuntimeError: Unable to cast Python instance of type <class 'torch._C.ScriptModule'> to C++ type 'torch::jit::Module'
```

I suspect I'm hitting this error because TorchScript has changed in the latest PyTorch. I'll investigate further and report back here once I've learned more.

If you have any idea what might be causing this issue, I'd be very happy to hear your thoughts!

hengyuan-hu commented 2 years ago

Hi, this is the error I observed when trying to compile with PyTorch 1.7.1 (without having tried your script yet; this is the old problem that basically stopped me from upgrading to a newer PyTorch). Unfortunately we don't have a solution for it yet. It is likely caused by a mismatch between the pybind11 version used by this repo and the one used by the PyTorch release build.
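To make the failure mode concrete, here is a minimal sketch of the kind of binding involved, with hypothetical names rather than the repo's actual rela sources. The cast from torch._C.ScriptModule to torch::jit::Module happens inside pybind11, and it only succeeds when the extension and the PyTorch wheel agree on pybind11's internal ABI tags:

```cpp
// Hedged sketch (hypothetical names, not the repo's actual rela binding) of
// the pattern the traceback points at: a pybind11-bound C++ class whose
// constructor takes a TorchScript module from Python.
#include <pybind11/pybind11.h>
#include <torch/script.h>     // torch::jit::Module
#include <torch/extension.h>  // Python-side glue; the exact header providing
                              // the ScriptModule caster varies across versions

namespace py = pybind11;

class BatchRunnerSketch {
 public:
  BatchRunnerSketch(torch::jit::Module model, std::string device)
      : model_(std::move(model)), device_(std::move(device)) {}

 private:
  torch::jit::Module model_;
  std::string device_;
};

PYBIND11_MODULE(rela_sketch, m) {
  // Passing torch._C.ScriptModule here relies on a type caster registered by
  // PyTorch's own Python bindings. pybind11 only recognizes that caster if
  // this extension and the torch wheel were built with matching ABI tags
  // (PYBIND11_COMPILER_TYPE, PYBIND11_STDLIB, PYBIND11_BUILD_ABI); when they
  // differ, the cast fails with exactly the RuntimeError shown above.
  py::class_<BatchRunnerSketch>(m, "BatchRunner")
      .def(py::init<torch::jit::Module, std::string>());
}
```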

Any luck building PyTorch from scratch? That may be the easiest fix to get you started.

hengyuan-hu commented 2 years ago

Just found the version info for the pybind11 pinned by PyTorch; it's in the submodule commit message here: https://github.com/pytorch/pytorch/tree/master/third_party

I tried using the same version but the casting still does not work. Will let you know if I have any progress.

hengyuan-hu commented 2 years ago

First, check out pybind11 to the version used by PyTorch. For the latest release, that should be `git checkout v2.6.2`.
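Roughly, the checkout might look like this (a sketch: the third_party/pybind11 submodule path is an assumption, adjust to wherever your pybind11 checkout lives):

```sh
# Assumed submodule location; substitute your actual pybind11 path.
cd third_party/pybind11
git fetch --tags
git checkout v2.6.2
cd ../..
```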

Then adding the following at https://github.com/facebookresearch/off-belief-learning/blob/73e734da15758752a1fbdfea4d371eff61e2ae72/CMakeLists.txt#L8 should fix the problem:

```cmake
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\"")
```
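As an aside on where those three values come from: they are the pybind11 ABI tags the installed PyTorch wheel was built with. On the PyTorch versions I've checked, they are exposed on torch._C (torch.utils.cpp_extension reads the same attributes when it builds extensions), so you can print them rather than trusting the hardcoded _gcc / _libstdcpp / _cxxabi1011:

```python
import torch

# PyTorch exposes the pybind11 ABI tags its wheel was built with; if an
# attribute is missing on your version, fall back to the hardcoded values.
for name in ("_PYBIND11_COMPILER_TYPE", "_PYBIND11_STDLIB", "_PYBIND11_BUILD_ABI"):
    print(name, getattr(torch._C, name, None))  # e.g. "_gcc", "_libstdcpp", "_cxxabi1011"
```

After changing the flags, do a clean rebuild so the new defines actually take effect.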

ravihammond commented 2 years ago

I've just tried it, and the error is gone, thank you so much! Now I'll see whether this new version of CUDA still gives me silent deadlocks.

hengyuan-hu commented 2 years ago

With this fix we will also switch to a newer version internally. Let's see if we can reproduce the deadlock problem ourselves.

ravihammond commented 2 years ago

Okay, I've successfully run obl1.sh and obl2.sh without any deadlocks, illegal move errors, or casting errors. Thanks so much for your help @hengyuan-hu!

hengyuan-hu commented 2 years ago

Resolved.