Closed ravihammond closed 2 years ago
Hi, this is the error that I observed when I was trying to compile with pytorch 1.7.1 (I haven't tried your script yet; this is the old problem that basically stopped me from upgrading to a newer pytorch). Unfortunately we don't have a solution to this yet. This is likely caused by a mismatch between the pybind used for this repo and the pybind used by the pytorch release build.
Any luck building pytorch from scratch? That may be the easiest fix to get you started.
Just found the version info for pybind in pytorch in the commit message https://github.com/pytorch/pytorch/tree/master/third_party
I tried using the same version but the casting still does not work. Will let you know if I have any progress.
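As a cross-check on the version found in the commit message, here is a rough sketch that reads the pybind11 version macros from the headers shipped inside an installed pytorch wheel. The header location (`include/pybind11/detail/common.h` under the torch package directory) is an assumption about how the wheel is laid out, and `parse_pybind_version` and `find_bundled_header` are hypothetical helper names, not part of any library.

```python
import re
from pathlib import Path

# Matches lines such as "#define PYBIND11_VERSION_MAJOR 2".
_VERSION_RE = re.compile(
    r"^\s*#\s*define\s+PYBIND11_VERSION_(MAJOR|MINOR|PATCH)\s+(\S+)",
    re.MULTILINE,
)


def parse_pybind_version(header_text):
    """Return 'MAJOR.MINOR.PATCH' from pybind11 header text, or None if incomplete."""
    parts = dict(_VERSION_RE.findall(header_text))
    if not {"MAJOR", "MINOR", "PATCH"} <= parts.keys():
        return None
    return f"{parts['MAJOR']}.{parts['MINOR']}.{parts['PATCH']}"


def find_bundled_header():
    """Locate the pybind11 header bundled with torch (assumed path), or None."""
    try:
        import torch
    except ImportError:
        return None
    # Assumption: the wheel ships pybind11 under <torch package>/include/pybind11.
    candidate = (
        Path(torch.__file__).parent / "include" / "pybind11" / "detail" / "common.h"
    )
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    header = find_bundled_header()
    if header is None:
        print("Could not find a bundled pybind11 header.")
    else:
        print(parse_pybind_version(header.read_text()))
```

If the printed version differs from the tag checked out in this repo's pybind submodule, that would be consistent with the mismatch suspected above, though a matching version alone may not be enough.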
First, check out pybind at the version used by pytorch. For the latest one, it should be
git checkout v2.6.2
Then adding
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\"")
on this line https://github.com/facebookresearch/off-belief-learning/blob/73e734da15758752a1fbdfea4d371eff61e2ae72/CMakeLists.txt#L8 should fix the problem.
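To confirm that the three values in the flags above match your own pytorch build, a minimal sketch like the following may help. It assumes the installed pytorch exposes the private attributes `torch._C._PYBIND11_COMPILER_TYPE`, `torch._C._PYBIND11_STDLIB`, and `torch._C._PYBIND11_BUILD_ABI`; these are internal and may be missing or `None` on some builds, so they are read with a default. The helper names are hypothetical.

```python
NAMES = ("COMPILER_TYPE", "STDLIB", "BUILD_ABI")


def read_torch_pybind_tags():
    """Read the pybind11 ABI tags recorded by the installed pytorch, if available."""
    try:
        import torch
    except ImportError:
        return {}
    # Assumption: these private attributes exist on torch._C; default to None if not.
    return {name: getattr(torch._C, f"_PYBIND11_{name}", None) for name in NAMES}


def describe_pybind_tags(values):
    """Format the tags as 'PYBIND11_<NAME>=<value>' lines for easy comparison."""
    lines = []
    for name in NAMES:
        value = values.get(name)
        shown = value if value is not None else "<not set>"
        lines.append(f"PYBIND11_{name}={shown}")
    return lines


if __name__ == "__main__":
    for line in describe_pybind_tags(read_torch_pybind_tags()):
        print(line)
```

If the printed values differ from `_gcc`, `_libstdcpp`, and `_cxxabi1011`, the definitions in CMakeLists.txt would presumably need to be changed to whatever your build reports.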
I've just tried it, and it's stopped giving the error. Thank you so much! Now I'll see if this new version of CUDA gives me silent deadlocks.
With this fix we will also switch to a newer version internally. Let's see if we can reproduce the deadlock problem ourselves.
Okay, I've successfully run obl1.sh and obl2.sh without any deadlocks, illegal move errors, or casting errors. Thanks so much for your help @hengyuan-hu!
Resolved.
Following the discussion from the issue I posted, where I was experiencing silent deadlocks when running the obl1.sh script, @hengyuan-hu suggested that I try a new CUDA version. I decided to throw a Hail Mary and try the latest version of PyTorch. Here are the details of my new software setup inside a docker container:
It compiled successfully, but when I run the script, I'm experiencing a new torchscript error:
I suspect that I'm experiencing this error because torchscript has changed in the latest PyTorch. I will investigate further and report back here once I've figured out some more information.
If you have any idea what might be causing this issue, I'd be very happy to hear your thoughts!