aiqm / torchani

Accurate Neural Network Potential on PyTorch
https://aiqm.github.io/torchani/
MIT License

Error when loading the CUDA ANI model in C++ #609

Closed ndonyapour closed 2 years ago

ndonyapour commented 2 years ago

Hello,

I am running into an issue when trying to load a CUDA ANI model in C++. I created and saved a TorchScript version of my ANI model in Python, and I have no issues loading the saved TorchScript model back in Python. However, I get the error below when I load it in my C++ program.

I would be grateful for any help getting this working.

Error:

terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():
Unknown type name '__torch__.torch.classes.cuaev.CuaevComputer':
Serialized   File "code/__torch__/torchani/aev.py", line 15
  training : bool
  _is_full_backward_hook : None
  cuaev_computer : __torch__.torch.classes.cuaev.CuaevComputer
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  cuaev_enabled : bool
  aev_length : Final[int] = 384

Aborted (core dumped)

Here is my code for reproducing the error: ani_cuda.zip
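
(For context: the Python load succeeds because importing torchani also loads the compiled cuaev extension, which registers the cuaev.CuaevComputer custom class; a bare C++ program that only calls torch::jit::load() has nothing registering that class, hence the error above. A minimal sketch of the Python side, with a placeholder model filename:)

import torch
import torchani  # importing torchani loads the compiled cuaev extension,
                 # registering torch.classes.cuaev.CuaevComputer

# Deserializing the scripted model works here because the custom class is
# already known to the TorchScript runtime ('cuda_model.pt' is a placeholder).
model = torch.jit.load('cuda_model.pt')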

isayev commented 2 years ago

You can't serialize a model with compiled custom CUDA kernels. This should work with the default use_cuda_extension=False. If you want to use the CUDA kernels, save the model first, re-load it (if needed), and enable them afterwards.
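
(A minimal sketch of that workflow, assuming one of the builtin torchani models; the attached code appears to build its AEVComputer directly, where the same use_cuda_extension flag applies:)

import torch
import torchani

# Build the model without the compiled CUDA kernels (use_cuda_extension
# defaults to False), so the scripted module carries no reference to the
# cuaev custom class and can be deserialized anywhere.
model = torchani.models.ANI2x(periodic_table_index=True)
torch.jit.script(model).save('ani2x_cpu.pt')

# Re-load later (in Python or C++) and move it to the desired device.
reloaded = torch.jit.load('ani2x_cpu.pt').to('cuda')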

yueyericardo commented 2 years ago

Hi, you should be able to load the cuaev.cpython-38-x86_64-linux-gnu.so under the torchani folder. I will look into this later today.
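
(In C++ the analogous step is linking against, or dlopen-ing, the cuaev library so the custom class is registered before torch::jit::load(). On the Python side the same explicit registration would look roughly like this, with the model filename as a placeholder:)

import torch

# Explicitly load the compiled cuaev library so its custom classes are
# registered before deserialization; the path is the one mentioned above,
# adjust it to your torchani installation.
torch.classes.load_library('torchani/cuaev.cpython-38-x86_64-linux-gnu.so')

model = torch.jit.load('cuda_model.pt')  # placeholder model filename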

ndonyapour commented 2 years ago

I have already tried saving a CPU model, loading it in C++, and then moving it to CUDA. In that case, it runs once and dies afterward. Here is the error:

successfully loaded the model
terminate called after throwing an instance of 'c10::Error'
  what():  Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory
Exception raised from checkDL at /home/nazanin/programs/pytorch/aten/src/ATen/DynamicLibrary.cpp:23 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7fa2c01a58b5 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7fa2c01a3021 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0xedf79e (0x7fa2b3c2179e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #3: at::DynamicLibrary::DynamicLibrary(char const*, char const*) + 0x62 (0x7fa2b3c21bf2 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xc3d4a2 (0x7fa2aa2a54a2 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xc3d8a1 (0x7fa2aa2a58a1 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #6: torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x1c8 (0x7fa2aa78fc58 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #7: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0x172a (0x7fa2aa794a8a in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x11399f4 (0x7fa2aa7a19f4 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #9: torch::jit::tensorexpr::CreateCodeGen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::tensorexpr::Stmt*, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7b (0x7fa2b6222a4b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::tensorexpr::TensorExprKernel::compile() + 0x907 (0x7fa2b62b78c7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #11: torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&) + 0x2e8 (0x7fa2b62b84e8 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x33f0a46 (0x7fa2b6132a46 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3433919 (0x7fa2b6175919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x3435fe7 (0x7fa2b6177fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7fa2b616e08b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x341f855 (0x7fa2b6161855 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x341fed9 (0x7fa2b6161ed9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x3433919 (0x7fa2b6175919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x3435fe7 (0x7fa2b6177fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #34: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7fa2b616e08b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x3424318 (0x7fa2b6166318 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x3449701 (0x7fa2b618b701 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x344a18e (0x7fa2b618c18e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x34227d4 (0x7fa2b61647d4 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #39: torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x39 (0x7fa2b5ed73c9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #40: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x143 (0x7fa2b5ee3253 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #41: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0xe8 (0x435ef6 in ./test_model)
frame #42: main + 0x1042 (0x42d60f in ./test_model)
frame #43: __libc_start_main + 0xf3 (0x7fa2a92c6493 in /lib64/libc.so.6)
frame #44: _start + 0x2e (0x42c3be in ./test_model)

Aborted (core dumped)

ani_cuda.zip

yueyericardo commented 2 years ago

Hi, please try whether this works: https://github.com/yueyericardo/cuaev_cpp

cc @IgnacioJPickering

ndonyapour commented 2 years ago

Thank you for taking the time to work on this issue. I cloned the repo and followed your instructions. This is the error:

[ 20%] Building CXX object cuaev/CMakeFiles/cuaev.dir/cuaev.cpp.o
[ 40%] Building CUDA object cuaev/CMakeFiles/cuaev.dir/aev.cu.o
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(799): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined

/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(810): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined

/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(833): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined

/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(957): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined

/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(1045): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined

yueyericardo commented 2 years ago

Hi, please check the README and build again: https://github.com/yueyericardo/cuaev_cpp. It currently works for CUDA and CPU when cuaev is not used. I'm still looking into the cuaev issue.

yueyericardo commented 2 years ago

cuaev also works now on my local machine. Let me know if it works on your side. Just curious, what are you using this for?

ndonyapour commented 2 years ago

It is working for me too! We are using this model in an OpenMM external force plugin to run MD simulations. Thank you for getting cuaev working. I was wondering why loading the CPU model and moving it to CUDA does not work, even though all of its functions are implemented in Torch?

yueyericardo commented 2 years ago

Hi, if you are not using cuaev, the CPU and GPU models should both be fine. If you are using cuaev (cuda_extension=True), the model has to be on the GPU: by the original design (https://github.com/aiqm/torchani/blob/master/torchani/cuaev/cuaev.cpp#L153), the aev_params tensors (EtaR_t, ShfZ_t, etc.) are assumed to already be on the GPU when cuda_extension=True, and moving these tensors to a different device is not supported in the current code. A workaround for now is to save separate models to disk and load the appropriate one.
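
(A sketch of that workaround, assuming the builtin ANI2x model; whether use_cuda_extension is accepted by the model constructor or only by AEVComputer depends on the torchani version, so treat the flag placement below as an assumption:)

import torch
import torchani

# CPU / pure-PyTorch model: leave use_cuda_extension at its default (False).
cpu_model = torchani.models.ANI2x(periodic_table_index=True)
torch.jit.script(cpu_model).save('models/cpu_model.pt')

# GPU model with cuaev: build it on the GPU before scripting, so the AEV
# parameter tensors (EtaR_t, ShfZ_t, ...) already live on the device.
# Passing use_cuda_extension here is an assumption; in some versions the
# flag belongs to AEVComputer instead.
gpu_model = torchani.models.ANI2x(periodic_table_index=True,
                                  use_cuda_extension=True).to('cuda')
torch.jit.script(gpu_model).save('models/cuda_model.pt')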

ndonyapour commented 2 years ago

I am not sure what the problem is, but the GPU model is not working even without cuaev. My goal was to have a CPU model and move it to the GPU without dealing with cuaev, but it runs just once. Unfortunately, I was not able to fix this issue. I attached the code; use save_ani_cpu.py to create the model. ani_cuda.zip

yueyericardo commented 2 years ago

Please check the commit at https://github.com/yueyericardo/cuaev_cpp/commit/0ebf55a527e92f1e3f56585ffbdd1d0137a43839; it should print out

First call:  [2, 5, 384]
Second call: [2, 5, 384]

ndonyapour commented 2 years ago

Thank you, it works! It seems that we have to compile the code with cuaev even when we are not using it; in the folder I attached, I am not using cuaev.

ndonyapour commented 2 years ago

Also, your model is still on the CPU and has not been moved to the GPU.

yueyericardo commented 2 years ago

Could you try https://github.com/yueyericardo/cuaev_cpp/blob/main/test_model.cpp#L22

model = torch::jit::load(argv[1], device);

instead of model.to(device)?
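
(The Python-side analogue, for comparison, is passing map_location to torch.jit.load instead of calling .to() after loading:)

import torch

# Deserialize directly onto the GPU rather than loading on the CPU and moving;
# 'models/cpu_model.pt' is the file produced by save_ani_cpu.py.
model = torch.jit.load('models/cpu_model.pt', map_location=torch.device('cuda'))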

ndonyapour commented 2 years ago

Same error. Would it be possible to try the code inside the attached folder below? ani_cuda.zip

yueyericardo commented 2 years ago

Your attached code worked fine. To make it a CPU model, I changed the following, and it also works:

diff --git a/models/cpu_model.pt b/models/cpu_model.pt
index 29688aa..1d90e3b 100644
Binary files a/models/cpu_model.pt and b/models/cpu_model.pt differ
diff --git a/save_ani_cpu.py b/save_ani_cpu.py
index a8ac783..1d32ef5 100644
--- a/save_ani_cpu.py
+++ b/save_ani_cpu.py
@@ -6,7 +6,7 @@ import torchani

 def save_cpu_aev():
-    device = torch.device('cuda')
+    device = torch.device('cpu')
     tolerance = 5e-5
     Rcr = 5.2000e+00
     Rca = 3.5000e+00
diff --git a/test_model.cpp b/test_model.cpp
index c656257..b75b89a 100755
--- a/test_model.cpp
+++ b/test_model.cpp
@@ -21,7 +21,7 @@ int main(int argc, const char* argv[]) {
   torch::Device device(torch::kCUDA);
   try {

-       model = torch::jit::load(argv[1]);
+       model = torch::jit::load(argv[1], device);
        std::cout << "successfully loaded the model\n";
   }

@@ -30,7 +30,7 @@ int main(int argc, const char* argv[]) {
        return -1;
   }

-  model.to(device);
+  // model.to(device);
   torch::Tensor coords = torch::tensor({
        {{0.03192167, 0.00638559, 0.01301679},
         {-0.83140486, 0.39370209, -0.26395324},

What kind of error did you get?

ndonyapour commented 2 years ago

Here is the error:

successfully loaded the model  
first call[2, 5, 560]
terminate called after throwing an instance of 'c10::Error'
  what():  Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory
Exception raised from checkDL at /home/nazanin/programs/pytorch/aten/src/ATen/DynamicLibrary.cpp:23 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f634b5908b5 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f634b58e021 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0xedf79e (0x7f633f00c79e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #3: at::DynamicLibrary::DynamicLibrary(char const*, char const*) + 0x62 (0x7f633f00cbf2 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xc3d4a2 (0x7f63356904a2 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xc3d8a1 (0x7f63356908a1 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #6: torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x1c8 (0x7f6335b7ac58 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #7: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0x172a (0x7f6335b7fa8a in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x11399f4 (0x7f6335b8c9f4 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #9: torch::jit::tensorexpr::CreateCodeGen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::tensorexpr::Stmt*, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7b (0x7f634160da4b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::tensorexpr::TensorExprKernel::compile() + 0x907 (0x7f63416a28c7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #11: torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&) + 0x2e8 (0x7f63416a34e8 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x33f0a46 (0x7f634151da46 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3433919 (0x7f6341560919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x3435fe7 (0x7f6341562fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7f634155908b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x341f855 (0x7f634154c855 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x341fed9 (0x7f634154ced9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x3433919 (0x7f6341560919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x3435fe7 (0x7f6341562fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #34: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7f634155908b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x3424318 (0x7f6341551318 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x3449701 (0x7f6341576701 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x344a18e (0x7f634157718e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x34227d4 (0x7f634154f7d4 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #39: torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x39 (0x7f63412c23c9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #40: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x143 (0x7f63412ce253 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #41: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0xe8 (0x436900 in ./test_model)
frame #42: main + 0x1106 (0x42db93 in ./test_model)
frame #43: __libc_start_main + 0xf3 (0x7f63346b1493 in /lib64/libc.so.6)
frame #44: _start + 0x2e (0x42c87e in ./test_model)

Aborted (core dumped)

yueyericardo commented 2 years ago

I have no idea; I'm using the environment created at https://github.com/yueyericardo/cuaev_cpp/blob/main/README.md#build-instruction

Are you using the same? CUDA 11.3, latest torch and torchani?

ndonyapour commented 2 years ago

The only difference is that I'm not compiling with cuaev.

yueyericardo commented 2 years ago

I'm not sure, but

what():  Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory

this seems to be linked against the CUDA 11.0 library instead of CUDA 11.3?

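(On the Python side, the CUDA toolkit a given torch build targets can be checked directly; for a libtorch build, inspecting which libnvrtc the libraries link against, e.g. with ldd, gives the same information:)

import torch

# CUDA toolkit version this PyTorch build was compiled against, e.g. '11.3';
# a mismatch with the installed CUDA explains a missing libnvrtc-*.so.11.0.
print(torch.version.cuda)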

ndonyapour commented 2 years ago

OK, I get the following error when I compile it with the LibTorch that comes with the PyTorch Conda installation.

successfully loaded the model
first call[2, 5, 560]
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchani/aev/___torch_mangle_0.py", line 117, in forward
      _38 = torch.cumsum(_36, 0, dtype=None, out=_37)
      _39 = torch.index_select(cumsum, 0, pair_indices)
      sorted_local_index120 = torch.add_(sorted_local_index12, _39, alpha=1)
                              ~~~~~~~~~~ <--- HERE
      _40 = annotate(List[Optional[Tensor]], [sorted_local_index120])
      local_index12 = torch.index(rev_indices, _40)

Traceback of TorchScript, original code (most recent call last):
  File "/home/nazanin/programs/torchani/torchani/aev.py", line 254, in forward
    mask = (torch.arange(intra_pair_indices.shape[2], device=ai1.device) < pair_sizes.unsqueeze(1)).flatten()
    sorted_local_index12 = intra_pair_indices.flatten(1, 2)[:, mask]
    sorted_local_index12 += cumsum_from_zero(counts).index_select(0, pair_indices)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

    # unsort result from last part
RuntimeError: The size of tensor a (46) must match the size of tensor b (51) at non-singleton dimension 1

Aborted (core dumped)

ndonyapour commented 2 years ago

It seems that I've been compiling my code with CUDA 11.0 instead of CUDA 11.3. Thank you!

yueyericardo commented 2 years ago

Great! You're welcome.