Closed: ndonyapour closed this issue 2 years ago.
You can't serialize a model with compiled custom CUDA kernels. This should work with the default use_cuda_extension=False.
If you want to use them, save the model first, re-load it (if needed), and enable the extension afterwards.
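A minimal sketch of that save-then-relocate workflow, using a toy nn.Module as a stand-in for the ANI model (which is what carries the compiled cuaev kernels that block serialization); names and paths here are illustrative, not the actual torchani code:

```python
# Sketch: script and save the model on CPU, then move it to the target
# device only after deserialization. "Toy" is a hypothetical stand-in.
import os
import tempfile

import torch


class Toy(torch.nn.Module):  # stand-in for the ANI model
    def forward(self, x):
        return x * 2.0


path = os.path.join(tempfile.mkdtemp(), "toy_cpu.pt")
torch.jit.script(Toy()).save(path)  # serialize without any CUDA state

loaded = torch.jit.load(path, map_location="cpu")
device = "cuda" if torch.cuda.is_available() else "cpu"
loaded.to(device)  # relocate only after loading
out = loaded(torch.ones(2, device=device))
print(out.tolist())  # [2.0, 2.0]
```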
Hi, you should be able to load the cuaev.cpython-38-x86_64-linux-gnu.so under the torchani folder. I will look at this later today.
I have already tried saving a CPU model and loading it in C++, then moving it to CUDA. In that case, it runs once and dies afterwards. Here is the error:
successfully loaded the model
terminate called after throwing an instance of 'c10::Error'
what(): Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory
Exception raised from checkDL at /home/nazanin/programs/pytorch/aten/src/ATen/DynamicLibrary.cpp:23 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7fa2c01a58b5 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7fa2c01a3021 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0xedf79e (0x7fa2b3c2179e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #3: at::DynamicLibrary::DynamicLibrary(char const*, char const*) + 0x62 (0x7fa2b3c21bf2 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xc3d4a2 (0x7fa2aa2a54a2 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xc3d8a1 (0x7fa2aa2a58a1 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #6: torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x1c8 (0x7fa2aa78fc58 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #7: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0x172a (0x7fa2aa794a8a in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x11399f4 (0x7fa2aa7a19f4 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #9: torch::jit::tensorexpr::CreateCodeGen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::tensorexpr::Stmt*, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7b (0x7fa2b6222a4b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::tensorexpr::TensorExprKernel::compile() + 0x907 (0x7fa2b62b78c7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #11: torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&) + 0x2e8 (0x7fa2b62b84e8 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x33f0a46 (0x7fa2b6132a46 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3433919 (0x7fa2b6175919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x3435fe7 (0x7fa2b6177fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7fa2b616e08b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x341f855 (0x7fa2b6161855 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x341fed9 (0x7fa2b6161ed9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x3433919 (0x7fa2b6175919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x3435fe7 (0x7fa2b6177fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x343686b (0x7fa2b617886b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x34369ff (0x7fa2b61789ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x3435fcc (0x7fa2b6177fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x343651a (0x7fa2b617851a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #34: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7fa2b616e08b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x3424318 (0x7fa2b6166318 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x3449701 (0x7fa2b618b701 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x344a18e (0x7fa2b618c18e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x34227d4 (0x7fa2b61647d4 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #39: torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x39 (0x7fa2b5ed73c9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #40: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x143 (0x7fa2b5ee3253 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #41: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0xe8 (0x435ef6 in ./test_model)
frame #42: main + 0x1042 (0x42d60f in ./test_model)
frame #43: __libc_start_main + 0xf3 (0x7fa2a92c6493 in /lib64/libc.so.6)
frame #44: _start + 0x2e (0x42c3be in ./test_model)
Aborted (core dumped)
Hi, please try whether this works: https://github.com/yueyericardo/cuaev_cpp
cc @IgnacioJPickering
Thank you for taking the time to work on this issue. I cloned the repo and followed the instructions you wrote. This is the error:
[ 20%] Building CXX object cuaev/CMakeFiles/cuaev.dir/cuaev.cpp.o
[ 40%] Building CUDA object cuaev/CMakeFiles/cuaev.dir/aev.cu.o
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(799): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(810): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(833): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(957): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined
/projects/cpp_codes/cuaev_cpp/cuaev/aev.cu(1045): error: identifier "C10_CUDA_KERNEL_LAUNCH_CHECK" is undefined
Hi, please check the README and build again: https://github.com/yueyericardo/cuaev_cpp. It currently works for both CUDA and CPU when not using cuaev. I'm still looking into the cuaev issue.
cuaev also works now on my local machine. Let me know if it works on your side. Just curious: what are you using this for?
It is working for me too! We are using this model as an OpenMM external force plugin to run MD simulations. Thank you for getting cuaev working. I was wondering why loading the CPU model and moving it to CUDA does not work, even though all of its functions are implemented in Torch?
Hi, if you are not using cuaev, both the CPU and GPU models should be fine. If you are using cuaev (cuda_extension=True), the model has to stay on the GPU: from the beginning of the design (https://github.com/aiqm/torchani/blob/master/torchani/cuaev/cuaev.cpp#L153), the aev_params tensors (EtaR_t, ShfZ_t, etc.) are assumed to already be on the GPU when cuda_extension==True, and the current code does not support moving these tensors between devices. A workaround for now is to save different models to disk and load the appropriate one for each device.
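The per-device workaround could look like the following hypothetical helper (the function and file names are made up for illustration; the point is simply to keep one serialized model per device and select at load time):

```python
# Hypothetical helper: map a target device string to the serialized
# model saved for that device.
def model_path_for(device: str) -> str:
    paths = {
        "cpu": "models/ani_cpu.pt",          # saved with cuda_extension=False
        "cuda": "models/ani_cuda_cuaev.pt",  # saved with cuda_extension=True
    }
    key = device.split(":")[0]  # "cuda:0" -> "cuda"
    if key not in paths:
        raise ValueError(f"no saved model for device {device!r}")
    return paths[key]


print(model_path_for("cuda:0"))  # models/ani_cuda_cuaev.pt
```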
I am not sure what the problem is, but the GPU model is not working even without cuaev. My goal was to have a CPU model and move it to the GPU without dealing with cuaev, but it runs just once. Unfortunately, I was not able to fix this issue. I have attached the code; use save_ani_cpu.py to create the model.
ani_cuda.zip
Please check the commit at https://github.com/yueyericardo/cuaev_cpp/commit/0ebf55a527e92f1e3f56585ffbdd1d0137a43839; it should print:
First call: [2, 5, 384]
Second call: [2, 5, 384]
Thank you! It works! It seems that we have to compile the code with cuaev even when we are not using it. In the folder I attached, I am not using cuaev.
Also, your model is still on the CPU and has not been moved to the GPU.
Could you try changing https://github.com/yueyericardo/cuaev_cpp/blob/main/test_model.cpp#L22 to
model = torch::jit::load(argv[1], device);
instead of model.to(device)?
Same error. Would it be possible to try the code inside the attached folder below? ani_cuda.zip
Your attached code worked fine. To make it a CPU model, I changed the following; it also works:
diff --git a/models/cpu_model.pt b/models/cpu_model.pt
index 29688aa..1d90e3b 100644
Binary files a/models/cpu_model.pt and b/models/cpu_model.pt differ
diff --git a/save_ani_cpu.py b/save_ani_cpu.py
index a8ac783..1d32ef5 100644
--- a/save_ani_cpu.py
+++ b/save_ani_cpu.py
@@ -6,7 +6,7 @@ import torchani
def save_cpu_aev():
- device = torch.device('cuda')
+ device = torch.device('cpu')
tolerance = 5e-5
Rcr = 5.2000e+00
Rca = 3.5000e+00
diff --git a/test_model.cpp b/test_model.cpp
index c656257..b75b89a 100755
--- a/test_model.cpp
+++ b/test_model.cpp
@@ -21,7 +21,7 @@ int main(int argc, const char* argv[]) {
torch::Device device(torch::kCUDA);
try {
- model = torch::jit::load(argv[1]);
+ model = torch::jit::load(argv[1], device);
std::cout << "successfully loaded the model\n";
}
@@ -30,7 +30,7 @@ int main(int argc, const char* argv[]) {
return -1;
}
- model.to(device);
+ // model.to(device);
torch::Tensor coords = torch::tensor({
{{0.03192167, 0.00638559, 0.01301679},
{-0.83140486, 0.39370209, -0.26395324},
What kind of error did you get?
Here is the error:
successfully loaded the model
first call[2, 5, 560]
terminate called after throwing an instance of 'c10::Error'
what(): Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory
Exception raised from checkDL at /home/nazanin/programs/pytorch/aten/src/ATen/DynamicLibrary.cpp:23 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f634b5908b5 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f634b58e021 in /home/nazanin/programs/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0xedf79e (0x7f633f00c79e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #3: at::DynamicLibrary::DynamicLibrary(char const*, char const*) + 0x62 (0x7f633f00cbf2 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xc3d4a2 (0x7f63356904a2 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xc3d8a1 (0x7f63356908a1 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #6: torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x1c8 (0x7f6335b7ac58 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #7: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0x172a (0x7f6335b7fa8a in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x11399f4 (0x7f6335b8c9f4 in /home/nazanin/programs/libtorch/lib/libtorch_cuda.so)
frame #9: torch::jit::tensorexpr::CreateCodeGen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::tensorexpr::Stmt*, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7b (0x7f634160da4b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #10: torch::jit::tensorexpr::TensorExprKernel::compile() + 0x907 (0x7f63416a28c7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #11: torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&) + 0x2e8 (0x7f63416a34e8 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x33f0a46 (0x7f634151da46 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x3433919 (0x7f6341560919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x3435fe7 (0x7f6341562fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #20: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7f634155908b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x341f855 (0x7f634154c855 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x341fed9 (0x7f634154ced9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0x3433919 (0x7f6341560919 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x3435fe7 (0x7f6341562fe7 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x343686b (0x7f634156386b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x34369ff (0x7f63415639ff in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x3435fcc (0x7f6341562fcc in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x343651a (0x7f634156351a in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #34: torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) + 0x1d2b (0x7f634155908b in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x3424318 (0x7f6341551318 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x3449701 (0x7f6341576701 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x344a18e (0x7f634157718e in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x34227d4 (0x7f634154f7d4 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #39: torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x39 (0x7f63412c23c9 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #40: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0x143 (0x7f63412ce253 in /home/nazanin/programs/libtorch/lib/libtorch_cpu.so)
frame #41: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0xe8 (0x436900 in ./test_model)
frame #42: main + 0x1106 (0x42db93 in ./test_model)
frame #43: __libc_start_main + 0xf3 (0x7f63346b1493 in /lib64/libc.so.6)
frame #44: _start + 0x2e (0x42c87e in ./test_model)
Aborted (core dumped)
I have no idea; I'm using the environment created per https://github.com/yueyericardo/cuaev_cpp/blob/main/README.md#build-instruction
Are you using the same: CUDA 11.3 with the latest torch and torchani?
The only difference is that I'm not compiling with cuaev.
I'm not sure, but
what(): Error in dlopen or dlsym: libnvrtc-f4909b87.so.11.0: cannot open shared object file: No such file or directory
suggests your build is linked against the CUDA 11.0 library instead of CUDA 11.3.
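One quick way to sanity-check which NVRTC library the dynamic loader can resolve on a machine is a small stdlib probe (this only inspects the loader's view and is not part of cuaev_cpp; on a machine without CUDA it simply returns None):

```python
# Probe: ask the platform's loader conventions which libnvrtc it would
# resolve, to spot a CUDA 11.0 vs 11.3 mix-up.
import ctypes.util

resolved = ctypes.util.find_library("nvrtc")
print(resolved)  # e.g. "libnvrtc.so.11.3", or None when CUDA is absent
```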
OK. I get the following error when I compile it against the LibTorch that ships with the PyTorch conda installation:
successfully loaded the model
first call[2, 5, 560]
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torchani/aev/___torch_mangle_0.py", line 117, in forward
_38 = torch.cumsum(_36, 0, dtype=None, out=_37)
_39 = torch.index_select(cumsum, 0, pair_indices)
sorted_local_index120 = torch.add_(sorted_local_index12, _39, alpha=1)
~~~~~~~~~~ <--- HERE
_40 = annotate(List[Optional[Tensor]], [sorted_local_index120])
local_index12 = torch.index(rev_indices, _40)
Traceback of TorchScript, original code (most recent call last):
File "/home/nazanin/programs/torchani/torchani/aev.py", line 254, in forward
mask = (torch.arange(intra_pair_indices.shape[2], device=ai1.device) < pair_sizes.unsqueeze(1)).flatten()
sorted_local_index12 = intra_pair_indices.flatten(1, 2)[:, mask]
sorted_local_index12 += cumsum_from_zero(counts).index_select(0, pair_indices)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
# unsort result from last part
RuntimeError: The size of tensor a (46) must match the size of tensor b (51) at non-singleton dimension 1
Aborted (core dumped)
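For context on the traceback above: torchani's cumsum_from_zero helper is an exclusive cumulative sum, used to offset each molecule's local atom-pair indices. A pure-Python sketch of the same idea (not the actual torchani implementation, which operates on tensors) is:

```python
# Exclusive cumulative sum: offsets start at 0, and the last count is
# dropped, so the result has the same length as the input.
from itertools import accumulate


def cumsum_from_zero(counts):
    return [0] + list(accumulate(counts))[:-1]


print(cumsum_from_zero([2, 3, 4]))  # [0, 2, 5]
```

The RuntimeError above indicates that this offsets tensor and sorted_local_index12 ended up with mismatched lengths (46 vs 51) along dimension 1, so the in-place add fails.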
It seems that I've been compiling my code with CUDA 11.0 instead of CUDA 11.3. Thank you!
Great! You're welcome.
Hello,
I am running into an issue when trying to load an ANI CUDA model in C++. I created and saved a TorchScript version of my ANI model in Python. Loading the saved TorchScript model in Python works fine, but I get the error below when I load it in my C++ program.
I'd be grateful for any help getting this working.
Error:
Here is my code for reproducing the error: ani_cuda.zip