Open guandailu opened 1 year ago
Hi, from the error I wonder whether your GPU has enough memory to accommodate our model. Could you try reducing the batch size? We usually run on A100s with 40 GB of GPU memory. Sorry for the late response; happy to troubleshoot further.
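If it helps, here is a rough sketch of the "reduce the batch size" idea: halve the batch size until one training step runs without the kind of RuntimeError shown in the traceback. The names `find_fitting_batch_size`, `run_step`, and `fake_step` are hypothetical and not part of selene_sdk or PyTorch; `run_step` stands in for whatever runs one forward/backward pass in your setup. Matching on the error message text is an assumption based on the message in this issue and the usual "out of memory" wording.

```python
from typing import Callable


def find_fitting_batch_size(
    run_step: Callable[[int], None], start: int, minimum: int = 1
) -> int:
    """Halve the batch size until run_step(batch_size) no longer fails.

    run_step is a hypothetical callable that runs one training step at the
    given batch size and raises RuntimeError when the GPU cannot handle it.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            message = str(err).lower()
            # Assumption: only retry on errors that look memory-related;
            # anything else is re-raised unchanged.
            if "cudnn" not in message and "out of memory" not in message:
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} worked")


def fake_step(batch_size: int) -> None:
    """Stand-in for a real training step: pretends that 16 is the limit."""
    if batch_size > 16:
        raise RuntimeError(
            "Unable to find a valid cuDNN algorithm to run convolution"
        )


if __name__ == "__main__":
    print(find_fitting_batch_size(fake_step, start=64))
```

With the stand-in step above, the search tries 64, 32, and then 16. In a real run you would then put the batch size that worked into your training configuration rather than searching on every launch.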
Command:
python -u ./selene/selene_sdk/cli.py train.yml --lr=0.1
Error information:
Traceback (most recent call last):
File "train.py", line 11, in <module>
parse_configs_and_run(configs, lr=0.01)
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 344, in parse_configs_and_run
execute(operations, configs, current_run_output_dir)
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 188, in execute
train_model.train_and_validate()
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 417, in train_and_validate
self.train()
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 453, in train
loss.backward()
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /opt/conda/conda-bld/pytorch_1591914855613/work/aten/src/ATen/native/cudnn/Conv.cpp:693)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x14c3d6230b5e in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd5d68d (0x14c3d775d68d in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd5e1d1 (0x14c3d775e1d1 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6220b (0x14c3d776220b in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0xb2 (0x14c3d7762762 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xdc9280 (0x14c3d77c9280 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xe0db18 (0x14c3d780db18 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x4fa (0x14c3d7763dfa in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0xdc95ab (0x14c3d77c95ab in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xe0db74 (0x14c3d780db74 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0x29dee26 (0x14c4043dee26 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x2a2e634 (0x14c40442e634 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x378 (0x14c403ff6ff8 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2ae7df5 (0x14c4044e7df5 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x14c4044e50f3 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x14c4044e5ed2 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int) + 0x39 (0x14c4044de549 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x14c407f0a638 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: + 0xd3e79 (0x14c41efd3e79 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/matplotlib/../../../libstdc++.so.6)
frame #19: + 0x94b43 (0x14c42c894b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x126a00 (0x14c42c926a00 in /lib/x86_64-linux-gnu/libc.so.6)