FunctionLab / sei-framework

code to run sei and obtain sei and sequence class predictions

model training error #19

Open guandailu opened 1 year ago

guandailu commented 1 year ago

Command:

python -u ./selene/selene_sdk/cli.py train.yml --lr=0.1

Error information:

Traceback (most recent call last):
  File "train.py", line 11, in <module>
    parse_configs_and_run(configs, lr=0.01)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 344, in parse_configs_and_run
    execute(operations, configs, current_run_output_dir)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/utils/config_utils.py", line 188, in execute
    train_model.train_and_validate()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 417, in train_and_validate
    self.train()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/selene_sdk/train_model.py", line 453, in train
    loss.backward()
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /opt/conda/conda-bld/pytorch_1591914855613/work/aten/src/ATen/native/cudnn/Conv.cpp:693)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x14c3d6230b5e in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd5d68d (0x14c3d775d68d in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd5e1d1 (0x14c3d775e1d1 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6220b (0x14c3d776220b in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::cudnn_convolution_backward_input(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0xb2 (0x14c3d7762762 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xdc9280 (0x14c3d77c9280 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xe0db18 (0x14c3d780db18 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x4fa (0x14c3d7763dfa in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0xdc95ab (0x14c3d77c95ab in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xe0db74 (0x14c3d780db74 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0x29dee26 (0x14c4043dee26 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x2a2e634 (0x14c40442e634 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x378 (0x14c403ff6ff8 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2ae7df5 (0x14c4044e7df5 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x14c4044e50f3 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x14c4044e5ed2 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int) + 0x39 (0x14c4044de549 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x14c407f0a638 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: + 0xd3e79 (0x14c41efd3e79 in /home/user/.conda/envs/selene_sdk/lib/python3.7/site-packages/matplotlib/../../../libstdc++.so.6)
frame #19: + 0x94b43 (0x14c42c894b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x126a00 (0x14c42c926a00 in /lib/x86_64-linux-gnu/libc.so.6)

kathyxchen commented 1 year ago

Hi, the error makes me wonder whether your GPU has enough memory to accommodate our model. Could you try reducing the batch size? We usually run on A100s with 40 GB of GPU memory. Sorry for the late response; happy to troubleshoot further.
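
For anyone who wants to automate the "reduce the batch size" suggestion, here is a minimal sketch of a halving search. It is not part of selene_sdk or sei-framework: `find_workable_batch_size`, `looks_like_memory_error`, and the `try_step` callback are hypothetical names made up for illustration. The list of error substrings is also an assumption, based on the message in this issue plus the usual "out of memory" wording. `try_step` stands in for whatever runs one forward/backward pass at a given batch size in your setup.

```python
from typing import Callable

# Substrings assumed to indicate GPU memory pressure in a RuntimeError message.
# This list is illustrative, not exhaustive.
MEMORY_ERROR_HINTS = (
    "out of memory",
    "Unable to find a valid cuDNN algorithm",
)


def looks_like_memory_error(exc: RuntimeError) -> bool:
    """Return True if the error message matches one of the assumed hints."""
    message = str(exc)
    return any(hint in message for hint in MEMORY_ERROR_HINTS)


def find_workable_batch_size(
    try_step: Callable[[int], None], start: int, minimum: int = 1
) -> int:
    """Halve the batch size until try_step(batch_size) stops failing.

    try_step is a hypothetical callback that runs one training step at the
    given batch size and raises RuntimeError if the step fails.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as exc:
            # Unrelated errors are re-raised rather than hidden by the search.
            if not looks_like_memory_error(exc):
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} completed a step")
```

Once a size works, it would need to be written back into the training config by hand. Where the batch size lives in `train.yml` depends on your configuration, so check your own file rather than relying on this sketch.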