appliedinnovation / fast-depth

ICRA 2019 "FastDepth: Fast Monocular Depth Estimation on Embedded Systems"
MIT License

Training fails on phase 3 loss at epoch 80 #20

Open alexbarnett12 opened 3 years ago

alexbarnett12 commented 3 years ago

Twice now, a training experiment has failed right when the phase 3 loss kicks in at epoch 80. The error is below:

Traceback (most recent call last):
  File "train.py", line 406, in <module>
    main(args)
  File "train.py", line 390, in main
    model, criterion, optimizer, scheduler, experiment)
  File "train.py", line 254, in train
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/comet_ml/monkey_patching.py", line 293, in wrapper
    return_value = original(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: select(): index 0 out of range for tensor of size [0, 4] at dimension 0
Exception raised from select at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/ATen/native/TensorShape.cpp:889 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f48ae2bc77d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::select(at::Tensor const&, long, long) + 0x347 (0x7f48e1334ff7 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0xfe3789 (0x7f48e1719789 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfd6a83 (0x7f48e170ca83 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::select(at::Tensor const&, long, long) + 0xe0 (0x7f48e163f0f0 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b62186 (0x7f48e3298186 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xfd6a83 (0x7f48e170ca83 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::select(long, long) const + 0xe0 (0x7f48e17ca240 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2a6d69d (0x7f48e31a369d in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::MaxBackward1::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x188 (0x7f48e31bd0d8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x30d1017 (0x7f48e3807017 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f48e3802860 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f48e3803401 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f48e37fb579 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f48e7b2a99a in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0xc819d (0x7f48fe95719d in /opt/conda/bin/../lib/libstdc++.so.6)
frame #16: <unknown function> + 0x76db (0x7f490ec776db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #17: clone + 0x3f (0x7f490e9a0a3f in /lib/x86_64-linux-gnu/libc.so.6)

Comet experiments:
https://www.comet.ml/permobil-research/fastdepth/a4897c086bfe40b1a630df6792d17670?experiment-tab=chart&showOutliers=true&smoothing=0&transformY=smoothing&xAxis=step
https://www.comet.ml/permobil-research/fastdepth/0a9ebcf8078a488487f39b2aff633339?experiment-tab=chart&showOutliers=true&smoothing=0&transformY=smoothing&xAxis=step
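Reading the traceback, backward is going through a max node (`MaxBackward1`) and then calling `select()` on a tensor of size `[0, 4]`, i.e. something the loss reduces with a max ended up empty for that particular batch. If the phase 3 loss takes a max over a masked or filtered set of values, a rare batch where that selection comes back empty would explain both the crash and why it is hard to reproduce. A minimal guard along these lines (a sketch only; `pred`, `target`, and `valid_mask` are placeholder names, not the actual variables in train.py) would be one way to rule that out:

```python
import torch

# Hypothetical sketch, not the actual phase 3 criterion: guard a max-based
# loss term against an empty selection.
def max_error_term(pred, target, valid_mask):
    diff = (pred - target).abs()[valid_mask]  # can be empty for a rare batch
    if diff.numel() == 0:
        # Keep the term connected to the graph so loss.backward() still works,
        # but contribute nothing when there are no valid elements.
        return (pred * 0.0).sum()
    return diff.max()
```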

The odd part is that I cannot seem to reproduce this error. I have tried it on two different machines, with different starting epochs and training folders. In one test the only thing I changed was the phase start from epoch 80 to epoch 8, so I wouldn't have to wait two days to check, and it trained fine. Without being able to reproduce the error, I'm having trouble debugging it.
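Since the failure looks data-dependent, one option is to catch the exception around the backward call (train.py line 254 in the traceback) and dump the offending batch so it can be replayed offline instead of waiting two days for the next crash. A rough sketch, with `input` and `target` as placeholders for whatever the training loop actually names its tensors:

```python
import torch

# Hypothetical sketch: replace the bare loss.backward() call inside the
# training loop so the batch that triggers the RuntimeError is saved.
try:
    loss.backward()
except RuntimeError:
    # Save the failing batch for offline replay; path and keys are placeholders.
    torch.save({"input": input.detach().cpu(),
                "target": target.detach().cpu()},
               "failed_batch.pt")
    raise
```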