Twice now, a training experiment has failed at epoch 80, right at the beginning of the phase 3 loss. The error is below:
Comet experiments:
https://www.comet.ml/permobil-research/fastdepth/a4897c086bfe40b1a630df6792d17670?experiment-tab=chart&showOutliers=true&smoothing=0&transformY=smoothing&xAxis=step
https://www.comet.ml/permobil-research/fastdepth/0a9ebcf8078a488487f39b2aff633339?experiment-tab=chart&showOutliers=true&smoothing=0&transformY=smoothing&xAxis=step
The odd part is that I cannot reproduce this error. I have tried on two different machines, with different starting epochs and training folders. In one test, the only thing I changed was the start epoch, from 80 to 8, so I wouldn't have to wait 2 days to check, and it worked fine. Without a way to reproduce the error, I'm having trouble debugging it.