LambdaLabsML / examples

Deep Learning Examples
MIT License
805 stars, 103 forks

Stopped during 64th epoch with no error message #47

Closed. rorycochrane closed this issue 1 year ago

rorycochrane commented 1 year ago

I tried running the code on a Lambda Labs A100 instance, and it stopped in the middle of the 64th epoch. There was no error message or any other output, so the problem may have been with the instance rather than the code, possibly a memory issue.
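A silent stop like this is hard to diagnose after the fact, so it can help to leave a trail before rerunning. Below is a minimal sketch using only the Python standard library, not anything from this repository: the function names are hypothetical, and it assumes a Linux host, where `ru_maxrss` is reported in KiB. It only covers host memory and fatal signals that Python can see. If the kernel's OOM killer ends the process with SIGKILL, nothing here will fire, and the kernel log (for example via `dmesg`) is the place to look.

```python
import faulthandler
import resource
import sys


def enable_crash_diagnostics(stream=sys.stderr):
    """Dump Python tracebacks if the process hits a fatal signal such as SIGSEGV."""
    # The stream must be a real file with a file descriptor (stderr normally is).
    faulthandler.enable(file=stream)


def peak_host_memory_mib():
    """Peak resident set size of this process in MiB (assumes Linux, where ru_maxrss is in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_epoch_memory(epoch, stream=sys.stderr):
    """Write one flushed line per epoch so the last line shows how far training got."""
    line = f"epoch={epoch} peak_host_rss_mib={peak_host_memory_mib():.1f}"
    print(line, file=stream, flush=True)
    return line
```

Calling `enable_crash_diagnostics()` once at startup and `log_epoch_memory(epoch)` at the end of each epoch would show whether host memory was climbing steadily before the stop.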

I made some small changes, so those may have been the cause. I modified these settings to accommodate the different GPU size:

BATCH_SIZE = 2
N_GPUS = 1
ACCUMULATE_BATCHES = 4
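For context, here is a quick sketch of what these three settings imply for the effective batch size. It assumes the common convention that each optimizer step sees the per-GPU batch size times the number of GPUs times the number of accumulated batches. The helper name is hypothetical, and the training script may combine the values differently.

```python
def effective_batch_size(batch_size, n_gpus, accumulate_batches):
    """Samples contributing to one optimizer step, under the assumed convention."""
    return batch_size * n_gpus * accumulate_batches


# Values from the modified settings in this issue.
BATCH_SIZE = 2
N_GPUS = 1
ACCUMULATE_BATCHES = 4

print(effective_batch_size(BATCH_SIZE, N_GPUS, ACCUMULATE_BATCHES))  # prints 8
```

If the original configuration gave a different product, the learning rate schedule may behave differently than intended, though that alone would not normally cause a silent stop.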