peiyunh closed this issue 4 years ago
Also, what FPS should we expect to reach during training? With num_workers=0, I get about 0.4 FPS for training and 0.8 FPS for validation. At this speed, phase 1 of training (256 epochs) takes 6 days to finish. Does that sound too slow? In your experience, how long does each phase take to train? Thanks!
What pytorch version do you have?
I tried both 1.5.1 and 1.0.0. Both run into this error when num_workers is set above zero.
That's odd. I was using Ubuntu 14.04/16.04 with Python 3.5 + PyTorch 1.2 when working on this project and never ran into this issue. Maybe try upgrading Python to 3.6?
Would I need a new egg file for using Python 3.6?
The 3.5 egg should be compatible with 3.6
Thanks so much @dianchen96. Switching to Python 3.6 solves the issue. I am now able to train with num_workers=8. There seems to be a 4x speedup. This means phase 1 training will likely take 1.5 days to finish. Does that sound right to you?
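For reference, the 1.5-day figure follows from dividing the 6-day estimate by the observed 4x speedup. Below is a rough sketch of that arithmetic; the function name and its parameters are made up for illustration, and it assumes training time scales linearly with the number of epochs and inversely with throughput, which may not hold exactly in practice.

```python
def estimated_days(baseline_days, speedup, epochs, baseline_epochs):
    """Scale a measured training time by epoch count and throughput speedup.

    Assumes linear scaling in epochs and inverse scaling in throughput
    (an approximation, not a measured property of this training setup).
    """
    if speedup <= 0 or baseline_epochs <= 0:
        raise ValueError("speedup and baseline_epochs must be positive")
    return baseline_days * (epochs / baseline_epochs) / speedup


# 256 epochs took about 6 days with num_workers=0; num_workers=8 gave about 4x.
print(estimated_days(6.0, 4.0, epochs=256, baseline_epochs=256))  # 1.5
```

The same helper can be reused to estimate a shorter run by passing a smaller epochs value.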
That looks good. I'd recommend first training a phase 1 model for fewer epochs (e.g. 32) and seeing how it works. The phase 1 numbers listed on index.md come from a 32-epoch model. P.S. You might need to slightly tune the steering PID parameters.
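To make the PID remark concrete, here is a minimal, generic PID controller sketch. It is not the controller used in this repository; the class name, the gains, the time step, and the assumption that the steering command is normalized to [-1, 1] are all illustrative placeholders.

```python
class PIDController:
    """Textbook discrete PID controller (illustrative, not the repo's code)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self._integral = 0.0
        self._prev_error = None

    def step(self, error):
        # Accumulate the integral term and estimate the derivative
        # from the previous error (zero on the first call).
        self._integral += error * self.dt
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / self.dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def clip_steer(value, limit=1.0):
    """Clamp a steering command to [-limit, limit] (assumed normalized range)."""
    return max(-limit, min(limit, value))


# Hypothetical gains; tuning means adjusting kp, ki, and kd for your model.
pid = PIDController(kp=1.0, ki=0.1, kd=0.05, dt=0.05)
steer = clip_steer(pid.step(0.2))  # 0.2 is a made-up heading error
```

Tuning here would mean changing kp, ki, and kd until the agent tracks its target without oscillating; a model trained for fewer epochs may need different values than the defaults.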
Great to know. Will try that.
Do you by any chance plan to release a checkpoint model for each phase? I am very interested in reproducing the performance and running diagnostics on the intermediate models. Having a reference would be really helpful for me to make sure I am on the right track.
We have released our birdview and phase 2 checkpoints. We do not benchmark the phase 0 model, as its sole purpose is to make sure the gradients for phase 1 do not go NaN (due to the reprojection). For phase 1 model performance, you can refer to the numbers on index.md.
Hi @dianchen96 and @bradyz
I am at stage 0 of training an image agent. There is a runtime error that looks related to a PyTorch bug with Python 3.5. I am able to train once I set num_workers=0, but I am wondering if you know of a workaround that does not sacrifice training speed. Thanks! Please find the error messages below.