Closed by boundles 6 years ago
Hi! What's the command you are using to run the code? Can you show the full text of the error you are getting?
----- TRAINING - EPOCH 1 -----
LEARNING RATE: 0.0005
loss: 0.2037 (epoch: 1, step: 0) // Avg time/img: 1.2726 s
loss: 0.1052 (epoch: 1, step: 50) // Avg time/img: 0.0670 s
loss: 0.08723 (epoch: 1, step: 100) // Avg time/img: 0.0545 s
loss: 0.07954 (epoch: 1, step: 150) // Avg time/img: 0.0504 s
loss: 0.07391 (epoch: 1, step: 200) // Avg time/img: 0.0483 s
loss: 0.06965 (epoch: 1, step: 250) // Avg time/img: 0.0471 s
loss: 0.06608 (epoch: 1, step: 300) // Avg time/img: 0.0462 s
loss: 0.06343 (epoch: 1, step: 350) // Avg time/img: 0.0456 s
loss: 0.06115 (epoch: 1, step: 400) // Avg time/img: 0.0451 s
loss: 0.05919 (epoch: 1, step: 450) // Avg time/img: 0.0448 s
loss: 0.05704 (epoch: 1, step: 500) // Avg time/img: 0.0445 s
loss: 0.05525 (epoch: 1, step: 550) // Avg time/img: 0.0443 s
loss: 0.05384 (epoch: 1, step: 600) // Avg time/img: 0.0440 s
loss: 0.05241 (epoch: 1, step: 650) // Avg time/img: 0.0439 s
loss: 0.05142 (epoch: 1, step: 700) // Avg time/img: 0.0437 s
loss: 0.05007 (epoch: 1, step: 750) // Avg time/img: 0.0436 s
Traceback (most recent call last):
File "main.py", line 508, in
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --savedir erfnet_training --datadir /home/darren.wy/corpus/road_lane/ --num-epochs 100 --batch-size 16 --model "erfnet" --decoder --pretrainedEncoder "../save/erfnet_training/model_best_enc.pth.tar"
I can't reproduce this error with my code, since you seem to be using different data and code, but here are some things I would try. Do you also get this error if you use a single GPU with a smaller batch? Have you tried PyTorch 0.3? (I had some problems running the code with PyTorch 0.4 built from source, so I would wait until that version is released.) Is the dataset OK? The fact that the error appears after 750 batches suggests it is related to the data. Maybe the dataloader is reading a file list and not all of the images are there? Maybe one image is corrupted? If you run the command again, do you always get the same error at iteration 750, or at different iterations?
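For the dataset check, something like the sketch below might help rule out missing or empty files before training starts. It is only an illustration: `find_bad_files` and its `loader` argument are hypothetical names, not part of main.py, and the sketch assumes you can produce the same list of image paths that your dataloader reads.

```python
import os


def find_bad_files(paths, loader=None):
    """Return (path, reason) pairs for files that are missing, empty,
    or that make the optional loader raise an exception.

    `loader` is a hypothetical hook: pass whatever function your dataset
    uses to open an image, so the check exercises the same code path.
    """
    bad = []
    for path in paths:
        if not os.path.isfile(path):
            bad.append((path, "missing"))
            continue
        if os.path.getsize(path) == 0:
            bad.append((path, "empty"))
            continue
        if loader is not None:
            try:
                loader(path)
            except Exception as exc:  # report any load failure, keep scanning
                bad.append((path, "load failed: {}".format(exc)))
    return bad
```

Running it once over the training and validation lists and printing the result should show whether any image would fail to load, independently of the GPU setup.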
Yes, I tried PyTorch 0.3 and it works. Thanks very much.
My setup: PyTorch 0.4.0, CUDA 8.0, Python 3.6, multiple GPUs.
Could you give me some suggestions? Thanks a lot.