Open jc3342 opened 1 year ago
Hi @jc3342, thank you for your interest. I suspect it was caused by an incompatibility in your Horovod setup. Here are two things you may want to try:
First, run the code in single-GPU mode. For example, the single-GPU training command would be:
python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE
You may change the batch size accordingly.
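For example, with a smaller batch size (note: the --batch_size flag below is only an illustration, not a confirmed option of this repo; check the argument parser in train.py or the model's config dict for where the batch size is actually set):
python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE --batch_size 16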
Second, I'm not sure about your running environment, but I suggest checking the versions of OpenMPI and the other Horovod-related packages. One reference you may find useful: https://medium.com/@luca.diliello/build-horovod-from-source-on-linux-systems-428a5e5fa729
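A few quick sanity checks for the Horovod/MPI installation (assuming Horovod was installed via pip with its standard command-line tools on the PATH):
mpirun --version
horovodrun --check-build
python -c "import horovod; print(horovod.__version__)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"
horovodrun --check-build should list PyTorch under the available frameworks and MPI under the available controllers; if either is missing, Horovod was likely built without MPI or PyTorch support and needs to be reinstalled.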
Hi, thanks for your reply.
I tried running the code in single-GPU mode with the script python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE and decreased the batch size from 32 to 16. The run was killed after about 1 hour 40 minutes (it is always killed at iteration 6434). I checked the CPU usage with htop, and the reason is that swap memory is exhausted. Do you have any suggestions? Thank you!
My running environment is: Python 3.7.15, PyTorch 1.8.1, cudatoolkit 11.1.1, OpenMPI 4.1.4, Horovod 0.19.4, mpi4py 3.1.4.
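One way to confirm that the failure is host-RAM/swap pressure rather than GPU memory is to log process and system memory during training. A minimal sketch using psutil (not part of the repository's code):
import os
import psutil

proc = psutil.Process(os.getpid())

def log_memory(step):
    # Resident memory of this training process, plus system-wide RAM and swap usage.
    rss_gb = proc.memory_info().rss / 1e9
    ram = psutil.virtual_memory().percent
    swap = psutil.swap_memory().percent
    print(f"step {step}: rss={rss_gb:.1f} GB, ram={ram}%, swap={swap}%")
Calling log_memory(i) every few hundred iterations inside the training loop shows whether memory climbs steadily (a leak or accumulation in the dataloader) or is simply too high from the start.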
Hi, I am sorry that I have another question: could you please release the code to generate the preprocessed Argoverse Behavior database? Thank you so much!
Thank you for the suggestion. I am working on it. I realize it could be a better option than downloading a huge file.
I am not sure if it was caused by the swap memory limitation. Have you tried setting the number of workers of the dataloader to 0?
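For reference, a minimal sketch of that suggestion (a stand-in dataset is used here only to keep the example self-contained; in the repo the dataset is constructed inside train.py):
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset for illustration only.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 2))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,    # 0 = load batches in the main process, so no worker processes duplicate memory
    pin_memory=False, # pinned host memory also consumes RAM; keep it off when memory is tight
)

for x, y in loader:
    pass  # training step would go here
With num_workers=0 there are no extra dataloader worker processes, each of which would otherwise hold its own copy of the dataset object and its caches in host memory.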
The problem is solved. It was because I did not have enough CPU memory for training. Thanks!
Hi, thanks for your work, which is great! I tried to run the code but got this error:
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun noticed that process rank 1 with PID xxxxxx on node xxxxxxxx exited on signal 9 (Killed).
I tried decreasing the batch size and checked the GPU memory, but it did not help. Do you know how I can fix it? Thanks again!