Open jc3342 opened 1 year ago
Hi @jc3342, thank you for your interest. I suspect it was caused by an incompatibility in your Horovod setup. Here are two things you may want to try:
First, run the code in single-GPU mode. For example, the single-GPU training command would be:
python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE
You may change the batch size accordingly.
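For example, with a smaller batch size (note: the --batch_size flag below is only an illustration, not a confirmed option of this repo; check the argument parser in train.py or the model's config dict for where the batch size is actually set):
python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE --batch_size 16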
Second, I'm not sure about your running environment, but I suggest checking the versions of OpenMPI and the other Horovod-related packages. One reference you may find useful: https://medium.com/@luca.diliello/build-horovod-from-source-on-linux-systems-428a5e5fa729
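A few quick sanity checks for the Horovod/MPI installation (assuming Horovod was installed via pip with its standard command-line tools on the PATH):
mpirun --version
horovodrun --check-build
python -c "import horovod; print(horovod.__version__)"
python -c "import torch; print(torch.__version__, torch.version.cuda)"
horovodrun --check-build should list PyTorch under the available frameworks and MPI under the available controllers; if either is missing, Horovod was likely built without MPI or PyTorch support and needs to be reinstalled.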
Hi, thanks for your reply.
I tried running the code in single-GPU mode with the script python train.py --nepoch 36 --comment lba --model lanegcn_lba --behavior_root PATH_TO_BEHAVIOR_DATABASE and decreased the batch size from 32 to 16. The run was killed after about 1 hour 40 minutes (it is always killed at iteration 6434). I checked the CPU usage with htop, and the reason is that swap memory is exhausted. Do you have any suggestions? Thank you!
My running environment is: Python 3.7.15, PyTorch 1.8.1, cudatoolkit 11.1.1, OpenMPI 4.1.4, Horovod 0.19.4, mpi4py 3.1.4.
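One way to confirm that the failure is host-RAM/swap pressure rather than GPU memory is to log process and system memory during training. A minimal sketch using psutil (not part of the repository's code):
import os
import psutil

proc = psutil.Process(os.getpid())

def log_memory(step):
    # Resident memory of this training process, plus system-wide RAM and swap usage.
    rss_gb = proc.memory_info().rss / 1e9
    ram = psutil.virtual_memory().percent
    swap = psutil.swap_memory().percent
    print(f"step {step}: rss={rss_gb:.1f} GB, ram={ram}%, swap={swap}%")
Calling log_memory(i) every few hundred iterations inside the training loop shows whether memory climbs steadily (a leak or accumulation in the dataloader) or is simply too high from the start.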
Hi, I am sorry that I have another question: could you please release the code to generate the preprocessed Argoverse Behavior database? Thank you so much!
Thank you for the suggestion. I am working on it. I realize it could be a better option than downloading a huge file.
I am not sure if it was caused by the swap memory limitation. Have you tried setting the number of workers of the dataloader to 0?
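For reference, a minimal sketch of that suggestion (a stand-in dataset is used here only to keep the example self-contained; in the repo the dataset is constructed inside train.py):
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset for illustration only.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 2))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,    # 0 = load batches in the main process, so no worker processes duplicate memory
    pin_memory=False, # pinned host memory also consumes RAM; keep it off when memory is tight
)

for x, y in loader:
    pass  # training step would go here
With num_workers=0 there are no extra dataloader worker processes, each of which would otherwise hold its own copy of the dataset object and its caches in host memory.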
The problem is solved. It was because I did not have enough CPU memory for training. Thanks!
Hi, thanks for your work, which is great! I tried to run the code but got this error:
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun noticed that process rank 1 with PID xxxxxx on node xxxxxxxx exited on signal 9 (Killed).
I tried decreasing the batch size and checked the GPU memory, but it did not help. Do you know how I can fix it? Thanks again!