ChanganVR / RelationalGraphLearning

[IROS20] Relational graph learning for crowd navigation

memory leak for multi-human policies #22

Open huiwenzhang opened 1 year ago

huiwenzhang commented 1 year ago

Hi, when running a multi-human policy such as sarl or lstm-rl, I noticed a drastic memory increase as training goes on: the used memory grew from about 4 GB to 20 GB after 100 training episodes. I debugged for a long time but still have no clue about what's going wrong. @ChanganVR Please have a look.

ChanganVR commented 1 year ago

@huiwenzhang No such issue has been reported before. Maybe you could check whether your PyTorch and CUDA versions are compatible; sometimes that can affect memory consumption.
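A quick, generic way to check which versions are in play (not from this thread, just a standard sanity check) is to print the PyTorch build info directly:

```python
import torch

# Print the installed PyTorch version, the CUDA toolkit it was built
# against (None for CPU-only builds), and whether a GPU is visible.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```

The CUDA version reported here is the one PyTorch was compiled against, which can differ from the locally installed toolkit.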

huiwenzhang commented 1 year ago

> @huiwenzhang No such issue has been reported before. Maybe you could check whether your PyTorch and CUDA versions are compatible; sometimes that can affect memory consumption.

I used PyTorch 2.0.1 built against CUDA 11.8; the locally installed CUDA version is 12.1. According to the official PyTorch docs, a newer local CUDA version is also supported. Besides, I didn't use the GPU, as you suggested, but the problem still exists. Training with the cadrl and rgl policies is fine. Do you have any other guesses about the memory leak?
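Since the leak shows up even on CPU, one way to locate it (a generic sketch, not repo-specific; `fake_episodes` below is a hypothetical stand-in for the real training loop) is to snapshot Python-level allocations with `tracemalloc` around a batch of episodes and print the call sites with the largest net growth:

```python
import tracemalloc

def top_memory_growth(run_episodes, limit=10):
    """Run `run_episodes` and return the call sites with the largest
    net allocation growth between the two snapshots."""
    tracemalloc.start(25)              # keep deep tracebacks for useful output
    before = tracemalloc.take_snapshot()
    run_episodes()                     # e.g. a loop over training episodes
    after = tracemalloc.take_snapshot()
    stats = after.compare_to(before, "lineno")
    tracemalloc.stop()
    return stats[:limit]

# Hypothetical stand-in for a leaky training loop: keeps growing a list.
leak = []
def fake_episodes():
    for _ in range(100):
        leak.append([0.0] * 1000)

for stat in top_memory_growth(fake_episodes):
    print(stat)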

ChanganVR commented 1 year ago

@huiwenzhang I see. I don't have a clue what could be causing the issue. You could debug by removing all the code and adding it back piece by piece until the issue reappears.
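While bisecting, one common culprit in PyTorch training loops worth checking first (just a generic pattern, not confirmed for this repo; the model and functions below are hypothetical) is storing loss tensors, instead of plain floats, across episodes. Each stored tensor keeps its whole autograd graph alive, so memory grows steadily with training:

```python
import torch

# Leaky pattern: appending the loss *tensor* retains its computation
# graph, so memory grows every episode.
def train_leaky(model, data, losses):
    loss = (model(data) ** 2).mean()
    losses.append(loss)          # BAD: keeps the autograd graph alive
    loss.backward()

# Fixed pattern: store a plain float (or a detached tensor) instead.
def train_fixed(model, data, losses):
    loss = (model(data) ** 2).mean()
    losses.append(loss.item())   # GOOD: just a Python float, no graph
    loss.backward()

model = torch.nn.Linear(4, 1)
losses = []
for _ in range(3):
    model.zero_grad()
    train_fixed(model, torch.randn(8, 4), losses)
print(losses)
```

The same applies to transitions pushed into a replay memory: anything derived from the forward pass should be `.detach()`-ed before being stored, otherwise graphs accumulate across the whole training run.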