About Memory leak for the train_telephone split on Replica

facebookresearch / sound-spaces

A first-of-its-kind acoustic simulation platform for audio-visual embodied AI research. It supports training and evaluating multiple tasks and applications.

https://soundspaces.org

Creative Commons Attribution 4.0 International


About Memory leak for the train_telephone split on Replica #101

Open chenjinyubuaa opened 1 year ago

chenjinyubuaa commented 1 year ago

Hello! I ran into a memory leak when training the av_wan/av_nav models on the train_telephone split of the 1.0 environment.

1. The program uses more and more memory during training. At about 2000 update steps it takes over 100 GB of RAM, and the program stops because it runs out of RAM. This happens consistently across different torch versions.
2. The bug does not appear with train_telephone on MP3D or with train_multiple on Replica.
3. I have already tried rolling back to SoundSpaces 1.0, but that did not help.

Do you have any idea what is causing it?

ChanganVR commented 1 year ago

I actually did run into this issue multiple times. There were mainly two causes for me: 1. a mismatch between the PyTorch and CUDA versions, which reinstalling PyTorch with the right version fixed, and 2. sometimes setting torch.set_num_threads(1) in the main thread worked for me.
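For anyone trying these two checks, here is a minimal sketch rather than anything taken from the SoundSpaces code. The helper names `describe_torch_build` and `limit_torch_threads` are made up for illustration. It prints which CUDA version the installed PyTorch was built for, so you can compare it against your local CUDA install by hand, and it limits PyTorch to one intra-op thread.

```python
import importlib


def describe_torch_build():
    """Return PyTorch/CUDA build info, or None if torch is not installed."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against
        # (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


def limit_torch_threads(num_threads=1):
    """Limit intra-op CPU threads; returns the value now in effect."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    torch.set_num_threads(num_threads)
    return torch.get_num_threads()


if __name__ == "__main__":
    print(describe_torch_build())
    print(limit_torch_threads(1))
```

Calling `limit_torch_threads(1)` early in the main process, before training starts, is assumed here to match the "in the main thread" advice. Where exactly it belongs in the training script is not specified in this thread.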

Another way to debug it is to set USE_SYNC_ENV=True, which disables multi-process spawning of the environment and from there, you can play around with the number of threads and see if multi-process is an issue.
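When comparing runs with different thread counts or with the synchronous setting, it can help to log memory at regular update steps. The following is a rough sketch using only the Python standard library (Unix only). `peak_rss_mib` and `log_memory` are hypothetical helpers, not part of SoundSpaces. It reports the peak resident memory of the calling process only, so memory held by separately spawned environment workers is not included.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the calling process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(update_step, every=100):
    """Print peak memory every `every` updates; return the message or None."""
    if update_step % every != 0:
        return None
    message = f"update {update_step}: peak RSS {peak_rss_mib():.1f} MiB"
    print(message)
    return message
```

Calling `log_memory(update_step)` once per update in the training loop would show whether the peak keeps climbing, as described in the original report, or levels off after a configuration change.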

Star-down commented 2 months ago

@ChanganVR May I ask where I should set 'USE_SYNC_ENV=True'?

ChanganVR commented 1 month ago

@Star-down You can set it at the top level of the training/test config: https://github.com/facebookresearch/sound-spaces/blob/287184fd7067a0385558492716355c54875500ee/ss_baselines/av_wan/config/audionav/mp3d/test_with_am.yaml#L9