stuartcrobinson closed this issue 2 years ago
Actually, something weird is happening... I think there is a memory leak somewhere. I get the memory error after starting and immediately stopping training a few times in a row, but not when running a single training command for a long time. I'm going to close this for now and keep investigating.
I've never experienced this before. I've had 100 or more runs with early stopping via Ctrl+C. If you can give more details, that would be helpful.
Never mind, sorry about this. I was mistakenly ending training with Ctrl-Z, which only suspends the process and left it running in the background. I thought I saw some docs recommend Ctrl-Z, but I misread. I'm still curious why CUDA has to be initialized even when everything is set to use the CPU, but it no longer affects me. Thank you for an amazing repo, by the way.
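For reference, Ctrl-Z sends SIGTSTP, which suspends the job but leaves it alive and still holding its GPU memory. A minimal sketch of cleaning up such a job (the `sleep` is a stand-in for a training process, not the actual command from this issue):

```shell
sleep 30 &              # stand-in for a training run left in the background
PID=$!
jobs                    # a Ctrl-Z'd job would show up here as "Stopped"
kill "$PID"             # actually terminate it (Ctrl-C sends SIGINT to the foreground job)
wait "$PID" 2>/dev/null || true
```

Once the stopped process is killed, the driver releases its GPU memory; no reboot is needed.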
I'm running this command to "play" my trained model without using the GPU:
But I still sometimes get this CUDA memory error if I run it while a model is being trained in a different terminal window:
I asked on the NVIDIA forum too, but thought I would check here in case it's an unavoidable rl_games thing:
https://forums.developer.nvidia.com/t/play-a-checkpoint-file-without-using-gpu-at-all-to-avoid-memory-errors/212764
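One way to guarantee a process never initializes CUDA is to hide the devices before anything CUDA-aware is imported. This is generic PyTorch/CUDA behavior, not an rl_games-specific flag; a minimal sketch:

```python
import os

# Must happen BEFORE torch (or anything else that initializes CUDA) is
# imported; once CUDA is initialized, changing this has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# From here on the process sees no GPUs, so it cannot allocate GPU memory
# even if a config requests a cuda device:
# import torch
# torch.cuda.is_available()  # False, since no devices are visible
```

The same thing can be done from the shell by prefixing the launch command with `CUDA_VISIBLE_DEVICES=`.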
Also, the memory error persists until I reboot. Is that a memory leak, or is there any way rl_games could clear the GPU memory?
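GPU memory is owned per process and is released when the owning process exits, so a "leak" that survives until reboot usually means an orphaned process (for example, one stopped with Ctrl-Z) is still holding it. A hedged sketch of finding and stopping it without rebooting (`train.py` is a placeholder, not the actual entry point from this issue):

```shell
# List processes whose command line matches the trainer; on a machine with
# an NVIDIA GPU, `nvidia-smi` also shows the PIDs currently holding memory.
pgrep -af "train.py" || true

# Terminate them (SIGTERM); the driver frees their GPU memory on exit.
# pkill -f "train.py"
```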