LuxDL / Lux.jl

Elegant & Performant Scientific Machine Learning in Julia
https://lux.csail.mit.edu/
MIT License

Irregular RAM usage with a large number of epochs on GPU #872

Open jdksjfisdf opened 1 week ago

jdksjfisdf commented 1 week ago

I was running a training loop that needs many single_train_step calls on my GPU (NVIDIA GeForce RTX 2060 Mobile) when I noticed irregular RAM (not video RAM) usage. I tested by changing the number of epochs in https://lux.csail.mit.edu/stable/tutorials/beginner/2_PolynomialFitting from 250 to 2,500,000, and the RAM usage of the single process (as reported by the KDE System Monitor) keeps increasing over time until it reaches 4.6 GB. The same issue does not happen if I disable LuxCUDA and run on the CPU. I think there is a memory leak.
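
For reference, here is a minimal sketch of what I am running: essentially the tutorial's loop with the epoch count raised. The data generation here is a stand-in, the hyperparameters are approximate, and the train helper just mirrors the tutorial's main function:

```julia
using Lux, LuxCUDA, Optimisers, Random, Zygote
using ADTypes: AutoZygote

# Stand-in for the tutorial's polynomial dataset (exact values differ).
rng = Xoshiro(0)
x = reshape(collect(Float32, range(-2, 2; length=128)), 1, :)
y = x .^ 2 .- 2 .* x .+ 0.1f0 .* randn(rng, Float32, size(x))

gdev = gpu_device()  # CUDADevice once LuxCUDA is loaded
x, y = x |> gdev, y |> gdev

model = Chain(Dense(1 => 16, relu), Dense(16 => 1))
ps, st = Lux.setup(rng, model) |> gdev
tstate = Training.TrainState(model, ps, st, Adam(0.03f0))

function train(tstate, data, epochs)
    for epoch in 1:epochs
        _, loss, _, tstate = Training.single_train_step!(AutoZygote(), MSELoss(), data, tstate)
    end
    return tstate
end

tstate = train(tstate, (x, y), 2_500_000)  # the tutorial uses 250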

avik-pal commented 1 week ago

I can reproduce this, but I don't think it is a memory leak. It is probably just Julia not freeing memory that it no longer needs. I tried adding a GC.gc(true) at the end of the run and it was able to free all the memory, which (I think) would not have been the case if it were an actual leak.

That said, 4.6 GB seems extremely high. When I run the job with very limited available memory (~2 GB), the memory usage saturates at a certain point. Can you try adding a GC.gc(true) at the end of every epoch and see if the memory usage still grows?
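
Something like this, assuming a loop along the lines of the tutorial's main (the train name is just for illustration):

```julia
function train(tstate, data, epochs)
    for epoch in 1:epochs
        _, loss, _, tstate = Training.single_train_step!(AutoZygote(), MSELoss(), data, tstate)
        GC.gc(true)  # force a full garbage collection after every step
    end
    return tstate
end
```

Note that GC.gc(true) forces a full collection, so this will slow training down considerably; it is only meant to check whether the memory is actually reclaimable.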

jdksjfisdf commented 1 week ago

I tested again. I was using JupyterLab, and GC.gc(true) did not work for me: I ran GC.gc(true) every 50,000 epochs and once at the end, and the memory usage was still 4.5 GB at the end. I don't know whether running under Jupyter or something else matters.
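
i.e. roughly this, continuing the sketch from my first comment, at the notebook's top level:

```julia
for epoch in 1:2_500_000
    _, loss, _, tstate = Training.single_train_step!(AutoZygote(), MSELoss(), (x, y), tstate)
    epoch % 50_000 == 0 && GC.gc(true)  # full collection every 50,000 epochs
end
GC.gc(true)  # once more at the very end
```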