0-5788719150923125 / vtx

an experiment

Training with HF datasets has a severe memory leak #30

Open Vectorrent opened 2 months ago

Vectorrent commented 2 months ago

I have been training with AIGen in Kaggle notebooks, and CPU memory climbs slowly over the course of several hours. Eventually the notebook runs out of memory (OOM) and training crashes.

I'm not sure where the leak is happening. I do know that it's in AIGen rather than VTX, and that it leaks system RAM rather than VRAM. I suspect the streaming dataloaders, since they are the only ones I'm using here, but I haven't had the bandwidth to troubleshoot yet.
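One way to start narrowing this down might be to iterate the dataloader on its own, without any training step, and watch whether memory grows. The sketch below is not AIGen or VTX code: `measure_growth` and `leaky_stream` are hypothetical names, and `leaky_stream` is only a stand-in for a streaming dataloader that holds on to every item. It uses the standard-library `tracemalloc` module, which tracks Python-level allocations only; if the growth comes from native buffers, this would show a flat line and process RSS would need to be watched instead.

```python
import tracemalloc
from typing import Iterable, Iterator, List, Tuple


def measure_growth(
    batches: Iterable, every: int = 100, max_steps: int = 1000
) -> List[Tuple[int, int]]:
    """Iterate over `batches`, recording (step, traced_bytes) every `every` steps.

    Hypothetical helper: only Python-level allocations are counted.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    samples: List[Tuple[int, int]] = []
    try:
        for step, _batch in enumerate(batches, start=1):
            if step % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((step, current))
            if step >= max_steps:
                break
    finally:
        # Only stop tracing if this function was the one that started it.
        if started_here:
            tracemalloc.stop()
    return samples


def leaky_stream(n: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a dataloader that retains every item it yields."""
    retained = []
    for _ in range(n):
        item = bytes(10_000)
        retained.append(item)  # the simulated leak
        yield item


if __name__ == "__main__":
    for step, traced in measure_growth(leaky_stream(500), every=100, max_steps=500):
        print(f"step {step}: {traced / 1e6:.2f} MB traced")
```

If the traced size keeps climbing when the real streaming dataloader is passed in place of `leaky_stream`, that would point at the loader (or something it calls) rather than the training loop.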