GokuMohandas / Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.
https://madewithml.com
MIT License

trainer.fit() Object Memory Error in Local Machine #253

Open satyamnyati opened 7 months ago

satyamnyati commented 7 months ago

I run out of memory in trainer.fit(). I have 8 GB of RAM and an 8th-gen i7 CPU with 12 logical cores, and the run errors out during trainer.fit(). Is there any way to reduce the load on RAM, or will I need more RAM to be able to run this?

totovivi commented 6 months ago

These changes helped me:

But I am surprised that I had to make these changes, given my machine's specs.

satyamnyati commented 6 months ago

Yes, I changed the batch size and was able to get it running on Ubuntu. On Windows, however, I couldn't run it even with these changes; the path manipulations cause problems there.
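On the Windows path problems: a common cross-platform fix (a hedged sketch; the directory names here are illustrative, not the course's actual config) is to build every path with pathlib rather than hard-coded "/" separators:

```python
# A sketch of the usual cross-platform pattern; names are illustrative.
from pathlib import Path

ROOT_DIR = Path(__file__).parent.absolute()  # anchor paths to the script, not the current directory
DATA_DIR = ROOT_DIR / "data"                 # the "/" operator joins with the right separator per OS
DATA_DIR.mkdir(parents=True, exist_ok=True)

dataset_loc = DATA_DIR / "dataset.csv"
print(dataset_loc)                           # backslashes on Windows, forward slashes on Ubuntu
```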

capmichal commented 6 months ago

Same issue here, on both of my laptops (a powerful Dell XPS 15 and a weaker HP from work). After setting num_workers=1 and batch_size=32 it ran just fine; training the model took about 30 minutes.

If I work on a single machine, does num_workers affect this OOM problem, or do I only have to tune the batch_size variable? Does moving to 2 workers let me use a larger batch size? Or can my machine (each core) simply not handle more than batch_size=32?

I would love an explanation of how num_workers, resources_per_worker, and batch_size relate to my RAM.

In addition: using the GPU on my Dell solves this issue and trains the model in about a minute, but I am interested in CPU-only training. While the whole setup is running, my machine's RAM usage sits around 70%, which leaves only about 4-5 GB for training. That is simply not enough, so is there no configuration (short of using the GPU) that will let me train easily?
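For reference, here is a minimal sketch of how these three knobs fit together in the Ray Train setup the course uses (TorchTrainer + ScalingConfig). The values and the stub training function are illustrative assumptions, not the course's exact code. Each Ray worker is a separate process that loads its own copy of the model and reserves the resources in resources_per_worker, so RAM use scales with num_workers, while batch_size controls the memory each worker needs per step:

```python
# A minimal sketch assuming the Ray Train API (TorchTrainer + ScalingConfig);
# values and the stub train loop are illustrative, not the course's exact code.
from ray.train import ScalingConfig  # older Ray versions: from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    batch_size = config["batch_size"]  # smaller batches -> less memory per training step
    ...                                # the course's actual training code would run here

scaling_config = ScalingConfig(
    num_workers=1,                     # each worker is a process with its own model copy
    use_gpu=False,                     # CPU-only training
    resources_per_worker={"CPU": 1},   # CPUs reserved per worker (a scheduling hint, not a RAM cap)
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32},
    scaling_config=scaling_config,
)
results = trainer.fit()                # total RAM roughly: num_workers * (model copy + batch memory)
```

Note that resources_per_worker only tells Ray's scheduler what to reserve; it does not cap RAM, which is why lowering num_workers and batch_size is what actually relieves memory pressure.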

Koowah commented 4 months ago

@capmichal @totovivi I had the same reactions and questions.

After a bit of research and chatting with GPT, here's what I gathered:

So I guess the main issue for memory is how much data is processed per iteration (batch_size * num_workers). Using fewer workers should nevertheless be better, since each worker adds overhead and must load its own copy of the model.
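A rough back-of-envelope sketch of that reasoning (every number below is a made-up assumption for illustration; the real footprint depends on the model and the tokenized sequence lengths):

```python
# Illustrative memory estimate only; none of these numbers are measured.
model_copy_mb = 500                     # assumed size of one model copy (weights + optimizer state)
sample_mb = 1.5                         # assumed activations/gradients per sample
batch_size = 32
num_workers = 1

per_worker_mb = model_copy_mb + batch_size * sample_mb
total_mb = num_workers * per_worker_mb  # grows linearly in both knobs
print(f"~{total_mb:.0f} MB")            # ~548 MB under these assumptions
```

Doubling num_workers duplicates the model-copy term as well as the batch term, which is why a single worker with a modest batch size is usually the cheapest CPU-only configuration.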