GokuMohandas / Made-With-ML

Learn how to design, develop, deploy and iterate on production-grade ML applications.
https://madewithml.com
MIT License

trainer.fit() Object Memory Error in Local Machine #253

Open satyamnyati opened 7 months ago

satyamnyati commented 7 months ago

I run out of memory in trainer.fit(). I have 8 GB of RAM and an 8th-gen i7 CPU with 12 logical cores, and the run errors out during trainer.fit(). Is there any way to reduce the load on RAM, or will I need more RAM to be able to run this?

totovivi commented 6 months ago

These changes helped me:

But I am surprised that I had to make these changes, given my machine's specs.

satyamnyati commented 6 months ago

Yes, I changed the batch size and was able to get it running on Ubuntu. On Windows, however, I couldn't run it even with these changes; the path manipulations cause problems there.
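On the Windows path problems: a common cross-platform fix (a hedged sketch; the directory names here are illustrative, not the course's actual config) is to build every path with pathlib rather than hard-coded "/" separators:

```python
# A sketch of the usual cross-platform pattern; names are illustrative.
from pathlib import Path

ROOT_DIR = Path(__file__).parent.absolute()  # anchor paths to the script, not the current directory
DATA_DIR = ROOT_DIR / "data"                 # the "/" operator joins with the right separator per OS
DATA_DIR.mkdir(parents=True, exist_ok=True)

dataset_loc = DATA_DIR / "dataset.csv"
print(dataset_loc)                           # backslashes on Windows, forward slashes on Ubuntu
```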

capmichal commented 6 months ago

Same issue here, on both of my laptops (a powerful Dell XPS 15 and a weaker HP from work). After setting num_workers=1 and batch_size=32 it ran just fine; training the model took about 30 minutes.

If I work on a single machine, does num_workers affect this OOM problem, or do I only have to tune the batch_size variable? Does moving to 2 workers let me use a larger batch size? Or can my machine (each core) simply not handle more than batch_size=32?

I would love an explanation of how num_workers, resources_per_worker, and batch_size relate to my RAM.

In addition: using the GPU on my Dell solves this issue and trains the model in about a minute, but I am interested in CPU-only training. While the whole setup is running, my machine's RAM usage sits around 70%, which leaves only about 4-5 GB for training. That is simply not enough, so is there no configuration (short of using the GPU) that will let me train easily?
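For reference, here is a minimal sketch of how these three knobs fit together in the Ray Train setup the course uses (TorchTrainer + ScalingConfig). The values and the stub training function are illustrative assumptions, not the course's exact code. Each Ray worker is a separate process that loads its own copy of the model and reserves the resources in resources_per_worker, so RAM use scales with num_workers, while batch_size controls the memory each worker needs per step:

```python
# A minimal sketch assuming the Ray Train API (TorchTrainer + ScalingConfig);
# values and the stub train loop are illustrative, not the course's exact code.
from ray.train import ScalingConfig  # older Ray versions: from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    batch_size = config["batch_size"]  # smaller batches -> less memory per training step
    ...                                # the course's actual training code would run here

scaling_config = ScalingConfig(
    num_workers=1,                     # each worker is a process with its own model copy
    use_gpu=False,                     # CPU-only training
    resources_per_worker={"CPU": 1},   # CPUs reserved per worker (a scheduling hint, not a RAM cap)
)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 32},
    scaling_config=scaling_config,
)
results = trainer.fit()                # total RAM roughly: num_workers * (model copy + batch memory)
```

Note that resources_per_worker only tells Ray's scheduler what to reserve; it does not cap RAM, which is why lowering num_workers and batch_size is what actually relieves memory pressure.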

Koowah commented 4 months ago

@capmichal @totovivi I had the same reactions and questions.

After a bit of research and chatting with GPT, here's what I gathered:

So I guess the main issue for memory is how much data is processed per iteration (batch_size * num_workers). Using fewer workers should nevertheless be better, since each worker adds overhead and must load its own copy of the model.
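A rough back-of-envelope sketch of that reasoning (every number below is a made-up assumption for illustration; the real footprint depends on the model and the tokenized sequence lengths):

```python
# Illustrative memory estimate only; none of these numbers are measured.
model_copy_mb = 500                     # assumed size of one model copy (weights + optimizer state)
sample_mb = 1.5                         # assumed activations/gradients per sample
batch_size = 32
num_workers = 1

per_worker_mb = model_copy_mb + batch_size * sample_mb
total_mb = num_workers * per_worker_mb  # grows linearly in both knobs
print(f"~{total_mb:.0f} MB")            # ~548 MB under these assumptions
```

Doubling num_workers duplicates the model-copy term as well as the batch term, which is why a single worker with a modest batch size is usually the cheapest CPU-only configuration.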