SteveImmanuel / SegGPT-FineTune

Fine-tune SegGPT model with custom datasets
MIT License

Out of memory when training #2

Closed · Oshwiciqwq closed this issue 4 months ago

Oshwiciqwq commented 4 months ago

Thanks for your great work. I tried to use the code to fine-tune on a dataset of 100k images, but got an error. The Python error message:

Traceback (most recent call last):
  File "/mnt/sdc/jinliankai/code/seg1/train.py", line 98, in <module>
    mp.spawn(main, nprocs=world_size, args=(world_size, train_args, args.port))
  File "/root/anaconda3/envs/seg/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/seg/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/seg/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
(seg) root@ubuntu:/mnt/sdc/jinliankai/code/seg1# /root/anaconda3/envs/seg/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown

And I checked the reason was out of memory:

[Wed Jun 19 04:01:48 2024] Out of memory: Killed process 1906131 (python) total-vm:279474328kB, anon-rss:231214868kB, file-rss:73844kB, shmem-rss:588812kB, UID:0 pgtables:460292kB oom_score_adj:0

It happened after caching the images and before training started; the log stopped at:

INFO:Agent:Training Phase 0%| | 0/22159 [00:00<?, ?it/s]

However, when training on 1 GPU it worked well for the first 2 epochs, then hit the same error on the 3rd epoch. During training, the process used about 20% of memory most of the time, but it sometimes rose to 40% or more. My machine has 500 GB of memory in total. I wonder if the code has a memory leak, or if the RAM of my machine is simply insufficient. Sorry to bother you. I am new to machine learning and would really appreciate your help.

SteveImmanuel commented 4 months ago

Hi, thanks for reporting the issue. I am currently at a conference, so I will take a look as soon as possible and get back to you. Sorry for the inconvenience.

SteveImmanuel commented 4 months ago

I implemented the current dataloader to load all the labels and images once at initialization to improve I/O speed. This in turn loads all of them into your RAM.

When you use more than 1 GPU, each GPU spawns a new process, and the cache is not shared between processes. So, if one process uses 100 GB of RAM, then running on 4 GPUs will require ~400 GB of RAM.
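As a rough illustration of how the duplication arises (a hypothetical sketch, not the actual code in this repo; the class name and tensor sizes are made up): torch.multiprocessing.spawn starts one independent process per rank, and each process constructs its own dataset object, so anything cached in __init__ exists once per GPU.

```python
# Hypothetical sketch: each spawned rank builds its own dataset,
# so a cache filled in __init__ is duplicated world_size times.
import numpy as np
import torch.multiprocessing as mp
from torch.utils.data import Dataset


class PreloadedDataset(Dataset):
    def __init__(self, n_samples=100):
        # every sample is decoded into RAM up front;
        # with N spawned processes this buffer exists N times
        self.cache = [np.zeros((3, 448, 448), dtype=np.float32) for _ in range(n_samples)]

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, idx):
        return self.cache[idx]


def main(rank, world_size):
    dataset = PreloadedDataset()  # one full copy of the cache per process
    print(f'rank {rank}/{world_size}: cached {len(dataset)} samples')


if __name__ == '__main__':
    world_size = 4  # e.g. 4 GPUs -> 4 full copies of the cache in RAM
    mp.spawn(main, nprocs=world_size, args=(world_size,))
```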

This behaviour can be changed in data.py. You can disable the _preload_dataset call and instead load each sample on the fly in __getitem__.
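A minimal sketch of what that on-the-fly version could look like (the attribute names image_paths and label_paths are illustrative, not the actual ones in data.py):

```python
# Hypothetical lazy-loading variant: keep only file paths in memory
# and read each image/label pair from disk inside __getitem__.
import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class LazySegDataset(Dataset):
    def __init__(self, image_paths, label_paths):
        # store paths only; no call to _preload_dataset here
        self.image_paths = image_paths
        self.label_paths = label_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # samples are read from disk when requested, so RAM usage stays
        # roughly constant regardless of dataset size or number of GPUs
        image = np.array(Image.open(self.image_paths[idx]).convert('RGB'))
        label = np.array(Image.open(self.label_paths[idx]))
        return image, label
```

The trade-off is that disk I/O now happens at every step instead of once at startup; increasing num_workers in the DataLoader usually hides most of that cost.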