CharlieCheckpt closed this issue 2 years ago.
@CharlieCheckpt I will try to repro this and get back to you. As I'm sure you know, these memory issues are hard to debug, so as a short-term solution I would recommend requesting more RAM if possible.
Can you please send full environment information?
CUDA version, package version list, Python version, OS, etc.
You can also try 1) lowering the number of dataloader workers, or 2) using MMAP when loading data.
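For illustration, a minimal sketch of both suggestions in plain PyTorch/NumPy (the file name and shapes are made up; in VISSL these behaviours are driven by config options rather than hand-written code):

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

# Create a small stand-in .npy file so the sketch is self-contained.
np.save("features.npy", np.random.rand(1000, 128).astype(np.float32))

class MmapDataset(Dataset):
    """Memory-maps a .npy file instead of loading it fully into RAM."""

    def __init__(self, path):
        # mmap_mode="r" keeps the array on disk; pages are read in lazily.
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Copy only this one sample into memory.
        return torch.from_numpy(np.array(self.data[idx]))

# Fewer workers means fewer forked processes holding dataset state in RAM.
loader = DataLoader(MmapDataset("features.npy"), batch_size=32, num_workers=2)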
Hello, thank you for your answers!
@prigoyal Indeed, there were too many dataloader workers specified in the config used to resume the experiment. NUM_DATALOADER_WORKERS was 5 in my original experiment and 10 when resuming. Reducing the number of workers solved the problem.
Can you explain why this parameter has an impact on RAM usage?
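For background, each PyTorch DataLoader worker is a separate process, forked from the trainer on Linux. Copy-on-write should make the fork cheap, but CPython's reference counting writes to every object a worker touches, so each extra worker can end up materializing its own copy of a large Python-object dataset. A toy sketch for observing this (it assumes psutil is installed, and is not VISSL's actual data path):

import os
import psutil
from torch.utils.data import DataLoader, Dataset

class ListDataset(Dataset):
    """Toy dataset backed by a big list of Python strings (think: image
    paths). Refcount updates in the workers defeat copy-on-write sharing."""

    def __init__(self, n):
        self.items = [f"image_{i:09d}.jpg" for i in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return len(self.items[idx])  # touching the string dirties its page

def total_rss_gb():
    """Resident memory of this process plus its worker children, in GB.
    Shared pages are double-counted, so treat this as a rough upper bound."""
    me = psutil.Process(os.getpid())
    return sum(p.memory_info().rss
               for p in [me] + me.children(recursive=True)) / 1e9

if __name__ == "__main__":
    dataset = ListDataset(5_000_000)
    for workers in (1, 4):
        it = iter(DataLoader(dataset, batch_size=256, num_workers=workers))
        for _ in range(1000):  # let the workers touch part of the list
            next(it)
        print(f"num_workers={workers}: total RSS ~ {total_rss_gb():.2f} GB")
        del it  # shut the workers down before the next round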
Actually, what I said above is wrong.
For another experiment, I still encounter the issue even though I had NUM_DATALOADER_WORKERS=3 both during the original script launch and when resuming the training. I tried to resume with NUM_DATALOADER_WORKERS=2 but still encounter the issue.
EDIT: I also looked at mmap, and MMAP_MODE is already set to True in my config.
EDIT 2: For this new experiment I have more images and more RAM: 43 million images, and 720GB of RAM on each machine (2 machines with 8 GPUs). I can't increase the memory any further. The large number of images is probably causing the OOM. Do you see any other way to reduce memory consumption when resuming the training?
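For what it's worth, one general PyTorch-side mitigation (a sketch only; I can't confirm VISSL exposes this) is to hold the image list as a fixed-width NumPy byte array instead of a Python list of strings, so the forked workers keep sharing those pages instead of copying them one refcount write at a time:

import numpy as np
from torch.utils.data import Dataset

class PathsDataset(Dataset):
    """Stores file paths as one fixed-width NumPy byte array. The array's
    pages contain no Python refcounts, so worker reads stay copy-on-write."""

    def __init__(self, paths):
        # dtype="S" packs everything into a single allocation of raw bytes.
        self.paths = np.array([p.encode("utf-8") for p in paths], dtype="S")

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx].decode("utf-8")
        # ...load and transform the image at `path` here...
        return path

dataset = PathsDataset([f"/data/images/{i}.jpg" for i in range(1000)])
print(len(dataset), dataset[0])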
Hello! In case it helps, the line where the RAM increases dramatically (by 500GB or more) is:
self.data_iterator = iter(self.dataloaders[phase_type])
located here.
I am still trying to understand what is going on with this iter(), and why there is no issue during the first training but there is one when resuming from a checkpoint.
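For context, with num_workers > 0, iter() on a DataLoader is exactly the point where PyTorch forks the worker processes, so whatever the parent holds in memory at that moment is what copy-on-write starts duplicating. A minimal demonstration (toy code assuming psutil, not the VISSL trainer):

import os
import psutil
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    me = psutil.Process(os.getpid())
    dataset = TensorDataset(torch.arange(10_000, dtype=torch.float32))
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    print("child processes before iter():", len(me.children()))  # 0
    it = iter(loader)  # _MultiProcessingDataLoaderIter forks 4 workers here
    print("child processes after iter(): ", len(me.children()))  # 4

That at least pins down where the jump comes from; why the process would hold so much more memory at that moment when resuming than during the first run is still the open question.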
@CharlieCheckpt Sorry for the delay here. Have you made any additional progress?
Can you please send full environment information from:
wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py
I've never had this problem before -- so I'm a bit worried that it may be a PyTorch issue specific to your env. Let's try to bisect the problem.
Can you also send the full configs from both the train and eval environment?
Hi @iseessel, thanks for following up.
It is also difficult for me to debug these costly experiments, because I'm working on a server with limited credits.
What I can tell you for now is that I couldn't reproduce this RAM increase on a personal server (with fewer images: 1M instead of 43M). I ran other experiments recently and did not notice this behaviour anymore, so it's difficult to know what is going on.
Maybe we can close this issue and re-open it if it happens again?
Yeah, sounds good -- there are lots of potentially complicated dynamics at play here, and unless necessary, I wouldn't recommend going down a rabbit hole on this.
Hi,
I tried to resume the training of a mocov2 experiment from a given checkpoint, but got a CPU OOM. This is weird because I did not encounter this issue in the initial run, and I have the exact same resources. It seems linked to #235, but I couldn't find a way to make it work. Do you have any idea what can cause this, and how to solve it?
Instructions To Reproduce the 🐛 Bug:
What changes did you make: None
What exact command did you run:
And the .err log of one of the nodes, which is probably the most important:
1. Train a model such as resnet50 with mocov2 and get a checkpoint.
2. Resume training from the checkpoint and look at the CPU memory usage.
Expected behavior:
I would expect resuming the training to take the same amount of memory as the initial training did.
Environment:
4 GPUs of 32GB, on 4 nodes with 180GB of RAM. I did not run the environment command because I am using SLURM, which distributes the training across other machines.