facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

CUDA error: out of memory #274

Open · Monica9502 opened this issue 4 years ago

Monica9502 commented 4 years ago

I'm training an XLM model with MLM+TLM for English and Spanish, but I get an OOM error. The log is as follows:

```
INFO - 03/05/20 16:22:31 - 0:02:34 - Number of parameters (model): 3834768422
INFO - 03/05/20 16:22:41 - 0:02:44 - Found 0 memories.
INFO - 03/05/20 16:22:41 - 0:02:44 - Found 12 FFN.
INFO - 03/05/20 16:22:41 - 0:02:44 - Found 198 parameters in model.
Traceback (most recent call last):
  File "train.py", line 327, in <module>
    main(params)
  File "train.py", line 240, in main
    trainer = SingleTrainer(model, data, params)
  File "/data/mayili/nlp/XLM/src/trainer.py", line 800, in __init__
    super().__init__(data, params)
  File "/data/mayili/nlp/XLM/src/trainer.py", line 66, in __init__
    self.set_optimizers()
  File "/data/mayili/nlp/XLM/src/trainer.py", line 166, in set_optimizers
    self.optimizers['model'] = get_optimizer(self.parameters['model'], params.optimizer)
  File "/data/mayili/nlp/XLM/src/optim.py", line 270, in get_optimizer
    return optim_fn(parameters, **optim_params)
  File "/data/mayili/nlp/XLM/src/optim.py", line 40, in __init__
    state['exp_avg_sq'] = torch.zeros_like(p.data)
RuntimeError: CUDA error: out of memory
```

And my command is:

```
export CUDA_VISIBLE_DEVICES=3,4
export CUDA_LAUNCH_BLOCKING=1
export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name xlm_en_es \
    --dump_path ./dumped \
    --data_path ./data/processed/XLM_en_es/50k \
    --lgs 'en-es' \
    --clm_steps '' \
    --mlm_steps 'en,es,en-es' \
    --emb_dim 1024 \
    --n_layers 12 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 8 \
    --bptt 32 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 10000 \
    --max_epoch 100000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,15 \
    --fp16 true \
    --amp 1 \
    --tokens_per_batch 100 \
    --max_batch_size 16
```

I have also tried 4 GPUs with 32 GB, but it still doesn't work. Can you give some tips?
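For reference, a back-of-the-envelope estimate (a rough sketch based only on the numbers in the log above, not code from the XLM repo) suggests a model this size cannot fit on a single card: the log reports roughly 3.8B parameters, which already looks unusually large for emb_dim 1024 / 12 layers (the vocabulary size may be worth checking), and the traceback shows the OOM happening exactly while Adam allocates its per-parameter state:

```python
# Rough memory estimate from the log above (an illustrative sketch, not XLM code).
# Adam keeps two extra fp32 buffers per parameter (exp_avg and exp_avg_sq),
# and gradients add another full copy of the weights.

n_params   = 3_834_768_422      # "Number of parameters (model)" from the log
bytes_fp32 = 4

weights    = n_params * bytes_fp32       # model weights (with amp O1 the weights stay fp32)
grads      = n_params * bytes_fp32       # gradients
adam_state = 2 * n_params * bytes_fp32   # exp_avg + exp_avg_sq (the allocation that fails)

print(f"~{(weights + grads + adam_state) / 1024**3:.0f} GiB")   # ~57 GiB, before any activations
```

That is well beyond a 32 GB V100, which is why the failure happens in `set_optimizers` before a single batch is processed.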

saikoneru commented 4 years ago

Try decreasing the embedding dimension to 512

Tikquuss commented 3 years ago

@arjunkoneru wouldn't it be better to reduce the batch_size? That's what I did on my side, so as not to change the general structure of the model.
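The two suggestions target different parts of the memory budget. A rough sketch of which knob moves which part (hypothetical helper names, nothing from the XLM codebase):

```python
# Illustrative sketch of the two sides of the GPU memory budget.

def parameter_side_gb(n_params, adam=True, bytes_per_param=4):
    """Weights + gradients (+ Adam exp_avg/exp_avg_sq).
    Driven by emb_dim, n_layers and vocabulary size -- not by batch size."""
    copies = 4 if adam else 2
    return n_params * bytes_per_param * copies / 1024**3

def activation_side_scale(tokens_per_batch, emb_dim, n_layers):
    """Activations grow roughly with tokens_per_batch * emb_dim * n_layers,
    so batch_size / tokens_per_batch / bptt only shrink this side."""
    return tokens_per_batch * emb_dim * n_layers
```

In the traceback above the OOM happens while the optimizer state is being allocated, before any batch is built, so shrinking the batch mostly helps OOMs that occur during forward/backward; when the failure is already in `set_optimizers`, the parameter side (emb_dim, vocabulary size) is the knob that matters.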

colmantse commented 3 years ago

I actually have a similar issue. The only difference is that I didn't get a CUDA error, but it doesn't log any further either.

My setup is 8 V100 GPUs.

The end of the log before it freezes is:

```
INFO - 01/30/21 18:21:43 - 0:00:39 - Number of parameters (model): 64889361
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 0 memories.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 6 FFN.
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
INFO - 01/30/21 18:21:44 - 0:00:40 - Found 102 parameters in model.
INFO - 01/30/21 18:21:44 - 0:00:40 - Using nn.parallel.DistributedDataParallel ...
```

However, nvidia-smi shows that 7 of the 8 GPUs are at 100% utilization while 1 is at 0%. Also, top shows 7 python processes at 100% CPU usage.

I am not sure what to do because it doesn't log "Starting epoch 0" and just remains silent. Also, instead of all 8 GPUs being used, only 7 are running.

Below is the command I used to start the training:

```
export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name test_enzh_mlm \
    --dump_path ./dumped/ \
    --data_path ./data/processed/en-zh/ \
    --lgs 'en-zh' \
    --clm_steps 'en,zh' \
    --mlm_steps 'en,zh' \
    --emb_dim 512 \
    --n_layers 6 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --batch_size 16 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 200000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,10
```
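A hang before the "Optimizers" log line, with 7 of 8 GPUs spinning at 100%, often means one rank never finished joining the NCCL process group, so the other workers block in their first collective. One way to narrow it down is a minimal DDP smoke test launched the same way as train.py; the file name `ddp_smoke_test.py` below is just for illustration, and setting `NCCL_DEBUG=INFO` usually makes NCCL print which rank/device is stuck:

```python
# ddp_smoke_test.py -- minimal sanity check (independent of XLM) that all ranks
# can initialize NCCL and complete one all_reduce.
# Launch like train.py, e.g.:
#   NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=8 ddp_smoke_test.py

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

x = torch.ones(1, device="cuda")
dist.all_reduce(x)   # this is where a run with one missing or stuck rank will hang
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok, value={x.item()}")
```

If this also hangs, the problem is in the multi-GPU/NCCL setup (driver, topology, a bad device) rather than in the XLM training code.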

saikoneru commented 3 years ago

@colmantse can you just try with a single GPU, with `python train.py ...`?

colmantse commented 3 years ago

lemme try

colmantse commented 3 years ago

I got it running with no problem so far; at least it's constantly logging:

```
INFO - 01/30/21 19:33:04 - 0:00:23 - Number of parameters (model): 64889361
INFO - 01/30/21 19:33:06 - 0:00:25 - Found 0 memories.
INFO - 01/30/21 19:33:06 - 0:00:25 - Found 6 FFN.
INFO - 01/30/21 19:33:06 - 0:00:25 - Found 102 parameters in model.
INFO - 01/30/21 19:33:06 - 0:00:25 - Optimizers: model
INFO - 01/30/21 19:33:06 - 0:00:25 - ============ Starting epoch 0 ... ============
INFO - 01/30/21 19:33:06 - 0:00:25 - Creating new training data iterator (causal,en) ...
INFO - 01/30/21 19:34:13 - 0:01:32 - 120 - 115.70 sent/s - 16941.32 words/s - CLM-en: 7.5231 || CLM-zh: 8.4312 || MLM-en: 6.9118 || MLM-zh: 7.8224 - - model LR: 1.0000e-04
[W IndexingUtils.h:25] Warning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (function expandTensors)
```

So this has to do with multi-GPU?

colmantse commented 3 years ago

So the single-GPU version worked. However, when I switched back to 8 GPUs in an 8-GPU environment, it freezes before the optimizer log.

Edit: if I select 7 GPUs instead of 8, the model trains. I also double-checked in a Python shell that torch.cuda.device_count() gives 8, so PyTorch has access to all 8 GPUs, but the multi-GPU script only lets me use 7, otherwise it freezes. It would be great to know if there is a way to fully utilize all 8 GPUs.
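Since torch.cuda.device_count() sees all 8 devices but the run only works with 7, it may be worth checking each GPU individually. A small sketch (not part of XLM) that runs a tiny computation on every visible device:

```python
# Per-GPU sanity check (illustrative sketch): run a small matmul on each visible
# device to see whether one particular GPU is the one stalling the 8-GPU run.

import torch

for i in range(torch.cuda.device_count()):
    try:
        with torch.cuda.device(i):
            a = torch.randn(1024, 1024, device="cuda")
            s = (a @ a).sum()
            torch.cuda.synchronize()
        print(f"GPU {i} ({torch.cuda.get_device_name(i)}): ok, checksum={s.item():.1f}")
    except RuntimeError as err:
        print(f"GPU {i}: FAILED -> {err}")
```

If one device consistently misbehaves, excluding it with CUDA_VISIBLE_DEVICES (which is effectively what running on 7 GPUs does) is a reasonable workaround. NCCL peer-to-peer trouble between one particular pair of GPUs is another common cause of this kind of hang; setting the standard NCCL environment variable NCCL_P2P_DISABLE=1 is sometimes worth trying to confirm or rule that out.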