facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

FineTune Wav2Vec2.0, CUDA OOM #2633

Closed 2Bye closed 2 years ago

2Bye commented 4 years ago

What is my question?

I am fine-tuning a Wav2Vec 2.0 Large (LV-60) model on my dataset (35 hours) for an ASR task. Training runs for 15-20 epochs, after which a CUDA OUT OF MEMORY error appears.

What have I tried?

I tried changing the per-GPU --max-tokens parameter.

What's my environment?

I use the NVIDIA NeMo 0.10 Docker container.

Run command

python train.py --distributed-world-size 4 \
ru_manifest/ \
--save-dir checkpoint_wer/ \
--fp16 \
--post-process letter \
--valid-subset valid \
--no-epoch-checkpoints \
--best-checkpoint-metric wer \
--num-workers 32 \
--max-update 400000 \
--sentence-avg \
--task audio_pretraining \
--arch wav2vec_ctc \
--w2v-path checkpoints_wav2vec/wav2vec_vox.pt \
--labels ltr \
--apply-mask \
--mask-selection static \
--mask-other 0 \
--mask-length 10 \
--mask-prob 0.65 \
--layerdrop 0.1 \
--mask-channel-selection static \
--mask-channel-other 0 \
--mask-channel-length 64 \
--mask-channel-prob 0.256 \
--zero-infinity \
--feature-grad-mult 0.0 \
--freeze-finetune-updates 10000 \
--validate-after-updates 10000 \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--adam-eps 1e-08 \
--lr 2e-05 \
--lr-scheduler tri_stage \
--warmup-steps 4000 \
--hold-steps 8000 \
--decay-steps 10000 \
--final-lr-scale 0.05 \
--final-dropout 0.0 \
--dropout 0.0 \
--activation-dropout 0.1 \
--criterion ctc \
--attention-dropout 0.0 \
--max-tokens 900000 \
--seed 2337 \
--log-format json \
--log-interval 500 \
--ddp-backend no_c10d \
--tensorboard-logdir tensorboard-wav2vec/

008karan commented 4 years ago

@byebye1 have you found any solution to the OOM issue? I'm getting a similar issue: after 8 epochs it shows OOM.

medabalimi commented 3 years ago

Try setting the --empty-cache-freq option. Alternatively, reduce --max-tokens. This seems to be a GPU RAM fragmentation issue.

ArtemisZGL commented 3 years ago

I hit this issue after fine-tuning for 3 epochs on train-clean-100 using one 2080 Ti. The strange thing is that everything was fine for the first 3 epochs, but if I load the saved checkpoint and resume from epoch 3, I get the OOM error immediately. @medabalimi I have reduced max-tokens, but it doesn't seem to help.

medabalimi commented 3 years ago

@ArtemisZGL did you set the --empty-cache-freq option?

ArtemisZGL commented 3 years ago

@medabalimi Yes, I just set it to 100, but I'm confused that I can't find it anywhere in the code by searching, and it didn't solve the problem. Maybe my checkout has some problem; I just cloned it from GitHub a few days ago. I also followed the CTC model evaluation command and hit many parser errors, but I did successfully fine-tune the small model with CTC.

medabalimi commented 3 years ago

I suggested the --empty-cache-freq option because it helped me with OOM issues. It clears the PyTorch cache at specified intervals, at the cost of some speed. I'm assuming you've installed NVIDIA's Apex as well. What is the checkpoint size?
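For intuition, here is a minimal sketch of what an option like --empty-cache-freq amounts to in plain PyTorch (the helper name is made up and this is not fairseq's actual code, just the idea):

    import torch

    def maybe_empty_cache(num_updates: int, empty_cache_freq: int) -> None:
        # Release cached, currently-unused CUDA blocks back to the driver every
        # `empty_cache_freq` updates. This does not shrink live tensors; it only
        # reduces fragmentation, which is why it trades a bit of speed for fewer
        # fragmentation-style OOMs.
        if empty_cache_freq > 0 and num_updates % empty_cache_freq == 0:
            if torch.cuda.is_available():
                torch.cuda.empty_cache()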

ArtemisZGL commented 3 years ago

@medabalimi Thanks for your reply. I don't think I installed Apex; will that have any influence? I set the "--empty-cache-freq" flag to 100, but it doesn't seem to help: once I load the checkpoint from epoch 3, it just OOMs immediately. I even tried using 2 GPUs to train from scratch, but it still OOMs in epoch 3. The checkpoint size is 3.6 GB; is that normal? Thanks.

And I'm confused about why the OOM shows up only after 3 epochs; if my batch size were too big, shouldn't it OOM in the first epoch?

Today I tried increasing max-tokens with the small model and also hit an OOM after 36 epochs. Once I load the checkpoint to resume, it shows the OOM again, stuck at the stage "attempting to recover from OOM in forward/backward pass". The small checkpoint is 1.1 GB.

After changing the random seed, I can resume training from the checkpoint again...

yash-s20 commented 3 years ago

Any resolution for this? I'm facing the same issue after 4 epochs. I even increased the number of GPUs I was training on, but even so, resuming at the 4th epoch fails immediately. If changing the seed helps for a few epochs, then there must be some seed for which a batch is too big for any GPU to handle. Even when resumed, the seed is set (differently, but deterministically) according to the epoch number. Am I interpreting this right?
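Roughly, the per-epoch shuffling I have in mind looks like this (a simplified sketch, not fairseq's exact iterator code; the seed + epoch combination is an assumption for illustration):

    import numpy as np

    def shuffled_batch_order(num_batches: int, seed: int, epoch: int) -> list:
        # The permutation depends only on (seed, epoch), so resuming at epoch N
        # replays exactly epoch N's batch order -- including any unluckily large
        # batch -- while changing --seed produces a different order.
        rng = np.random.RandomState(seed + epoch)
        return rng.permutation(num_batches).tolist()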

ArtemisZGL commented 3 years ago

@yash-s20 my solution was to reduce max-tokens and set --update-freq higher to keep the effective batch size. But I don't know the real reason behind this problem.

alealv commented 3 years ago

I'm also facing OOM errors when Fine-Tunning the model.

The server has 10 GPUs, but nvidia-smi doesn't show the memory being exhausted; it reports that only between 10% and 15% of it is in use.

I'm using:

2020-11-09 19:24:02 | INFO | fairseq_cli.train | model: Wav2VecCtc
2020-11-09 19:24:02 | INFO | fairseq_cli.train | criterion: CtcCriterion)
2020-11-09 19:24:02 | INFO | fairseq_cli.train | num. model params: 315471520 (num. trained: 315471520)
2020-11-09 19:24:09 | INFO | fairseq.utils | ***********************CUDA enviroments for all 10 workers***********************
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   1: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   2: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   3: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   4: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   5: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   6: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   7: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   8: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | rank   9: capabilities =  7.5  ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-11-09 19:24:09 | INFO | fairseq.utils | ***********************CUDA enviroments for all 10 workers***********************
2020-11-09 19:24:09 | INFO | fairseq_cli.train | training on 10 devices (GPUs/TPUs)
2020-11-09 19:24:09 | INFO | fairseq_cli.train | max tokens per GPU = 1280000 and batch size per GPU = 10
2020-11-09 19:24:09 | INFO | fairseq.trainer | no existing checkpoint found /home/aalvarez/trainings/checkpoint_last.pt                                                                                              
2020-11-09 19:24:09 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-09 19:24:10 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 1467296, skipped 0 samples
2020-11-09 19:24:14 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-09 19:24:14 | INFO | fairseq.trainer | begin training epoch 1
2020-11-09 19:25:02 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-09 19:25:03 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 32.0
2020-11-09 19:25:05 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 16.0
2020-11-09 19:25:07 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 8.0
2020-11-09 19:25:12 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 644.00 MiB (GPU 8; 10.76 GiB total capacity; 7.38 GiB already allocated; 467.56 MiB free; 7.40 GiB reserved in total by PyTorch)
2020-11-09 19:25:12 | WARNING | fairseq.trainer | |===========================================================================|                                                                                      
|                  PyTorch CUDA memory summary, device ID 8                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 3         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    6917 MB |    7561 MB |  101651 MB |   94733 MB |
|       from large pool |    6913 MB |    7557 MB |  101539 MB |   94625 MB |
|       from small pool |       3 MB |       9 MB |     112 MB |     108 MB |
|---------------------------------------------------------------------------|
| Active memory         |    6917 MB |    7561 MB |  101651 MB |   94733 MB |
|       from large pool |    6913 MB |    7557 MB |  101539 MB |   94625 MB |
|       from small pool |       3 MB |       9 MB |     112 MB |     108 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    7578 MB |    7942 MB |   18258 MB |   10680 MB |
|       from large pool |    7570 MB |    7932 MB |   18228 MB |   10658 MB |
|       from small pool |       8 MB |      10 MB |      30 MB |      22 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   16790 KB |    1242 MB |  113718 MB |  113701 MB |
|       from large pool |   12584 KB |    1242 MB |  113566 MB |  113554 MB |
|       from small pool |    4206 KB |       4 MB |     151 MB |     147 MB |
|---------------------------------------------------------------------------|
| Allocations           |     458    |     886    |   11608    |   11150    |
|       from large pool |     159    |     306    |    6479    |    6320    |
|       from small pool |     299    |     582    |    5129    |    4830    |
|---------------------------------------------------------------------------|
| Active allocs         |     458    |     886    |   11608    |   11150    |
|       from large pool |     159    |     306    |    6479    |    6320    |
|       from small pool |     299    |     582    |    5129    |    4830    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      44    |      79    |     196    |     152    |
|       from large pool |      40    |      74    |     181    |     141    |
|       from small pool |       4    |       5    |      15    |      11    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      10    |      32    |    4114    |    4104    |
|       from large pool |       2    |      30    |    3470    |    3468    |
|       from small pool |       8    |      12    |     644    |     636    |
|===========================================================================|

This is how I'm launching the training:

python -W ignore train.py /mnt/data/ale/manifest_ml2 \
--save-dir ~/trainings \
--wer-args '("/mnt/data/ale/kenlm-models/openwebtext/v5/4-gram-265M-25-10-2020-pruned-300K-1.bin","/mnt/data/ale/manifest_ml2/dev.lex",2,-1)' \
--post-process letter  \
--valid-subset dev \
--best-checkpoint-metric wer  \
--num-workers 20 \
--max-update 80000  \
--sentence-avg  \
--task audio_pretraining  \
--arch wav2vec_ctc  \
--w2v-path /mnt/data/ale/models/wav2vec_vox_new.pt \
--labels ltr  \
--apply-mask  \
--mask-selection static  \
--mask-other 0  \
--mask-length 10  \
--mask-prob 0.5  \
--layerdrop 0.1 \
--mask-channel-selection static  \
--mask-channel-other 0  \
--mask-channel-length 64  \
--mask-channel-prob 0.5  \
--zero-infinity \
--feature-grad-mult 0.0  \
--freeze-finetune-updates 10000  \
--validate-after-updates 10000  \
--optimizer adam \
--adam-betas '(0.9, 0.98)'  \
--adam-eps 1e-08  \
--lr 2e-05  \
--lr-scheduler tri_stage  \
--warmup-steps 8000  \
--hold-steps 32000 \
--decay-steps 40000  \
--final-lr-scale 0.05  \
--final-dropout 0.0  \
--dropout 0.0  \
--activation-dropout 0.1  \
--criterion ctc \
--attention-dropout 0.0  \
--max-tokens 100000  \
--seed 2337  \
--log-format json  \
--log-interval 500  \
--ddp-backend no_c10d \
--normalize \
--batch-size 10 \
--fp16 \
--empty-cache-freq 100

What's my environment?

fairseq Version - master
PyTorch Version - 1.7.0+cuda11
OS - Ubuntu 20.04 LTS
Python version - 3.8
CUDA Version - 11.1
GPU models and configuration - 10x RTX 2080 Ti

How you installed fairseq -

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

mychiux413 commented 3 years ago

I noticed that long audio files might have a chance of causing OOM, but not in every epoch. Try --max-sample-size 160000 or even smaller to make sure the OOM issue is not due to long audio samples; if the OOM doesn't occur again, then you should skip those long audio files according to your device's capacity.

To train a model as large as wav2vec_vox_new.pt, my hardware (GTX 1080 8GB x 2) can only afford --max-sample-size 208000, which is 13 seconds of audio, with batch-size=1 and without --fp16. I also notice that --batch-size-valid can still be 8, so the training update step seems to cost much more memory than validation.
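If skipping long files is acceptable, a small helper along these lines can pre-filter the train .tsv (this assumes the usual wav2vec manifest layout of a root-directory header line followed by relative_path<TAB>num_samples rows; the script and thresholds are just examples, not part of fairseq):

    import sys

    def filter_manifest(in_tsv: str, out_tsv: str, max_samples: int = 160000) -> None:
        # Keep only utterances with at most `max_samples` samples; at 16 kHz,
        # 160000 samples corresponds to 10 seconds of audio.
        with open(in_tsv) as fin, open(out_tsv, "w") as fout:
            fout.write(fin.readline())  # first manifest line is the root directory
            for line in fin:
                path, num_samples = line.rstrip("\n").split("\t")
                if int(num_samples) <= max_samples:
                    fout.write(line)

    if __name__ == "__main__":
        # e.g. python filter_manifest.py train.tsv train_short.tsv 160000
        filter_manifest(sys.argv[1], sys.argv[2],
                        int(sys.argv[3]) if len(sys.argv) > 3 else 160000)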

008karan commented 3 years ago

@mychiux413 is your training going fine on 2*1080? I also have a similar system.

mychiux413 commented 3 years ago

@008karan I am still tracking it, and if it crashes again, I will report it.

alealv commented 3 years ago

Thank you very much for your comment @mychiux413!! It seems to be working now.

Although, I'm still confused:

  1. The dataset loading documentation mentions --max-tokens. Is this still valid? I'm now thinking it was written for a different data loading class.
  2. What happens with audio files that are longer than max-sample-size?
  3. How can I estimate how much memory a given configuration will occupy, or is it only a matter of trial and error?
  4. And why is the documentation so old and bad? To succeed in running almost anything, you need to dig through every issue on GitHub.

Tracing through audio_pretraining.py, it uses FileAudioDataset, which inherits from RawAudioDataset, which has a num_tokens function that returns min(size, max_sample_size), but I don't know where or how this is used.
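For what it's worth, my rough mental model of how such a num_tokens value could drive batching (a simplified sketch, ignoring padding and the other constraints the real batch_by_size applies):

    def batch_by_max_tokens(sizes, max_tokens, max_sample_size):
        # Each utterance contributes min(size, max_sample_size) "tokens" (audio
        # samples), mirroring what num_tokens reports for cropped files; a batch
        # is closed once adding the next utterance would exceed the per-GPU
        # --max-tokens budget.
        batches, current, current_tokens = [], [], 0
        for idx, size in enumerate(sizes):
            num_tokens = min(size, max_sample_size)
            if current and current_tokens + num_tokens > max_tokens:
                batches.append(current)
                current, current_tokens = [], 0
            current.append(idx)
            current_tokens += num_tokens
        if current:
            batches.append(current)
        return batches

    # e.g. three 12 s files and one 30 s file at 16 kHz, 10 s crop, 500k budget
    print(batch_by_max_tokens([192000, 192000, 192000, 480000], 500000, 160000))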

mychiux413 commented 3 years ago

@alealv

  1. It should be valid, because to train wav2vec_vox_new.pt I have to set --max-tokens=500000, or it will still OOM.
  2. Those files are cropped (see self.crop_to_max_size): every file larger than --max-sample-size is cut down to that size, so the best approach is to never put very long audio in the manifest file in the first place (see the sketch after this list).
  3. No idea; I haven't studied how memory grows with multi-head self-attention for different sequence lengths.
  4. It frustrated me too.
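A minimal sketch of the cropping behaviour described in point 2, modeled loosely on RawAudioDataset.crop_to_max_size (details may differ from the current fairseq code):

    import numpy as np

    def crop_to_max_size(wav: np.ndarray, max_sample_size: int) -> np.ndarray:
        # Long utterances are not skipped: a random window of max_sample_size
        # samples is taken, so the rest of the recording is simply never seen.
        diff = len(wav) - max_sample_size
        if diff <= 0:
            return wav
        start = np.random.randint(0, diff + 1)
        return wav[start:start + max_sample_size]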

Update: --batch-size-valid=8 with --max-sample-size=208000 can still OOM in the validation phase, so I've set --batch-size-valid=1 for safety now.

mychiux413 commented 3 years ago

Update: memory usage increases once the number of updates exceeds --freeze-finetune-updates, so on a first try we should set --freeze-finetune-updates=0 to make sure training won't OOM after unfreezing; maybe that's why the OOM occurred only after a few epochs, depending on your data size. Also, if we enable --update-freq > 1, the distributed process increases GPU memory usage as well.
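One way to see why memory jumps at that point: gradients and Adam's moment buffers are only allocated for parameters that currently require gradients, so the training-state footprint grows sharply once the wav2vec encoder is unfrozen. A rough, hedged estimate helper (not fairseq code):

    import torch

    def trainable_param_bytes(model: torch.nn.Module) -> int:
        # Bytes held by parameters that require gradients. Multiplying by roughly
        # 3-4x (gradient + Adam exp_avg + exp_avg_sq) gives a ballpark for the
        # extra training-state memory that appears after --freeze-finetune-updates.
        return sum(p.numel() * p.element_size()
                   for p in model.parameters() if p.requires_grad)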

With GTX 1080 (8GB) x 2, I can only keep audio shorter than 10 seconds to prevent OOM, which drops too much of the dataset. Can anyone share the affordable audio duration on GPUs with more than 8GB?

alealv commented 3 years ago

Thanks for the update @mychiux413

I've also been experimenting with a small training set, and I noticed something really weird.

Configuration

Expected output

I believe that with this configuration I should get 8 log lines, one for each sample in the training set, then a validation step that outputs 6 log lines, and finally the updated parameters followed by a checkpoint.

Output

Instead, I get just 1 log line each for the training step and the validation step, and on those lines the nsentences parameter equals the training-set size and the dev-set size respectively.

2020-11-20 18:37:12 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1                                                                           
2020-11-20 18:37:12 | INFO | fairseq.trainer | no existing checkpoint found /home/aalvarez/trainings/checkpoint_last.pt                                                         
2020-11-20 18:37:12 | INFO | fairseq.trainer | loading train data for epoch 1                                                                                                   
2020-11-20 18:37:12 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 8, skipped 0 samples                                                                                 
2020-11-20 18:37:12 | INFO | fairseq.optim.adam | using FusedAdam                                                                                                               
2020-11-20 18:37:12 | INFO | fairseq.trainer | begin training epoch 1                                                                                                           
2020-11-20 18:37:22 | INFO | train_inner | {"epoch": 1, "update": 1.0, "loss": "255.939", "ntokens": "40", "nsentences": "8", "nll_loss": "51.188", "wps": "0", "ups": "0", "wpb": "40", "bsz": "8", "num_updates": "1", "lr": "2.02475e-07", "gnorm": "171.406", "loss_scale": "128", "train_wall": "2", "wall": "11"}
2020-11-20 18:37:22 | INFO | fairseq_cli.train | begin validation on "dev" subset                                                                                               
2020-11-20 18:37:48 | INFO | dev | {"epoch": 1, "dev_loss": "188.179", "dev_ntokens": "31", "dev_nsentences": "6", "dev_nll_loss": "36.422", "dev_uer": "100", "dev_wer": "100", "dev_raw_wer": "100", "dev_wps": "0", "dev_wpb": "31", "dev_bsz": "6", "dev_num_updates": "1"}
2020-11-20 18:37:48 | INFO | fairseq_cli.train | begin save checkpoint                                                                                                          
2020-11-20 18:38:05 | INFO | fairseq.checkpoint_utils | saved checkpoint /home/aalvarez/trainings/checkpoint1.pt (epoch 1 @ 1 updates, score 100.0) (writing took 16.460464287083596 seconds)
2020-11-20 18:38:05 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)                                                                                     
2020-11-20 18:38:05 | INFO | train | {"epoch": 1, "train_loss": "255.939", "train_ntokens": "40", "train_nsentences": "8", "train_nll_loss": "51.188", "train_wps": "0", "train_ups": "0", "train_wpb": "40", "train_bsz": "8", "train_num_updates": "1", "train_lr": "2.02475e-07", "train_gnorm": "171.406", "train_loss_scale": "128", "train_train_wall": "2", "train_wall": "53"}
2020-11-20 18:38:05 | INFO | fairseq.trainer | begin training epoch 2                                                                                                         

So I believe that fairseq is actually loading more samples than indicated in batch-size, and that, combined with large audio files, could make everything explode.

I haven't dug into the code yet to understand it. Maybe @myleott or @alexeib can explain what's going on here.

LiNaihan commented 3 years ago

Hi folks,

I encountered the same issue, and I just solved it. In my case, the reason for the OOM when reloading is that the checkpoint is loaded to the CPU first and then broadcast to the other processes by inter-process communication:

The reloading code is at fairseq/trainer.py, line 323:

        if bexists:
            if self.data_parallel_rank == 0:
                state = checkpoint_utils.load_checkpoint_to_cpu(filename)
                last_optim_state = state.get("last_optimizer_state", None)

                # If doing zero_sharding, do not broadcast global optimizer
                # state. Later we will broadcast sharded states to each rank
                # to avoid memory from exploding.
                if (
                        self.cfg.distributed_training.zero_sharding == "os"
                        and "last_optimizer_state" in state
                        and self.data_parallel_world_size > 1
                ):
                    state["last_optimizer_state"] = "SHARDED"
            else:
                last_optim_state = None
                state = None

            if self.data_parallel_world_size > 1:
                group = (
                    self.data_parallel_process_group
                    if self.data_parallel_process_group is not None
                    else torch.distributed.group.WORLD
                )
                state = distributed_utils.broadcast_object(
                    state,
                    src_rank=0,
                    group=group,
                )
                if self.data_parallel_rank > 0:
                    last_optim_state = state.get("last_optimizer_state", None)

Traceback (most recent call last):
  File "/data/e2easr2/v-naili/proj/fairseq/examples/wav2vec/../../train.py", line 14, in <module>
    cli_main()
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq_cli/train.py", line 391, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/distributed_utils.py", line 311, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/distributed_utils.py", line 289, in distributed_main
    main(cfg, **kwargs)
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq_cli/train.py", line 120, in main
    disable_iterator_cache=task.has_sharded_data("train"),
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/checkpoint_utils.py", line 196, in load_checkpoint
    reset_meters=reset_meters,
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/trainer.py", line 368, in load_checkpoint
    self.optimizer.load_state_dict(last_optim_state, optimizer_overrides)
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/optim/fp16_optimizer.py", line 88, in load_state_dict
    self.fp32_optimizer.load_state_dict(state_dict, optimizer_overrides)
  File "/data/e2easr2/v-naili/proj/fairseq/fairseq/optim/fairseq_optimizer.py", line 86, in load_state_dict
    self.optimizer.load_state_dict(state_dict)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/optimizer.py", line 105, in load_state_dict
    state_dict = deepcopy(state_dict)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 52, in __deepcopy__
    new_storage = self.storage().__deepcopy__(memo)
  File "/opt/conda/lib/python3.7/site-packages/torch/storage.py", line 28, in __deepcopy__
    new_storage = self.clone()
  File "/opt/conda/lib/python3.7/site-packages/torch/storage.py", line 44, in clone
    return type(self)(self.size()).copy_(self)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 1.18 GiB (GPU 0; 22.38 GiB total capacity; 2.94 GiB already allocated; 1.01 GiB free; 2.96 GiB reserved in total by PyTorch)

state = distributed_utils.broadcast_object should (1) send the checkpoint to a buffer if the current process rank is 0, and (2) otherwise receive the checkpoint from that buffer. However, if you inspect the checkpoint loaded from the buffer, you find that the tensors in "last_optim_state" (last_optim_state = state.get("last_optimizer_state", None)) are all on cuda:0! I don't know why this happens, but when reloading, each process deepcopies that state, as shown in the exception traceback, and therefore the OOM occurs on cuda:0.

My solution is very simple: avoid the inter-process communication and load the checkpoint from disk in each process!

        if bexists:
            state = checkpoint_utils.load_checkpoint_to_cpu(filename)
            last_optim_state = state.get("last_optimizer_state", None)
            # if self.data_parallel_rank == 0:
            #     state = checkpoint_utils.load_checkpoint_to_cpu(filename)
            #     last_optim_state = state.get("last_optimizer_state", None)
            #
            #     # If doing zero_sharding, do not broadcast global optimizer
            #     # state. Later we will broadcast sharded states to each rank
            #     # to avoid memory from exploding.
            #     if (
            #             self.cfg.distributed_training.zero_sharding == "os"
            #             and "last_optimizer_state" in state
            #             and self.data_parallel_world_size > 1
            #     ):
            #         state["last_optimizer_state"] = "SHARDED"
            # else:
            #     last_optim_state = None
            #     state = None
            #
            # if self.data_parallel_world_size > 1:
            #     group = (
            #         self.data_parallel_process_group
            #         if self.data_parallel_process_group is not None
            #         else torch.distributed.group.WORLD
            #     )
            #     state = distributed_utils.broadcast_object(
            #         state,
            #         src_rank=0,
            #         group=group,
            #     )
            #     if self.data_parallel_rank > 0:
            #         last_optim_state = state.get("last_optimizer_state", None)
            #     # check_ckpt_device(self, last_optim_state)
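
An alternative (untested) idea, instead of re-reading the checkpoint in every process, would be to push the broadcast tensors back to CPU before they reach load_state_dict; a sketch of such a helper (hypothetical, not part of fairseq):

    import torch

    def state_to_cpu(obj):
        # Recursively move any tensors in a nested checkpoint/optimizer state to
        # CPU, so the later deepcopy inside optimizer.load_state_dict does not
        # allocate everything on cuda:0.
        if torch.is_tensor(obj):
            return obj.cpu()
        if isinstance(obj, dict):
            return {k: state_to_cpu(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(state_to_cpu(v) for v in obj)
        return obj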

alexeib commented 3 years ago

the OOM bug during validation is because max_tokens_valid isn't being correctly defaulted to max_tokens. I will submit a fix; meanwhile just provide --max-tokens-valid and set it to the same value as --max-tokens

alealv commented 3 years ago

So I believe that fairseq is actually loading more samples than indicated in batch-size, and that, combined with large audio files, could make everything explode.

Update

I discovered that batch-size is the number of samples loaded per GPU, which is really misleading. So care must be taken with this value, because all the inputs are loaded first to one GPU and then spread to the others, as explained here

alexeib commented 3 years ago

right, the same is true for max-tokens - these are all per-GPU values and must be multiplied by the number of GPUs being used and also by update_freq to compute the effective batch size
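In other words (a tiny sketch of that arithmetic, using the numbers from the earlier log as an example):

    def effective_batch_size(per_gpu_batch_size: int, num_gpus: int, update_freq: int = 1) -> int:
        # --batch-size (and --max-tokens) are per-GPU budgets; one optimizer step
        # consumes per-GPU value x number of GPUs x --update-freq samples (or tokens).
        return per_gpu_batch_size * num_gpus * update_freq

    # e.g. --batch-size 10 on 10 GPUs with the default --update-freq 1
    print(effective_batch_size(10, 10, 1))  # -> 100 sentences per update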

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!