facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

out of memory error while training translation model #1254

Closed. allhelllooz closed this issue 4 years ago.

allhelllooz commented 5 years ago

I am getting the following error on a p3.2xlarge machine with a single 16 GB GPU. Training ran properly when I had ~400k sentence pairs. I added more data, so it is now ~2000k sentence pairs, and it no longer runs. I trimmed out sentences longer than 64 tokens and set max-tokens to 64 as well, but it is still not working.

  1. Should I go for a multi-GPU approach?
  2. What parameters should I set so that this runs properly, given that I have 2000k sentence pairs with <64 tokens per sentence?
  3. What do the "already allocated" and "cached" memory figures in the error mean?

I am running the following command:

CUDA_VISIBLE_DEVICES=0 fairseq-train /data/translation_models/marathi_english_translation/mr_en_new/token_data --arch transformer_iwslt_de_en --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9,0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4096 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 64 --update-freq 8 --max-source-positions 64 --max-target-positions 64

Error logs

| epoch 001:   0%| | 1/85544 [00:01<44:15:59,  1.86s/it, loss=20.737, nll_loss=20.759, ppl=1774897.87, wps=5, ups=0, wpb=176.000, bsz=9.000, num_updates=1, lr=1.2207e-07, gnorm=0.000, clip=0.000, oom=5.000, wall=33, train_wall=2]| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 5.13 GiB (GPU 0; 15.75 GiB total capacity; 8.89 GiB already allocated; 2.39 GiB free; 3.43 GiB cached);
 Skipping batch
| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 180.00 MiB (GPU 0; 15.75 GiB total capacity; 14.57 GiB already allocated; 94.81 MiB free; 50.29 MiB cached);
 Skipping batch
| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 5.13 GiB (GPU 0; 15.75 GiB total capacity; 8.89 GiB already allocated; 2.39 GiB free; 3.43 GiB cached);
 Skipping batch
| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 15.75 GiB total capacity; 14.54 GiB already allocated; 124.81 MiB free; 52.72 MiB cached);
 Skipping batch
| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 5.13 GiB (GPU 0; 15.75 GiB total capacity; 8.89 GiB already allocated; 2.39 GiB free; 3.43 GiB cached);
 Skipping batch
Traceback (most recent call last):
  File "/root/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/data/translation_models/fairseq/fairseq_cli/train.py", line 327, in cli_main
    main(args)
  File "/data/translation_models/fairseq/fairseq_cli/train.py", line 81, in main
    train(args, trainer, task, epoch_itr)
  File "/data/translation_models/fairseq/fairseq_cli/train.py", line 122, in train
    log_output = trainer.train_step(samples)
  File "/data/translation_models/fairseq/fairseq/trainer.py", line 405, in train_step
    self.optimizer.step()
  File "/data/translation_models/fairseq/fairseq/optim/fairseq_optimizer.py", line 98, in step
    self.optimizer.step(closure)
  File "/data/translation_models/fairseq/fairseq/optim/adam.py", line 140, in step
    state['exp_avg'] = torch.zeros_like(p_data_fp32)
RuntimeError: CUDA out of memory. Tried to allocate 5.13 GiB (GPU 0; 15.75 GiB total capacity; 14.02 GiB already allocated; 644.81 MiB free; 62.01 MiB cached)

The following are the other initial logs:

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_iwslt_de_en', attention_dropout=0.0, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='/data/translation_models/marathi_english_translation/mr_en_new/token_data', dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, layer_wise_attention=False, lazy_load=False, left_pad_source='True', left_pad_target='False', load_alignments=False, log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=64, max_target_positions=64, max_tokens=64, max_tokens_valid=64, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_token_positional_embeddings=False, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', update_freq=[8], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=-1, warmup_updates=4096, weight_decay=0.0001)
| [mr] dictionary: 2691064 types
| [en] dictionary: 922896 types
| loaded 20248 examples from: /data/translation_models/marathi_english_translation/mr_en_new/token_data/valid.mr-en.mr
| loaded 20248 examples from: /data/translation_models/marathi_english_translation/mr_en_new/token_data/valid.mr-en.en
| /data/translation_models/marathi_english_translation/mr_en_new/token_data valid mr-en 20248 examples

and

| model transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 1881890816 (num. trained: 1881890816)
| training on 1 GPUs
| max tokens per GPU = 64 and max sentences per GPU = None
| no existing checkpoint found checkpoints/checkpoint_last.pt 
| loading train data for epoch 0
| loaded 2024847 examples from: /data/translation_models/marathi_english_translation/mr_en_new/token_data/train.mr-en.mr
| loaded 2024847 examples from: /data/translation_models/marathi_english_translation/mr_en_new/token_data/train.mr-en.en
| /data/translation_models/marathi_english_translation/mr_en_new/token_data train mr-en 2024847 examples
allhelllooz commented 5 years ago

I read in other issues that we should keep the dictionary at ~50k. Why should this be a limit? Initially, when I ran with a combined dict size of ~600k, everything ran properly even with 1024 tokens. Now the combined size is 3.6 million ... is that a problem? Can multi-GPU solve this, or do I need a larger GPU (like 24 or 32 GB of VRAM) to train this? Can someone give me suggestions? Thanks.

bricksdont commented 5 years ago

Do you mean that you have a vocabulary size of 3.6 million? That would be a problem. How did you preprocess your data? Did you use BPE to segment the data?
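As a rough sanity check (back-of-envelope only, assuming the 512-dim embeddings that transformer_iwslt_de_en uses, per your Namespace dump), the embedding tables alone already account for nearly all of the ~1.88B parameters your log reports:

    # Back-of-envelope: embedding parameters implied by the reported dictionary sizes.
    embed_dim = 512        # transformer_iwslt_de_en encoder/decoder embedding dim
    src_types = 2691064    # [mr] dictionary
    tgt_types = 922896     # [en] dictionary

    embed_params = (src_types + tgt_types) * embed_dim
    print(f"{embed_params:,}")   # 1,850,347,520 of the 1,881,890,816 total params

So almost all of the GPU memory is going into vocabulary embeddings (plus the matching Adam optimizer state, which is where the traceback shows the OOM in adam.py). Shrinking the vocabulary will help far more than adding GPUs.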

allhelllooz commented 5 years ago

Yes. I did use BPE, and the dictionary sizes are still as follows:

| [mr] dictionary: 2691064 types
| [en] dictionary: 922896 types

The problem is that the Moses and NLTK tokenizers can't break up words in Devanagari script well; they work for Roman scripts. The BERT tokenizer is another option, or I could go character-level ... the model would be much smaller, but I am not sure how much training time it would need or what accuracy it would reach.

Anyway, using BPE with Moses still gives 922896 dictionary entries for English, which is also huge. Do you recommend anything else?

allhelllooz commented 5 years ago

Ahhh, got it. I only used the --tokenizer option, not --bpe, which has sentencepiece, subword_nmt, fastbpe, and gpt2 options to choose from. Which one should I go for? I will try it and update here about the model sizes in my case. Thanks @bricksdont for pointing this out.

bricksdont commented 5 years ago

All options are fine. But if I remember correctly, GPT produces a byte-level vocabulary, which is quite different from the other options.

I would recommend using sentencepiece; it might even work without tokenization.

The vocabulary size can be quite low; 32k is a reasonable value to try first.
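For example, something along these lines with the sentencepiece Python package (a minimal sketch; the file names, the 32k vocab size, and the options are illustrative, not a prescription):

    import sentencepiece as spm

    # Learn a 32k-piece model directly on the raw Marathi training text.
    # sentencepiece works on raw text, so no Moses/NLTK tokenization is needed first.
    spm.SentencePieceTrainer.Train(
        '--input=train.mr --model_prefix=spm_mr '
        '--vocab_size=32000 --character_coverage=1.0'
    )

You would train a second model for English (or a joint one on the concatenated text) the same way.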

allhelllooz commented 5 years ago

I found some strange results. I used the BERT tokenizer from huggingface, then ran preprocess without the --bpe option, and got the following dict sizes:

| [mr] Dictionary: 22967 types
| [mr] /data/translation_models/marathi_english_translation/mr_en_token/raw_data/train.mr: 2024847 sents, 95037672 tokens, 0.0% replaced by <unk>
| [en] Dictionary: 28167 types
| [en] /data/translation_models/marathi_english_translation/mr_en_token/raw_data/train.en: 2024847 sents, 51040154 tokens, 0.0% replaced by <unk>

Now my model has ~50 million params and takes 7-8 GB of GPU memory. That is good enough to work with.

Then I used sentencepiece with Moses, and subword-nmt with Moses, and got the following dict sizes:

| [mr] Dictionary: 2691063 types
| [mr] /data/translation_models/marathi_english_translation/mr_en_new/raw_data/train.mr: 2024847 sents, 32508211 tokens, 0.0% replaced by <unk>
| [en] Dictionary: 922895 types
| [en] /data/translation_models/marathi_english_translation/mr_en_new/raw_data/train.en: 2024847 sents, 36592253 tokens, 0.0% replaced by <unk>

With this, the model has ~2 billion params and takes >15 GB of GPU memory. I am not sure why the dict size didn't decrease in the case of sentencepiece or subword-nmt!

allhelllooz commented 4 years ago

Can someone give an update on this issue? It seems like a bug!

myleott commented 4 years ago

Can you share more details about how you ran sentencepiece or subword-nmt?

This script should be a useful reference for learning a vocab with sentencepiece: https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh#L100-L126
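For reference, the steps in that script essentially amount to: encode every split with the learned sentencepiece model first, and only then run fairseq-preprocess on the encoded files. A rough sketch of the encoding step (the paths and file names here are illustrative, not taken from the script):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load('spm_mr.model')

    # Rewrite the raw text as space-separated sentencepiece pieces
    # before binarizing with fairseq-preprocess.
    with open('train.mr') as fin, open('train.spm.mr', 'w') as fout:
        for line in fin:
            fout.write(' '.join(sp.EncodeAsPieces(line.strip())) + '\n')

If fairseq-preprocess is run on the raw, unencoded files instead, its dictionary is still built from whole word types, which would explain dictionary sizes in the millions even after learning a BPE/sentencepiece model.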

allhelllooz commented 4 years ago

I am using WordPiece for now and it's working well. I will take a look at the above script soon and update here. Closing the issue for now. Thanks, buddy.

NikhilCherian commented 4 years ago

@myleott @allhelllooz @bricksdont @ynd

Hello. I am currently using fairseq for grammatical error correction with translation models. I could run translation models like fconv and transformer on Google Colab, but when I switched over to my gaming laptop, I ran into a problem training the models with fairseq.

Currently, I run this command: python train.py data-bin/lang-8-fairseq2 --save-dir checkpoints/lang-8-fairseq-transformer2 --arch transformer_iwslt_de_en --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --max-tokens 4096 --fp16

It gives me the error shown in the attached screenshots.

I do not know why it is overflowing or how to get it running. Can somebody help me with this? Any help would be appreciated. Thanks in advance.