facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

OOM when training translation models with a very large multilingual dataset. #4299

Open SefaZeng opened 2 years ago

SefaZeng commented 2 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to train a multilingual translation model on datasets for several languages. Fairseq loads all of the data into memory, so training fails because I have about 1 TB of data after preprocessing with fairseq. I want to do the language resampling dynamically, so I would rather not concatenate all the data and split it into data1, data2, data3, ... (that also takes a very long time whenever I change the dataset for one language). Is there a way to train a multilingual translation model with the translation_multi_simple_epoch task and load the dataset in a streaming way to reduce the RAM requirements?

Code

Preprocess script like this:

langs="zh,en,de,fr,it,pt,ru,es"
for lang in ${langs//,/ }    # turn the comma-separated list into words
do
    vocab=vocab.txt
    output_dir=/workspace/data/fairseq_bin
    src=${lang}
    tgt=en

    # ----- binarize ${src}-${tgt} -----
    /home/user/miniconda/bin/python3 $code_dir/fairseq_cli/preprocess.py --source-lang $src --target-lang $tgt \
        --destdir $output_dir \
        --srcdict $work_dir/$vocab \
        --tgtdict $work_dir/$vocab \
        --workers 20 \
        --validpref $work_dir/data_for_fair/valid \
        --trainpref $work_dir/data_for_fair/train_${src}${tgt}
done

Train script:

OMP_NUM_THREADS=20 \
/home/user/miniconda/bin/python3 -m torch.distributed.launch --nproc_per_node=8 \
 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=$MASTER_ADDR \
 --master_port=$MASTER_PORT \
  $fairseq_dir/train.py \
  $data_dir \
 --task translation_multi_simple_epoch \
 --sampling-method "temperature" \
 --sampling-temperature 1.5 \
 --encoder-langtok "src" \
 --decoder-langtok \
 --langs "$lang_list" \
 --lang-pairs "$lang_pairs" \
 --save-dir $output_dir \
 --arch transformer \
 --attention-dropout 0.1 \
 --activation-dropout 0.1 \
 --dropout 0.1 \
 --encoder-layers 20 \
 --decoder-layers 20 \
 --encoder-embed-dim 1024 \
 --decoder-embed-dim 1024 \
 --encoder-attention-heads 16 \
 --decoder-attention-heads 16 \
 --encoder-ffn-embed-dim 4096 \
 --decoder-ffn-embed-dim 4096 \
 --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-8 --clip-norm 0.0 \
 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
 --warmup-init-lr 1e-07 \
 --weight-decay 0.0001 \
 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
 --max-tokens 4096 \
 --update-freq 4 \
 --log-interval 100 \
 --save-interval-updates 2000 \
 --skip-invalid-size-inputs-valid-test \
 --save-interval 100000000000 \
 --num-workers 1 \
 --seed 42 \
 --fp16  \
 --ddp-backend=no_c10d \
 2>&1 |tee ${LOG_FILE}

What have you tried?

What's your environment?

gmryu commented 2 years ago

Instead of passing one folder as $data_dir, like fairseq-train /data-bin/, pass multiple folders, like fairseq-train chat1:chat2:chat3:...:chatN, and make sure each chat_i folder has its own train.bin/.idx. Fairseq will then load only one folder at a time, in what is called "round-robin" fashion. The point is to use : to separate the folders (each containing .bin and .idx files).

For validation, only the first training folder's valid.bin/.idx is used, so you have to put valid.bin/.idx inside chat1 (for this example).

chat1, chat2, ... are just arbitrary folder names. Since you want a multilingual model, you may want every chat folder to contain data for every language, because fairseq reads the folders in the given order. In this example it trains on chat1's train.bin, then chat2's, and no folder shuffling is done by fairseq.
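
For illustration, a minimal sketch of what the colon-separated invocation could look like; the shard directory names and count are placeholders, and the distributed-launch wrapper plus the remaining flags from the training script above are omitted for brevity:

# Sketch: point fairseq-train at several shard directories joined with ":".
# Assumed layout: each shardN directory holds its own train.bin/.idx, and
# shard0 additionally holds valid.bin/.idx (only the first directory's
# validation data is used).
data_dir=/workspace/data/fairseq_bin
shards=$data_dir/shard0:$data_dir/shard1:$data_dir/shard2   # extend as needed

fairseq-train "$shards" \
  --task translation_multi_simple_epoch \
  --sampling-method temperature --sampling-temperature 1.5 \
  --encoder-langtok src --decoder-langtok \
  --langs "$lang_list" --lang-pairs "$lang_pairs" \
  --arch transformer --max-tokens 4096 --fp16
  # ...plus the rest of the flags from the training script above

With this layout fairseq advances to the next directory at each epoch boundary instead of loading everything at once, per the round-robin behaviour described above.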

SefaZeng commented 2 years ago

Instead of passing one folder as $data_dir, like fairseq-train /data-bin/, pass multiple folders, like fairseq-train chat1:chat2:chat3:...:chatN, and make sure each chat_i folder has its own train.bin/.idx. Fairseq will then load only one folder at a time, in what is called "round-robin" fashion. The point is to use : to separate the folders (each containing .bin and .idx files).

For validation, only the first training folder's valid.bin/.idx is used, so you have to put valid.bin/.idx inside chat1 (for this example).

chat1, chat2, ... are just arbitrary folder names. Since you want a multilingual model, you may want every chat folder to contain data for every language, because fairseq reads the folders in the given order. In this example it trains on chat1's train.bin, then chat2's, and no folder shuffling is done by fairseq.

Hi @gmryu, I have split the dataset into shards, and each shard is about 170 GB. I am trying to train a model on 2 nodes, each with 8 A100s and 700 GB of memory. Loading one shard consumes over 600 GB of memory, and when the next shard is loaded the memory used for the first one does not seem to be released, so an out-of-memory error occurs because it would need well over 1.2 TB of memory. Should I split the dataset into smaller shards?

gmryu commented 2 years ago

Are you using something like --num-shards/--shard-id? Those are different from :. If you use :, the memory for the previous shard should be released.

Anyway, in this case 170 GB per epoch is way too much. I would say even 1 GB per epoch is already a lot.
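
To get smaller shards, one option is to split the raw parallel text by line count before binarizing, so that source and target stay aligned and each shard directory ends up with its own train.bin/.idx. A rough sketch for a single language pair, where all paths, names, and the shard size are assumptions rather than anything from the scripts above:

#!/usr/bin/env bash
# Sketch: split one language pair's raw data into fixed-size shards and
# binarize each shard into its own directory (shard000, shard001, ...).
set -euo pipefail

src=zh; tgt=en
work_dir=/workspace/data                         # placeholder layout
raw=$work_dir/data_for_fair/train_${src}${tgt}   # expects $raw.$src and $raw.$tgt
vocab=$work_dir/vocab.txt
lines_per_shard=2000000                          # tune so one shard fits in RAM

mkdir -p $work_dir/tmp
# Split source and target with the same line count so the pairs stay aligned.
split -d -a 3 -l $lines_per_shard $raw.$src $work_dir/tmp/train.$src.
split -d -a 3 -l $lines_per_shard $raw.$tgt $work_dir/tmp/train.$tgt.

for part in $work_dir/tmp/train.$src.*; do
    i=${part##*.}                                # three-digit shard index
    shard_tmp=$work_dir/tmp/shard$i
    mkdir -p $shard_tmp
    # fairseq-preprocess expects <trainpref>.$src and <trainpref>.$tgt
    mv $work_dir/tmp/train.$src.$i $shard_tmp/train.$src
    mv $work_dir/tmp/train.$tgt.$i $shard_tmp/train.$tgt
    fairseq-preprocess --source-lang $src --target-lang $tgt \
        --trainpref $shard_tmp/train \
        --destdir $work_dir/fairseq_bin/shard$i \
        --srcdict $vocab --tgtdict $vocab --workers 20
done
# The validation set still needs to be binarized into the first shard directory.

Repeating this per language pair and then pointing fairseq-train at shard000:shard001:... (as in the earlier sketch) should keep only one shard resident in memory at a time, per the behaviour described above.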

SefaZeng commented 2 years ago

Are you using something like --num-shards/--shard-id? Those are different from :. If you use :, the memory for the previous shard should be released.

Anyway, in this case 170 GB per epoch is way too much. I would say even 1 GB per epoch is already a lot.

Yes, I use : to link all the data shards together. So if the dataset is 4 TB, I need to split it into about 4,000 data shards, right?
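
For reference, a back-of-the-envelope check of that arithmetic (4 TB at roughly 1 GB per shard), with illustrative numbers only:

# Back-of-the-envelope shard count (illustrative numbers only).
total_gb=4000     # ~4 TB of data
shard_gb=1        # target size per shard
echo $(( (total_gb + shard_gb - 1) / shard_gb ))   # prints 4000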