Open SefaZeng opened 2 years ago
Instead of passing a single folder as `$data_dir` (e.g. `fairseq-train /data-bin/`), pass several folders separated by colons, e.g. `fairseq-train chat1:chat2:chat3:...:chatN`, and make sure each `chat_i` folder has its own train.bin/train.idx.
Fairseq will then load only one of those folders at a time, in what is called a "round-robin" fashion. The point is to use `:` to separate the folders (each containing the .bin and .idx files).
For validation, only the first folder's valid.bin/valid.idx is used, so you have to put valid.bin/valid.idx inside chat1 (in this example).
chat1, chat2, ... are just arbitrary folder names. Since you want a multilingual model, you may want every chat folder to contain data for every language, because fairseq reads the folders in the given order: in this example it trains on chat1's train.bin, then chat2's, and fairseq does not shuffle the folder order.
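As a concrete illustration, the command could look roughly like this; the folder names, architecture, and optimizer settings below are placeholders rather than anything from this thread:

```bash
# Each folder holds its own train.bin/train.idx produced by fairseq-preprocess;
# valid.bin/valid.idx only needs to exist in the first folder (chat1).
fairseq-train chat1:chat2:chat3 \
    --task translation \
    --arch transformer \
    --optimizer adam --lr 5e-4 \
    --max-tokens 4096 \
    --save-dir checkpoints
```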
Hi @gmryu, I have split the dataset into shards of about 170 GB each. I am training on 2 nodes, each with 8 A100s and 700 GB of memory, and loading a single shard already consumes 600+ GB of memory. When the next shard is loaded, the memory used by the first one does not seem to be released, so an out-of-memory error occurs because that would need well over 1.2 TB. Should I split the dataset into smaller shards?
Are you using `--num-shards` / `--shard-id`? Those are different from `:`. If you use `:`, the memory of the previous shard should be released.
In any case, 170 GB per epoch is way too much. I would say even 1 GB per epoch is already too much.
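As a rough sketch of the distinction (shard names and training flags are hypothetical; the behaviour described in the comments only restates what this thread says):

```bash
# Colon-separated folders: fairseq rotates through them, loading one folder
# per epoch (epoch 1 -> shard0, epoch 2 -> shard1, ...) and, per the comment
# above, releasing the previously loaded one when it moves on.
fairseq-train data-bin/shard0:data-bin/shard1:data-bin/shard2 \
    --task translation --arch transformer \
    --optimizer adam --lr 5e-4 --max-tokens 4096

# --num-shards / --shard-id are a separate pair of dataset options and are
# not the same mechanism as the folder rotation above.
```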
Yes, I use `:` to link all the data shards together. So if the dataset is 4 TB, I need to split it into about 4,000 data shards, right?
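For reference, one way to end up with many small shard folders is to split the raw parallel text before binarizing and run fairseq-preprocess once per piece. This is only a sketch under assumed names (corpus.en/corpus.de, a single en-de pair, pre-built dictionaries, 2M lines per piece); a real multilingual setup would repeat it for each language pair:

```bash
# corpus.en and corpus.de are assumed to be line-aligned parallel files.
# Split both by the SAME number of lines so the pieces stay aligned;
# pick a line count that gives roughly 1 GB per binarized shard.
split -l 2000000 -d -a 3 --additional-suffix=.en corpus.en part.
split -l 2000000 -d -a 3 --additional-suffix=.de corpus.de part.

for src in part.*.en; do
    i=${src%.en}; i=${i#part.}          # 000, 001, 002, ...
    fairseq-preprocess \
        --source-lang en --target-lang de \
        --trainpref part.$i \
        --destdir data-bin/shard$i \
        --srcdict dict.en.txt --tgtdict dict.de.txt \
        --workers 16
done

# Only the first folder needs validation data (add --validpref to that run),
# then train with the folders joined by ':':
#   fairseq-train data-bin/shard000:data-bin/shard001:... <other flags>
```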
❓ Questions and Help
What is your question?
I am trying to train a multilingual translation model on datasets in several languages, but fairseq loads all the data into memory, so training fails: after preprocessing with fairseq I have about 1 TB of data. I want to do the language resampling dynamically, so I would prefer not to concatenate all the data and split it into data1, data2, data3, ... (that also takes a very large amount of time every time I change the dataset of one language). So, is there a way to train a multilingual translation model through the task translation_multi_simple_epoch and load the dataset in a streaming way to reduce the RAM requirements?

Code
Preprocess script like this:
Train script:
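(Neither script was captured in this thread. Purely to illustrate the kind of command being asked about, a translation_multi_simple_epoch run typically looks something like the sketch below; the data path, language pairs, and model/optimizer flags are assumptions, not the original poster's values.)

```bash
# data-bin could also be a colon-separated list of shard folders,
# as discussed in the replies above.
fairseq-train data-bin \
    --task translation_multi_simple_epoch \
    --lang-pairs en-de,en-fr,en-zh \
    --encoder-langtok src --decoder-langtok \
    --sampling-method temperature --sampling-temperature 1.5 \
    --arch transformer --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --lr 3e-4 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --update-freq 4 \
    --save-dir checkpoints
```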
What have you tried?
What's your environment?
How you installed fairseq (pip, source):