getalp / Flaubert

Unsupervised Language Model Pre-training for French

How can I train Flaubert on a different corpus (not Gutenberg or Wikipedia) for another domain? #9

Closed keloemma closed 4 years ago

keloemma commented 4 years ago

Good afternoon,

I tried to follow your instructions to train Flaubert on my own corpus in order to get a model I can use for my classification task, but I am having trouble understanding the procedure.

You said we should use this command to train on our preprocessed data:

/Flaubert$ python train.py \
    --exp_name flaubert_base_lower \
    --dump_path ./dumped/ \
    --data_path ./own_data/data/ \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 16 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'

I tried it after cloning Flaubert and installing all the necessary libraries, but I am getting this error:

FAISS library was not found. FAISS not available. Switching to standard nearest neighbors search implementation.
./own_data/data/train.fr.pth not found
./own_data/data/valid.fr.pth not found
./own_data/data/test.fr.pth not found
Traceback (most recent call last):
  File "train.py", line 387, in <module>
    check_data_params(params)
  File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
    assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
AssertionError

Does this mean I have to split my own data into three corpora (train, valid, and test) after preprocessing it? Should I run your preprocessing script on my data before executing the command?

formiel commented 4 years ago

Hi @keloemma,

Sorry for the unclear documentation. I have updated the README with detailed instructions and added scripts for splitting and preprocessing the data. Could you please try again?
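In case it helps to picture the splitting step, here is a minimal, generic sketch in Python (this is not the repo's actual script: the real pipeline also tokenizes, applies BPE, and binarizes each split into train.fr.pth / valid.fr.pth / test.fr.pth, and the 98/1/1 ratio below is arbitrary):

    import random

    # Generic sketch: split a raw monolingual corpus into train/valid/test files.
    # "corpus.fr" is a hypothetical input file with one sentence per line.
    random.seed(0)

    with open("corpus.fr", encoding="utf-8") as f:
        lines = f.readlines()
    random.shuffle(lines)

    n = len(lines)
    splits = {
        "train.fr": lines[: int(0.98 * n)],
        "valid.fr": lines[int(0.98 * n): int(0.99 * n)],
        "test.fr":  lines[int(0.99 * n):],
    }
    for name, part in splits.items():
        with open(name, "w", encoding="utf-8") as f:
            f.writelines(part)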

keloemma commented 4 years ago

Good afternoon.

Thank you for your reply. I tried the proposed solution and I am getting this error:

INFO - 01/29/20 17:57:17 - 0:00:00 - The experiment will be stored in data/model/flaubert_base_cased/nqp72nh6ph

INFO - 01/29/20 17:57:17 - 0:00:00 - Running command: python train.py --exp_name flaubert_base_cased --dump_path 'data/model' --data_path 'data/processed/fr_corpus/BPE/10k' --amp 1 --lgs fr --clm_steps '' --mlm_steps fr --emb_dim 768 --n_layers 12 --n_heads 12 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --batch_size 16 --bptt 512 --optimizer 'adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001' --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion '_valid_fr_mlm_ppl,20' --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

INFO - 01/29/20 17:57:17 - 0:00:00 - Starting time 1580317037.413298
WARNING - 01/29/20 17:57:17 - 0:00:00 - Signal handler installed.
INFO - 01/29/20 17:57:17 - 0:00:00 - ============ Monolingual data (fr)
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/train.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 326787 words (9171 unique) in 16771 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/valid.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 1732 words (9171 unique) in 84 sentences. 3 unknown words (3 unique) covering 0.17% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/test.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 1641 words (9171 unique) in 86 sentences. 2 unknown words (2 unique) covering 0.12% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - ============ Data summary
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - train - fr: 16771
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - valid - fr: 84
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - test - fr: 86
INFO - 01/29/20 17:57:17 - 0:00:00 - Time limit to run script: -1 (min)
Traceback (most recent call last):
  File "train.py", line 391, in <module>
    main(params)
  File "train.py", line 260, in main
    model = build_model(params, data['dico'])
  File "/data1/home/ge/ke/eXP/Flaubert/xlm/model/__init__.py", line 112, in build_model
    model = TransformerModel(params, dico, is_encoder=True, with_output=True)
  File "/data1/ho/ge/ke/eXP/Flaubert/xlm/model/transformer.py", line 257, in __init__
    self.layerdrop = params.get('layerdrop', 0.0)
AttributeError: 'Namespace' object has no attribute 'get'

Do you have any idea how I can debug it? It seems to be linked to namespaces and the argument parser.

formiel commented 4 years ago

Hi @keloemma ,

Thanks for reporting the error! Sorry, that was a bug. I have fixed it (lines 257-259). Could you please try again?
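For reference, argparse.Namespace objects have no .get() method, so the usual fix is getattr() with a default value; the committed change is presumably along these lines (a sketch, not the exact code). This standalone snippet shows the difference:

    from argparse import Namespace

    params = Namespace(emb_dim=768, n_layers=12)  # no 'layerdrop' attribute set

    # params.get('layerdrop', 0.0) would raise AttributeError, as in the traceback above.
    # getattr() with a default falls back to 0.0 when the attribute is missing:
    layerdrop = getattr(params, 'layerdrop', 0.0)
    print(layerdrop)  # 0.0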

keloemma commented 4 years ago

Thank you.

I am now getting this error:

INFO - 01/30/20 11:58:13 - 0:00:02 - Number of parameters (model): 92501715
Traceback (most recent call last):
  File "train.py", line 391, in <module>
    main(params)
  File "train.py", line 260, in main
    model = build_model(params, data['dico'])
  File "/data1/ho/eXP/Flaubert/xlm/model/__init__.py", line 139, in build_model
    return model.cuda()
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in _apply
    param_applied = fn(param)
  File "/ho/anaconda3/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 305, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 64.92 MiB already allocated; 8.06 MiB free; 5.08 MiB cached)

I tried changing servers (and looked for answers in other GitHub repos where people have the same error), but I am still getting it. When I reduce the parameters (emb_dim, n_layers, batch_size, etc.) I still get this error or other ones, so I was wondering which parameters I should change for the command to work.

formiel commented 4 years ago

@keloemma Could you share the output of nvidia-smi?

keloemma commented 4 years ago

On this server, I get this other error:

Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)

plus the one related to the lack of CUDA memory.

(screenshot: nvidia-smi output for the first server)

The other one looks like this:

(screenshot: nvidia-smi output for the second server)

formiel commented 4 years ago

@keloemma There's too little memory available on your servers. If you aim to obtain a good, well-trained Flaubert on your data, you should first secure the necessary resources (unfortunately, pre-training usually needs a lot of them).

Currently the second GPU of the second server has around 3GB available, so maybe you can try training a tiny model with a small batch size on it:

CUDA_VISIBLE_DEVICES=1 python train.py \
    --exp_name flaubert_tiny \
    --dump_path $dump_path \
    --data_path $data_path \
    --amp 1 \
    --lgs 'fr' \
    --clm_steps '' \
    --mlm_steps 'fr' \
    --emb_dim 64 \
    --n_layers 4 \
    --n_heads 4 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 4 \
    --bptt 512 \
    --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
    --epoch_size 300000 \
    --max_epoch 100000 \
    --validation_metrics _valid_fr_mlm_ppl \
    --stopping_criterion _valid_fr_mlm_ppl,20 \
    --fp16 true \
    --accumulate_gradients 16 \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15'     

I haven't tested this, so I'm not sure how much memory the above will take. Note that the effective batch size is accumulate_gradients * batch_size (in the above case: 16 * 4 = 64). You can lower batch_size and increase accumulate_gradients to further reduce memory consumption. Reducing bptt can also help.
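To make the trade-off concrete, here is a toy calculation (my own illustration, not part of the repo's code): the effective batch size stays the same while the number of tokens processed per forward pass, which roughly drives activation memory, goes down.

    # Toy illustration of the batch_size / accumulate_gradients / bptt trade-off.
    # The effective batch size is unchanged between the first two configs, while
    # the tokens per forward pass (a rough proxy for activation memory) shrink.
    configs = [
        {"batch_size": 4, "accumulate_gradients": 16, "bptt": 512},
        {"batch_size": 2, "accumulate_gradients": 32, "bptt": 512},
        {"batch_size": 2, "accumulate_gradients": 32, "bptt": 256},
    ]
    for c in configs:
        effective = c["batch_size"] * c["accumulate_gradients"]
        tokens_per_step = c["batch_size"] * c["bptt"]
        print(f"effective batch = {effective}, tokens per forward pass = {tokens_per_step}")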

keloemma commented 4 years ago

Thank you, I will try your proposed solution.

keloemma commented 4 years ago

Hello, I am coming back to you because I would like to know if you can help me understand a few things.

--epoch_size and --max_epoch: from what I understood, max_epoch = number of epochs (an epoch meaning the entire dataset is passed forward and backward through the network exactly once), but what is epoch_size in your specific case?

And is Flaubert_tiny equal to flaubert_small_cased?

And --stopping_criterion _valid_fr_mlm_ppl,20: does this mean that if after 20 iterations/epochs the system is not learning anymore, it stops?

When the training finishes, I get these files/models:

(screenshot: listing of the output model directory, showing checkpoint*.pth and best-valid_fr_mlm_ppl*.pth files)

So, is this the directory I am supposed to use for my classification task?

I get two files ending with *.pth, so I am guessing I should use the last one.

formiel commented 4 years ago

Hi @keloemma,

--epoch_size and --max_epoch: from what I understood, max_epoch = number of epochs (an epoch meaning the entire dataset is passed forward and backward through the network exactly once), but what is epoch_size in your specific case?

max_epoch is the maximum number of epochs to train. The size of each epoch is not one pass through the entire dataset; it is controlled by the epoch_size parameter.

epoch_size is the number of sentences processed in each epoch. Since it would take a very long time to process the entire dataset, you risk obtaining nothing at all if a problem happens during training. We therefore cut training into smaller chunks of epoch_size sentences so that checkpoints are written more often. With the checkpoint files, you can resume training later if needed.
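As a quick worked example using the numbers from your log above (16,771 training sentences with --epoch_size 300000), one "epoch" in this setup actually corresponds to roughly 18 passes over your small corpus; on a large corpus it would instead cover only a fraction of one pass:

    # Sanity check of what an "epoch" means here, with the numbers from the log above.
    train_sentences = 16_771   # from "16771 sentences" in the training log
    epoch_size = 300_000       # from --epoch_size 300000
    passes_per_epoch = epoch_size / train_sentences
    print(f"one epoch ~ {passes_per_epoch:.1f} passes over this corpus")  # ~17.9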

And is Flaubert_tiny equal to flaubert_small_cased?

No, their architectures are different from each other (you can check out the architectures of our models here). I gave a very small network (Flaubert_tiny) as an example so that it would be quicker for you to run and debug the code first; you can then change the parameters to fit the model into your available GPU memory.

And --stopping_criterion _valid_fr_mlm_ppl,20: does this mean that if after 20 iterations/epochs the system is not learning anymore, it stops?

Yes, it means that training will stop if the validation perplexity does not improve (i.e. decrease) for 20 consecutive epochs.

So, is this the directory I am supposed to use for my classification task? I get two files ending with *.pth, so I am guessing I should use the last one.

Yes, you can use the pretrained weights (saved in the *best-valid_fr_mlm_ppl.pth files) from this directory to fine-tune on your classification task. The files whose names contain checkpoint are the last 2 checkpoints, which can be used to resume training.

There are 2 .pth files of each type (checkpoint and best validation model) as a safety measure in case you run into hard disk space problems: if the weights are being saved and there is no disk space left, you still have the files from the previous epoch. So you should use the latest files (the ones without prev_).
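If you want to double-check which file to load before wiring it into your fine-tuning code, here is a minimal inspection sketch (assuming a standard PyTorch checkpoint; the exact keys inside the saved dictionary may differ, and the path below is a placeholder for your own dump directory):

    import torch

    # Placeholder path: substitute your own dump directory and run id.
    ckpt_path = "data/model/flaubert_tiny/<run_id>/best-valid_fr_mlm_ppl.pth"

    # Load on CPU so inspection does not touch GPU memory, then look at the
    # top-level structure before plugging the weights into fine-tuning code.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    print(type(ckpt))
    if isinstance(ckpt, dict):
        print(list(ckpt.keys()))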

schwabdidier commented 4 years ago

I assume you got your answer, @keloemma?