Hi @keloemma,
Sorry for the unclear documentation. I have updated the README with detailed instructions and added scripts for splitting and preprocessing the data. Could you please try again?
Good afternoon.
Thank you for your reply. I tried the proposed solution and I am getting this error:
INFO - 01/29/20 17:57:17 - 0:00:00 - The experiment will be stored in data/model/flaubert_base_cased/nqp72nh6ph
INFO - 01/29/20 17:57:17 - 0:00:00 - Running command: python train.py --exp_name flaubert_base_cased --dump_path 'data/model' --data_path 'data/processed/fr_corpus/BPE/10k' --amp 1 --lgs fr --clm_steps '' --mlm_steps fr --emb_dim 768 --n_layers 12 --n_heads 12 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --batch_size 16 --bptt 512 --optimizer 'adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001' --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion '_valid_fr_mlm_ppl,20' --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'
INFO - 01/29/20 17:57:17 - 0:00:00 - Starting time 1580317037.413298
WARNING - 01/29/20 17:57:17 - 0:00:00 - Signal handler installed.
INFO - 01/29/20 17:57:17 - 0:00:00 - ============ Monolingual data (fr)
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/train.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 326787 words (9171 unique) in 16771 sentences. 0 unknown words (0 unique) covering 0.00% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/valid.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 1732 words (9171 unique) in 84 sentences. 3 unknown words (3 unique) covering 0.17% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - Loading data from data/processed/fr_corpus/BPE/10k/test.fr.pth ...
INFO - 01/29/20 17:57:17 - 0:00:00 - 1641 words (9171 unique) in 86 sentences. 2 unknown words (2 unique) covering 0.12% of the data.
INFO - 01/29/20 17:57:17 - 0:00:00 - ============ Data summary
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - train - fr: 16771
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - valid - fr: 84
INFO - 01/29/20 17:57:17 - 0:00:00 - Monolingual data - test - fr: 86
INFO - 01/29/20 17:57:17 - 0:00:00 - Time limit to run script: -1 (min)
Traceback (most recent call last):
File "train.py", line 391, in
Do you have any idea how I can debug it? It seems to be linked to namespaces and the argument parser.
Hi @keloemma,
Thanks for reporting the error! Sorry, that was a bug. I have fixed it (lines 257-259). Could you please try again?
Thank you.
I am now getting this error:
INFO - 01/30/20 11:58:13 - 0:00:02 - Number of parameters (model): 92501715
Traceback (most recent call last):
File "train.py", line 391, in
I tried changing the server (and looked for answers in other GitHub repos where people report the same error), but I am still getting it. When I lower the parameters (emb_layers, batch_size, etc.) I still get the same error or other errors, so I was wondering which parameters I should change for the command line to work.
@keloemma Could you share the output of nvidia-smi?
On this server, I get this other error:
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
plus the one related to the lack of CUDA memory.
The other one is similar.
@keloemma There is too little memory available on your servers. If you aim to obtain a good, well-trained Flaubert on your data, you should first secure the necessary resources (unfortunately, pretraining usually requires a lot of them).
Currently the second GPU of the second server has around 3 GB free, so maybe you can try training a tiny model with a small batch size on it:
CUDA_VISIBLE_DEVICES=1 python train.py \
--exp_name flaubert_tiny \
--dump_path $dump_path \
--data_path $data_path \
--amp 1 \
--lgs 'fr' \
--clm_steps '' \
--mlm_steps 'fr' \
--emb_dim 64 \
--n_layers 4 \
--n_heads 4 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--batch_size 4 \
--bptt 512 \
--optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" \
--epoch_size 300000 \
--max_epoch 100000 \
--validation_metrics _valid_fr_mlm_ppl \
--stopping_criterion _valid_fr_mlm_ppl,20 \
--fp16 true \
--accumulate_gradients 16 \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15'
I haven't tested this, so I'm not sure how much memory the above will take. Note that the effective batch size is accumulate_gradients * batch_size (in the above case: 16 * 4 = 64). You can lower batch_size and increase accumulate_gradients to further reduce memory consumption. Reducing bptt can also help.
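For example, here is a small illustrative sketch (just arithmetic, not part of the training code) of batch_size / accumulate_gradients combinations that keep the same effective batch size of 64 while lowering the per-step memory:

# Illustrative only: pairs with the same effective batch size; a smaller
# batch_size means less GPU memory is needed per optimization step.
target_effective = 64  # accumulate_gradients * batch_size, as in the command above

for batch_size in (16, 8, 4, 2):
    accumulate_gradients = target_effective // batch_size
    print(f"--batch_size {batch_size} --accumulate_gradients {accumulate_gradients}"
          f" -> effective batch size {batch_size * accumulate_gradients}")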
Thank you, I will try your proposed solution.
Hello, I am coming back to you, and I would like to know if you can help me understand a few things.
--epoch_size and --max_epoch: from what I understood, max_epoch = number of epochs (an epoch being one pass of the entire dataset forward and backward through the neural network), but what is epoch_size in your specific case?
And is Flaubert_tiny equal to flaubert_small_cased?
And --stopping_criterion _valid_fr_mlm_ppl,20: does this mean that if after 20 iterations/epochs the system is not learning anymore, it stops?
When the training finishes, I get this file/model.
So, is this the directory I am supposed to use for my classification task?
I get two files ending with *.pth, so I am guessing I should use the last one.
Hi @keloemma,
--epoch_size and --max_epoch: from what I understood, max_epoch = number of epochs (an epoch being one pass of the entire dataset forward and backward through the neural network), but what is epoch_size in your specific case?
max_epoch is the maximum number of epochs to train for. An epoch is not one pass through the entire dataset; its size is set by the epoch_size parameter.
epoch_size is the number of sentences processed in each epoch. Since it would take a very long time to process the entire dataset, you would risk ending up with nothing if a problem happened during training. We therefore cut the data into smaller chunks of epoch_size sentences to obtain checkpoints more quickly. With the checkpoint files, you can resume training later if needed.
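To make this concrete, here is a very simplified sketch (not the actual XLM/Flaubert training loop; the corpus and sizes are made up) of how epoch_size and max_epoch interact:

import itertools

# Toy illustration: an "epoch" is epoch_size sentences drawn from the stream,
# not one full pass over the corpus, and a checkpoint follows each epoch.
corpus = [f"sentence {i}" for i in range(1000)]
stream = itertools.cycle(corpus)  # wraps around the data indefinitely

def run(epoch_size=300, max_epoch=3):
    for epoch in range(max_epoch):
        seen = [next(stream) for _ in range(epoch_size)]
        # ... forward/backward passes and optimizer updates would happen here ...
        print(f"epoch {epoch}: processed {len(seen)} sentences, saving checkpoint")

run()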
And is Flaubert_tiny equal to flaubert_small_cased?
No, their architectures are different from each other (you can check out the architectures of our models here). I took the example of a very small network (Flaubert_tiny) so that it would be quicker for you to run and debug the code first; you can then change the parameters to fit the model into your available GPU memory.
And --stopping_criterion _valid_fr_mlm_ppl,20: does this mean that if after 20 iterations/epochs the system is not learning anymore, it stops?
Yes, it means that training will stop if the validation perplexity does not improve (i.e. decrease) for 20 consecutive epochs.
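A minimal sketch of the idea behind '_valid_fr_mlm_ppl,20' (this is not the library's implementation, just an illustration of the patience rule):

# Stop when the validation perplexity has not improved for `patience` epochs.
def should_stop(ppl_history, patience=20):
    best_epoch = ppl_history.index(min(ppl_history))  # first epoch reaching the best ppl
    return (len(ppl_history) - 1 - best_epoch) >= patience

history = [30.0, 25.0, 24.5] + [24.6] * 20  # 20 epochs without improvement
print(should_stop(history))  # True -> training would stop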
So, is this the directory I am supposed to use for my classification task? I get two files ending with *.pth, so I am guessing I should use the last one.
Yes, you can use the pretrained weights (saved in the *best-valid_fr_mlm_ppl.pth files) obtained in this directory to fine-tune on your classification task. The files whose names contain checkpoint are the last two checkpoints, which can be used to resume training.
There are two .pth files of each type (checkpoint and best validation model) as a safety measure in case you run into hard-disk-space problems: for example, if the weights are being written when the disk fills up, you still have the files from the previous epoch. So you should use the latest files (the ones without prev_).
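A rough sketch of how you could load those weights for fine-tuning (the path below is only an example, and the exact contents of the saved dict depend on the run, so inspect the keys yourself):

import torch

# Example path only; point this at the best-validation file in your dump directory.
ckpt_path = "data/model/flaubert_tiny/best-valid_fr_mlm_ppl.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# The file is a pickled dict; list its keys before wiring the weights into a classifier.
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))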
I assume that you got your answer, @keloemma?
Good afternoon,
I tried to follow your instructions to train Flaubert on my own corpus in order to get a model to use for my classification task, but I am having trouble understanding the procedure.
You said we should use this line to train on our preprocessed data:
/Flaubert$ python train.py --exp_name flaubert_base_lower --dump_path ./dumped/ --data_path ./own_data/data/ --lgs 'fr' --clm_steps '' --mlm_steps 'fr' --emb_dim 768 --n_layers 12 --n_heads 12 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 16 --bptt 512 --optimizer "adam_inverse_sqrt,lr=0.0006,warmup_updates=24000,beta1=0.9,beta2=0.98,weight_decay=0.01,eps=0.000001" --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_fr_mlm_ppl --stopping_criterion _valid_fr_mlm_ppl,20 --fp16 true --accumulate_gradients 16 --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'
I tried it after cloning Flaubert and installing all the necessary libraries, but I am getting this error:
FAISS library was not found. FAISS not available. Switching to standard nearest neighbors search implementation.
./own_data/data/train.fr.pth not found
./own_data/data/valid.fr.pth not found
./own_data/data/test.fr.pth not found
Traceback (most recent call last):
File "train.py", line 387, in
check_data_params(params)
File "/ho/ge/ke/eXP/Flaubert/xlm/data/loader.py", line 302, in check_data_params
assert all([all([os.path.isfile(p) for p in paths.values()]) for paths in params.mono_dataset.values()])
AssertionError
Does this mean I have to split my own data into three corpora (train, valid, and test) after preprocessing it? Should I run your preprocessing script on my own data before executing the command?
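For example, would a rough split like this (the file names and sizes are just my guesses) be what is expected before the preprocessing step?

import random

# Rough sketch: shuffle the raw corpus and write train/valid/test text files.
with open("own_data/data/corpus.fr", encoding="utf-8") as f:
    lines = f.read().splitlines()

random.seed(0)
random.shuffle(lines)

n_valid = n_test = 5000  # arbitrary sizes, only for illustration
splits = {
    "valid.fr": lines[:n_valid],
    "test.fr": lines[n_valid:n_valid + n_test],
    "train.fr": lines[n_valid + n_test:],
}
for name, chunk in splits.items():
    with open(f"own_data/data/{name}", "w", encoding="utf-8") as out:
        out.write("\n".join(chunk) + "\n")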