smart-patrol opened this issue 4 years ago
@smart-patrol There are some small differences between the original XLM and XLM-R models. The translation_from_pretrained_xlm task was not updated to work with the newer XLM-R model because we didn't evaluate it on translation tasks. However, please feel free to submit a PR; I am happy to review and merge it.
Hi @ngoyal2707, for the translation_from_pretrained_xlm task I trained an XLM model according to here on a wiki corpus (downloaded and tokenized according to the XLM GitHub repository), but it didn't work.
It reported: Transformer encoder / decoder state_dict does not contain embed_positions.weight.
Could you please give me a hint about where I can obtain an XLM model that can be loaded by the translation_from_pretrained_xlm task? I tried all the models I could find in the XLM GitHub repository, but none of them worked either. I don't know what to do now. :(
Thank you in advance.
The data used for translation was preprocessed like:
fairseq-preprocess --source-lang $src --target-lang $tgt \
--srcdict $SRCDICT \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir $DESTDIR --workers 20
My training script is:
CUDA_VISIBLE_DEVICES=0,1,2 fairseq-train \
$DATADIR \
--criterion label_smoothed_cross_entropy \
--pretrained-xlm-checkpoint ./checkpoints/mlm_wiki/checkpoint_best.pt \
--init-encoder-only --save-dir checkpoints/trans_xlm_new_g2p \
--optimizer adam --dropout 0.3 --weight-decay 0.0001 \
--max-tokens 500 --lr 5e-4 --activation-fn gelu \
--arch transformer_from_pretrained_xlm \
--task translation_from_pretrained_xlm
When I ran the training script, it reported the following trace:
Traceback (most recent call last):
File "/home/zhangjiawen/anconda/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/zhangjiawen/code/fairseq/fairseq_cli/train.py", line 270, in distributed_main
main(args, init_distributed=True)
File "/home/zhangjiawen/code/fairseq/fairseq_cli/train.py", line 64, in main
model = task.build_model(args)
File "/home/zhangjiawen/code/fairseq/fairseq/tasks/translation.py", line 264, in build_model
return super().build_model(args)
File "/home/zhangjiawen/code/fairseq/fairseq/tasks/fairseq_task.py", line 187, in build_model
return models.build_model(args, self)
File "/home/zhangjiawen/code/fairseq/fairseq/models/__init__.py", line 48, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 63, in build_model
return super().build_model(args, task)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer.py", line 221, in build_model
encoder = cls.build_encoder(args, src_dict, encoder_embed_tokens)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 67, in build_encoder
return TransformerEncoderFromPretrainedXLM(args, src_dict, embed_tokens)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 128, in __init__
pretrained_xlm_checkpoint=args.pretrained_xlm_checkpoint,
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 107, in upgrade_state_dict_with_xlm_weights
subkey, key, pretrained_xlm_checkpoint)
AssertionError: odict_keys(['version', 'embed_tokens.weight', 'embed_positions._float_tensor', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.k_proj.bias', 'layers.0.self_attn.v_proj.weight', 'layers.0.self_attn.v_proj.bias', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.q_proj.bias', 'layers.0.self_attn.out_proj.weight', 'layers.0.self_attn.out_proj.bias', 'layers.0.self_attn_layer_norm.weight', 'layers.0.self_attn_layer_norm.bias', 'layers.0.fc1.weight', 'layers.0.fc1.bias', 'layers.0.fc2.weight', 'layers.0.fc2.bias', 'layers.0.final_layer_norm.weight', 'layers.0.final_layer_norm.bias', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.k_proj.bias', 'layers.1.self_attn.v_proj.weight', 'layers.1.self_attn.v_proj.bias', 'layers.1.self_attn.q_proj.weight', 'layers.1.self_attn.q_proj.bias', 'layers.1.self_attn.out_proj.weight', 'layers.1.self_attn.out_proj.bias', 'layers.1.self_attn_layer_norm.weight', 'layers.1.self_attn_layer_norm.bias', 'layers.1.fc1.weight', 'layers.1.fc1.bias', 'layers.1.fc2.weight', 'layers.1.fc2.bias', 'layers.1.final_layer_norm.weight', 'layers.1.final_layer_norm.bias', 'layers.2.self_attn.k_proj.weight', 'layers.2.self_attn.k_proj.bias', 'layers.2.self_attn.v_proj.weight', 'layers.2.self_attn.v_proj.bias', 'layers.2.self_attn.q_proj.weight', 'layers.2.self_attn.q_proj.bias', 'layers.2.self_attn.out_proj.weight', 'layers.2.self_attn.out_proj.bias', 'layers.2.self_attn_layer_norm.weight', 'layers.2.self_attn_layer_norm.bias', 'layers.2.fc1.weight', 'layers.2.fc1.bias', 'layers.2.fc2.weight', 'layers.2.fc2.bias', 'layers.2.final_layer_norm.weight', 'layers.2.final_layer_norm.bias', 'layers.3.self_attn.k_proj.weight', 'layers.3.self_attn.k_proj.bias', 'layers.3.self_attn.v_proj.weight', 'layers.3.self_attn.v_proj.bias', 'layers.3.self_attn.q_proj.weight', 'layers.3.self_attn.q_proj.bias', 'layers.3.self_attn.out_proj.weight', 'layers.3.self_attn.out_proj.bias', 'layers.3.self_attn_layer_norm.weight', 'layers.3.self_attn_layer_norm.bias', 'layers.3.fc1.weight', 'layers.3.fc1.bias', 'layers.3.fc2.weight', 'layers.3.fc2.bias', 'layers.3.final_layer_norm.weight', 'layers.3.final_layer_norm.bias', 'layers.4.self_attn.k_proj.weight', 'layers.4.self_attn.k_proj.bias', 'layers.4.self_attn.v_proj.weight', 'layers.4.self_attn.v_proj.bias', 'layers.4.self_attn.q_proj.weight', 'layers.4.self_attn.q_proj.bias', 'layers.4.self_attn.out_proj.weight', 'layers.4.self_attn.out_proj.bias', 'layers.4.self_attn_layer_norm.weight', 'layers.4.self_attn_layer_norm.bias', 'layers.4.fc1.weight', 'layers.4.fc1.bias', 'layers.4.fc2.weight', 'layers.4.fc2.bias', 'layers.4.final_layer_norm.weight', 'layers.4.final_layer_norm.bias', 'layers.5.self_attn.k_proj.weight', 'layers.5.self_attn.k_proj.bias', 'layers.5.self_attn.v_proj.weight', 'layers.5.self_attn.v_proj.bias', 'layers.5.self_attn.q_proj.weight', 'layers.5.self_attn.q_proj.bias', 'layers.5.self_attn.out_proj.weight', 'layers.5.self_attn.out_proj.bias', 'layers.5.self_attn_layer_norm.weight', 'layers.5.self_attn_layer_norm.bias', 'layers.5.fc1.weight', 'layers.5.fc1.bias', 'layers.5.fc2.weight', 'layers.5.fc2.bias', 'layers.5.final_layer_norm.weight', 'layers.5.final_layer_norm.bias']) Transformer encoder / decoder state_dict does not contain embed_positions.weight. Cannot load encoder.sentence_encoder.embed_positions.weight from pretrained XLM checkpoint ./checkpoints/mlm_wiki/checkpoint_best.pt into Transformer.
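For anyone debugging this, a quick way to see which positional-embedding parameters a checkpoint actually contains (assuming the usual fairseq checkpoint layout, where the weights sit under the 'model' key) is a one-liner like:
python -c "import torch; ckpt = torch.load('./checkpoints/mlm_wiki/checkpoint_best.pt', map_location='cpu'); print([k for k in ckpt['model'] if 'embed_positions' in k])"
Learned positions show up as embed_positions.weight, while sinusoidal ones only expose embed_positions._float_tensor, which is exactly the mismatch the assertion complains about.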
I've encountered the same problem after training an XLM with the fairseq code and get the same exception. Any conclusions?
I've figured it out: you need to add the args --encoder-learned-pos and --decoder-learned-pos.
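For illustration, applying those two flags to the training command posted earlier in this thread (everything else unchanged) would look roughly like this:
CUDA_VISIBLE_DEVICES=0,1,2 fairseq-train \
$DATADIR \
--criterion label_smoothed_cross_entropy \
--pretrained-xlm-checkpoint ./checkpoints/mlm_wiki/checkpoint_best.pt \
--init-encoder-only --save-dir checkpoints/trans_xlm_new_g2p \
--optimizer adam --dropout 0.3 --weight-decay 0.0001 \
--max-tokens 500 --lr 5e-4 --activation-fn gelu \
--encoder-learned-pos --decoder-learned-pos \
--arch transformer_from_pretrained_xlm \
--task translation_from_pretrained_xlm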
@ngoyal2707 Does the translation_from_pretrained_xlm task now support XLM-R models?
@tonylekhtman, were you able to fine-tune your trained XLM for NMT in fairseq?
@ajesujoba I was able to pretrain the XLM model and then fine-tune it for NMT. Both the pretraining and the fine-tuning were done with fairseq.
Hi @tonylekhtman, that's great! Can you please share your training scripts for the pretraining and fine-tuning with fairseq? Thanks!
The pretraining code is taken from here: https://github.com/pytorch/fairseq/tree/master/examples/cross_lingual_language_model
Then you need to preprocess the bilingual data you want to fine-tune on using fairseq-preprocess.
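A rough sketch of that preprocessing step (paths and language codes are placeholders; the important part is reusing the dictionary from the XLM pretraining so the token ids line up with the pretrained embeddings):
fairseq-preprocess --source-lang src --target-lang tgt \
--srcdict /path/to/xlm/dict.txt --tgtdict /path/to/xlm/dict.txt \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir /path/to/preprocessed_bilingual_data --workers 20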
The fine-tuning command is as follows:
fairseq-train --data /path/to/preprocessed_bilingual_data \
--task translation_from_pretrained_xlm -a transformer_from_pretrained_xlm \
--pretrained-xlm-checkpoint /path/to/pretrained_model_checkpoint \
--max-tokens 4000 --encoder-embed-dim 1024 --decoder-embed-dim 1024 \
--encoder-ffn-embed-dim 4096 --encoder-learned-pos --decoder-learned-pos \
--max-source-positions 256 --max-target-positions 256 --num-workers 6
Cool, thanks @tonylekhtman. I had the same command, just wanted to be sure. Thanks once again!
@tonylekhtman Hi! Does XLM pretraining in fairseq only support MLM? The original XLM repository can pretrain with MLM+TLM, but fairseq's example says only MLM is supported.
I hit the same bug. Another error is "Cannot load decoder.sentence_encoder.layers.0.self_attn.in_proj_weight from pretrained XLM checkpoint". Has anyone else run into this problem?
🐛 Bug
Attempting to run training with XLM-R large using transformer_from_pretrained_xlm for the task translation_from_pretrained_xlm.
Not sure if bug is a good term here, as this is not documented and I have been trying to piece together what to do from fairseq's and XLM-R's repos.
To Reproduce
This will eventually generate the following stack trace:
Code sample
So that it points to a single dict running:
Training
Expected behavior
Ideally, I would like to use the weights from the 100-language model to fine-tune NMT for monolingual or multilingual models.
Environment
How you installed fairseq (pip, source): source
Additional context
Yes, I referenced issues #907 and #787 before opening.
I would be willing to help here, as it will save some arctic ice sheets if the model can start from pretrained weights for translation tasks.