facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Getting Assertion error: How to use XLM for Unsupervised NMT of language pairs other than English-French, English-German and English-Romanian #339

Open rashikumar01 opened 3 years ago

rashikumar01 commented 3 years ago

How can XLM be pretrained on monolingual data for other languages and then used for unsupervised NMT? I preprocessed the data and then ran this command:

!python train.py --exp_name test_sahi_mlm --dump_path ./dumped/ --data_path ./data/processed/sa-hi/ --lgs 'sa-hi' --clm_steps '' --mlm_steps 'sa,hi' --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 32 --bptt 256 --optimizer adam,lr=0.0001 --epoch_size 200000 --validation_metrics _valid_mlm_ppl --stopping_criterion _valid_mlm_ppl,10 --fp16 true

I get the following error:

File "/content/drive/MyDrive/XLM/xlm/data/loader.py", line 26, in process_binarized
(data['sentences'].dtype == np.int32) and (1 << 16 <= len(dico) < 1 << 31))
AssertionError

saikoneru commented 3 years ago

Can you preprocess the data again and retry? Delete the already processed data, or write the output to a new folder. I think something went wrong during preprocessing.
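To narrow this down, it may help to check the condition the assertion seems to enforce. The line in the traceback requires `int32` sentence indices when the dictionary has between 2^16 and 2^31 entries. The sketch below assumes the part of the assertion not shown in the traceback is the matching `uint16` case for dictionaries with fewer than 2^16 entries. The helper name `binarized_is_consistent` is hypothetical and not part of XLM.

```python
# Hypothetical helper, not part of XLM. It mirrors the dtype / vocabulary-size
# condition that the failing assertion in process_binarized appears to check.
# ASSUMPTION: the uint16 branch is inferred; only the int32 branch is visible
# in the traceback above.

def binarized_is_consistent(sentences_dtype: str, vocab_size: int) -> bool:
    """Return True if the index dtype matches the dictionary size."""
    # Small vocabularies fit in 16-bit unsigned indices (assumed branch).
    small = sentences_dtype == "uint16" and vocab_size < (1 << 16)
    # Larger vocabularies need 32-bit indices (branch shown in the traceback).
    large = sentences_dtype == "int32" and (1 << 16) <= vocab_size < (1 << 31)
    return small or large


if __name__ == "__main__":
    # Example values only; substitute the dtype and dictionary size
    # read from your own binarized file.
    for dtype, size in [("uint16", 30000), ("int32", 30000), ("int32", 70000)]:
        status = "ok" if binarized_is_consistent(dtype, size) else "MISMATCH"
        print(f"dtype={dtype:6s} vocab_size={size:6d} -> {status}")
```

If the check reports a mismatch for your file (for example, the data was binarized against a different vocabulary than the one now stored alongside it), that would be consistent with stale or mixed preprocessing output, and re-running preprocessing into an empty folder is the likely fix. To get the two inputs, you would pass `str(data['sentences'].dtype)` and `len(data['dico'])` from the loaded `.pth` file, assuming it has the layout the assertion implies.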