facebookresearch / UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Why do I get low BLEU on zh-en with NMT? #83

Open JxuHenry opened 5 years ago

JxuHenry commented 5 years ago

I only changed the corpus and trained the model. Corpus preprocessing follows the "get_data_enfr.sh" script. The run parameters are as follows:

python main.py \
  --exp_name zhTest \
  --transformer True \
  --n_enc_layers 4 \
  --n_dec_layers 4 \
  --share_enc 3 \
  --share_dec 3 \
  --share_lang_emb True \
  --share_output_emb True \
  --langs 'en,zh' \
  --n_mono -1 \
  --mono_dataset 'zh:./data/mono/all.zh.tok.60000.pth,,;en:./data/mono/all.en.tok.60000.pth,,' \
  --para_dataset 'en-zh:,./data/para/newdev/newsdev2017-enzh-src.XX.60000.pth,./data/para/newdev/newsdev2017-zhen-ref.XX.60000.pth' \
  --mono_directions 'zh,en' \
  --word_shuffle 3 \
  --word_dropout 0.1 \
  --word_blank 0.2 \
  --pivo_directions 'en-zh-en,zh-en-zh' \
  --pretrained_emb './data/mono/all.zh-en.60000.vec' \
  --pretrained_out True \
  --lambda_xe_mono '0:1,100000:0.1,300000:0' \
  --lambda_xe_otfd 1 \
  --otf_num_processes 30 \
  --otf_sync_params_every 1000 \
  --enc_optimizer adam,lr=0.0001 \
  --epoch_size 500000 \
  --stopping_criterion bleu_zh_en_valid,10

Do I need to modify anything else?
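For readers unfamiliar with the schedule syntax passed to --lambda_xe_mono, here is a minimal sketch of how a string such as '0:1,100000:0.1,300000:0' could be interpreted. It assumes the string lists step:value breakpoints with linear interpolation between them and a constant value outside the listed range. The function names parse_schedule and schedule_value are hypothetical and are not taken from the repository.

```python
def parse_schedule(spec):
    """Parse 'step:value,step:value,...' into a sorted list of (step, value)."""
    points = []
    for item in spec.split(","):
        step, value = item.split(":")
        points.append((int(step), float(value)))
    return sorted(points)


def schedule_value(points, step):
    """Assumed behaviour: piecewise-linear between breakpoints, constant outside."""
    if step <= points[0][0]:
        return points[0][1]
    for (s0, v0), (s1, v1) in zip(points, points[1:]):
        if s0 <= step < s1:
            return v0 + (v1 - v0) * (step - s0) / (s1 - s0)
    return points[-1][1]


points = parse_schedule("0:1,100000:0.1,300000:0")
for step in (0, 50000, 100000, 200000, 300000):
    print(step, round(schedule_value(points, step), 4))
```

Under that assumption, the auto-encoding loss weight falls from 1 to 0.1 over the first 100000 steps and reaches 0 at step 300000, after which only the back-translation loss (--lambda_xe_otfd 1) would drive training.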

HAOHAOXUEXI5776 commented 5 years ago

Hi JxuHenry, I'm curious about where to obtain the monolingual corpus for Chinese? Could you share your experience? Thx in advance.

JxuHenry commented 5 years ago

Hi! The UN officially provides a parallel corpus from its conferences, but I only used the Chinese side for training.

HAOHAOXUEXI5776 commented 5 years ago

Oh, thanks for your reply. Yesterday I found a nice Chinese corpus for multiple tasks, which also contains a monolingual corpus. If you are interested, you can find it here

JxuHenry commented 5 years ago

OK, thank you very much

cycao77 commented 5 years ago

Hi JxuHenry, I have the same problem. Have you solved it?

JxuHenry commented 5 years ago

No, I haven't, sorry.

JianLiu91 commented 4 years ago

Hi, how did you obtain the shared embeddings ./data/mono/all.zh-en.60000.vec? Were they trained on the concatenated data using fastText? Have you tried using MUSE to get aligned embeddings? I think it might help.
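One rough way to check whether the two languages share an embedding space is to compare cosine similarities for a few known translation pairs. The sketch below assumes the .vec file uses the plain-text fastText format (a "count dim" header line, then one token and its values per line). The helpers load_vec and cosine are hypothetical, the toy vectors are invented for illustration, and tokens in a real file would be BPE units rather than whole words.

```python
import math


def load_vec(lines, keep=None):
    """Read text-format embeddings: 'count dim' header, then 'token v1 v2 ...'."""
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        if keep is None or token in keep:
            vectors[token] = [float(x) for x in values]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real .vec file (values are made up).
toy = ["3 2", "cat 1.0 0.0", "猫 0.9 0.1", "car 0.0 1.0"]
vecs = load_vec(toy)
print(cosine(vecs["cat"], vecs["猫"]) > cosine(vecs["cat"], vecs["car"]))
```

With a real file, the lines would come from something like open(path, encoding="utf-8"), and keep could be restricted to a handful of tokens to avoid loading the whole vocabulary. If translation pairs score no higher than unrelated pairs, the embeddings are probably not aligned across the two languages.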