facebookresearch / UnsupervisedMT

Phrase-Based & Neural Unsupervised Machine Translation

Why do I get BLEU 1.01 on zh-en with PBSMT? #49

Closed socaty closed 5 years ago

socaty commented 5 years ago

Hi,

This has been confusing me for several days.

I followed the steps in PBSMT/run.sh, and I think the most important step is "Running MUSE to generate cross-lingual embeddings". I aligned the 'zh' and 'en' pre-trained word vectors you provide at https://fasttext.cc/docs/en/crawl-vectors.html with MUSE, and got Adv-NN P@1=21.3, Adv-CSLS P@1=26.9, Adv-Refine-NN P@1=18.5, Adv-Refine-CSLS P@1=24.0.

Then I used the aligned embeddings to generate the phrase table, but I ended up with a BLEU of 1.01. I don't think that result is right; something must have gone wrong.

My MUSE command is:

```bash
python unsupervised.py --src_lang ch \
    --tgt_lang en \
    --src_emb /data/experiment/embeddings/wiki.ch.300.vec.20w \
    --tgt_emb /data/experiment/embeddings/wiki.en.300.vec.20w \
    --exp_name test \
    --exp_id 0 \
    --normalize_embeddings center \
    --emb_dim 300 \
    --dis_most_frequent 50000 \
    --epoch_size 500000 \
    --dico_eval /data/experiment/unsupervisedMT/fordict/zh-en.5000-6500.sim.txt \
    --n_refinement 5 \
    --export "pth"
```

My command for generating the phrase table is:

```bash
python create-phrase-table.py \
    --src_lang $SRC \
    --tgt_lang $TGT \
    --src_emb $ALIGNED_EMBEDDINGS_SRC \
    --tgt_emb $ALIGNED_EMBEDDINGS_TGT \
    --csls 1 \
    --max_rank 200 \
    --max_vocab 300000 \
    --inverse_score 1 \
    --temperature 45 \
    --phrase_table_path ${PHRASE_TABLE_PATH::-3}
```

Does the problem lie in the word embeddings? Should I instead train word embeddings on my own training data with fastText and feed those to MUSE? I tried that, but got Adv-NN P@1=0.07, Adv-CSLS P@1=0.07, Adv-Refine-NN P@1=0.00, Adv-Refine-CSLS P@1=0.00. My fastText command is: ./fasttext skipgram -epoch 10 -minCount 0 -dim 300 -thread 48 -ws 5 -neg 10 -input $SRC_TOK -output $EMB_SRC. So I didn't use the embeddings trained on my own data, because I don't think they were aligned well.
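One sanity check that might help narrow this down (file names below are placeholders, not the exact files from this thread): when P@1 is near zero, it's worth verifying that the evaluation dictionary's source words actually appear in the trained embedding vocabulary, since a tokenization or preprocessing mismatch would produce exactly this symptom.

```bash
# Rough coverage check, assuming a space-separated MUSE eval dictionary
# ("src_word tgt_word" per line) and a fastText .vec file whose first
# line is the "vocab_size dim" header. All paths are placeholders.
cut -d' ' -f1 zh-en.5000-6500.txt | sort -u > dico_src.txt
tail -n +2 emb.zh.vec | cut -d' ' -f1 | sort -u > vocab_src.txt
comm -12 dico_src.txt vocab_src.txt | wc -l   # dictionary words covered
wc -l < dico_src.txt                          # dictionary words total
```

If coverage is low, the embeddings and the dictionary were probably built with different tokenizations (e.g. different Chinese segmenters).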

So, where did I go wrong?

cocaer commented 5 years ago

I have the same problem, @glample. Could you take a look?

glample commented 5 years ago

Hi,

So I'm not sure about your issue; I have not tried en-zh. However, the approach should work for en-zh. I know this paper did it: https://arxiv.org/pdf/1804.09057.pdf. Maybe you could try using the same setup / datasets / preprocessing as they did?

How big are your monolingual corpora?

Also, what is ch in wiki.ch.300.vec.20w? Isn't that the code for Chamorro rather than Chinese, which is zh? You can use your own corpus, and it's probably better, because that way the embeddings match your tokenization / text pre-processing; but if your corpora are small, or if you don't get a good P@1 accuracy, then the fastText ones are probably better.

Also, can you try MUSE with supervised.py --dico_train identical_char instead of unsupervised.py? It aligns words by taking as anchor points the words that are identical in both languages. This sometimes works better than the adversarial approach, even for distant languages.
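A minimal sketch of that invocation (embedding and dictionary paths are placeholders for your own files):

```bash
# Sketch: supervised MUSE alignment seeded on identical character strings.
# Paths below are placeholders, not the exact files from this thread.
python supervised.py --src_lang zh --tgt_lang en \
    --src_emb data/emb.zh.vec \
    --tgt_emb data/emb.en.vec \
    --dico_train identical_char \
    --n_refinement 5 \
    --dico_eval data/zh-en.5000-6500.txt \
    --export pth
```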

socaty commented 5 years ago

OK, thanks.

Julisa-test commented 5 years ago

Hi,

I ran the same task on zh-en, but the trained models cannot be used to translate the test sets. Why? Like this:

[screenshot: error message returned by Moses]

The MUSE train.log is as follows: train.log

This problem has been bothering me for a long time; can you give me some guidance? @glample

glample commented 5 years ago

The MUSE train.log seems reasonable. I'm not sure about the message returned by Moses; it's a Moses-specific issue. Is it just a warning? What is at the end of the log? Did you have a look at the phrase table you generated? Does it look good?
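Something like this is enough for a first look (the path is a placeholder for your generated table):

```bash
# Peek at the generated phrase table. The Moses text format is
# 'src phrase ||| tgt phrase ||| scores ...'; the pairs should look
# like plausible translations and all scores should be finite numbers.
zcat phrase-table.gz | head -n 5
zcat phrase-table.gz | shuf -n 5   # and a few random entries
```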

Julisa-test commented 5 years ago

Thank you for your reply.

The end of the log looks like this: [screenshot]. And I think the phrase table looks good: [screenshot].

@glample

wingsyuan commented 4 years ago

Hi @socaty, @cocaer, thanks very much! When I generated the phrase table and tried translating the test sentences, the following error occurred. Can you give me some advice?

```
Linking phrase-table path...
Translating test sentences...
Defined parameters (per moses.ini or switch):
	config: /data/home/super/mt/dataset/muti-domain/unmt/moses_train_en-zh/model/moses.ini
	distortion-limit: 6
	feature: UnknownWordPenalty WordPenalty PhrasePenalty PhraseDictionaryMemory name=TranslationModel0 num-features=2 path=/data/home/super/mt/dataset/muti-domain/unmt/moses_train_en-zh/model/phrase-table.gz input-factor=0 output-factor=0 Distortion KENLM name=LM0 factor=0 path=/data/home/super/mt/dataset/muti-domain/unmt/data/zh.lm.blm order=5
	input-factors: 0
	mapping: 0 T 0
	threads: 48
	weight: UnknownWordPenalty0= 1 WordPenalty0= -1 PhrasePenalty0= 0.2 TranslationModel0= 0.2 0.2 Distortion0= 0.3 LM0= 0.5
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=WordPenalty
FeatureFunction: WordPenalty0 start: 1 end: 1
line=PhrasePenalty
FeatureFunction: PhrasePenalty0 start: 2 end: 2
line=PhraseDictionaryMemory name=TranslationModel0 num-features=2 path=/data/home/super/mt/dataset/muti-domain/unmt/moses_train_en-zh/model/phrase-table.gz input-factor=0 output-factor=0
FeatureFunction: TranslationModel0 start: 3 end: 4
line=Distortion
FeatureFunction: Distortion0 start: 5 end: 5
line=KENLM name=LM0 factor=0 path=/data/home/super/mt/dataset/muti-domain/unmt/data/zh.lm.blm order=5
FeatureFunction: LM0 start: 6 end: 6
Loading UnknownWordPenalty0
Loading WordPenalty0
Loading PhrasePenalty0
Loading Distortion0
Loading LM0
Loading TranslationModel0
Start loading text phrase table. Moses format : [0.502] seconds
Reading /data/home/super/mt/dataset/muti-domain/unmt/moses_train_en-zh/model/phrase-table.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

Exception: moses/TranslationModel/RuleTable/LoaderStandard.cpp:202 in bool Moses::RuleTableLoaderStandard::Load(const Moses::AllOptions&, Moses::FormatType, const std::vector&, const std::vector&, const string&, size_t, Moses::RuleTableTrie&) threw util::Exception because `isnan(score)'.
Bad score -- on line 119910
```

socaty commented 4 years ago

@wingsyuan Sorry, I'm not sure about the cause of your problem. My guess is that something went wrong when the phrase table was trained. Please check your training process, and I will send my training script to your email soon.
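In the meantime, one quick check (paths are placeholders): the Moses exception above points at a NaN score on line 119910 of the phrase table, and you can confirm that directly:

```bash
# Inspect the exact line Moses complains about (path is a placeholder):
zcat model/phrase-table.gz | sed -n '119910p'

# List entries whose score field contains nan; checking field 3 of the
# 'src ||| tgt ||| scores' format avoids matching the phrase text itself.
zcat model/phrase-table.gz | awk -F ' \\|\\|\\| ' '$3 ~ /nan/ {print NR ": " $0}' | head
```

NaN scores usually mean something went wrong upstream when the table was generated, so regenerating it after fixing that step is safer than patching the file by hand.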

Shuailong commented 4 years ago

@wingsyuan I have the same problem here. Have you solved this?