HAOHAOXUEXI5776 opened this issue 5 years ago
I used this code on a Chinese-English translation task, and I found the result promising. However, when I transferred it to dialogue generation, the result was terribly bad, and the BLEU score was shockingly low. Does anyone have the same experience? Thanks for your advice in advance.
My expression may cause some confusion. I mean that I used this code to train a dialogue generation model and found the result was terrible.
How do you use this for Chinese-English translation? I use it for the zh-en problem too, but the inference result is poor.
Hello? I also use these modules on zh-en translation, but the result is not as good as de-en translation. I think it is a problem with sentencepiece, but I am not sure. Did you do anything to the data before training?
I'm also working on zh-en translation recently. However, both the training and testing processes hit the same zero-division error while calculating the BLEU score. I wonder whether you encountered that problem, and how did you solve it? Thanks!
No. I used the previous version (TF 1.2.0), which is easier for me to understand. But I am still confused about the BLEU score.
Zero division means that when you calculate BLEU, one of your n-gram precisions (n = 4 in this project) was 0. You could use a smoothed BLEU to solve the problem.
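To make the suggestion concrete, here is a minimal sentence-level BLEU with add-one smoothing on the n-gram precisions, so a zero n-gram overlap no longer causes a ZeroDivisionError. This is an illustrative sketch, not the repo's own scoring code; function names are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_bleu(hypothesis, reference, max_n=4):
    """BLEU for one tokenized sentence pair, with +1 (Laplace) smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        # add-one smoothing keeps every precision > 0, so log() is defined
        precisions.append((overlap + 1) / (total + 1))
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty for hypotheses shorter than the reference
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(log_avg)

sent = "the cat sat on the mat".split()
print(smoothed_bleu(sent, sent))  # identical sentences score 1.0
```

Libraries such as NLTK ship the same idea as `SmoothingFunction` in `nltk.translate.bleu_score`, which may be more convenient than rolling your own.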
My practice is as follows: I use the casia2015 Chinese-English translation corpus as the train/dev/test dataset. After tokenizing the English and Chinese data, I get the 'prepro' files. Then I use prepro.py to get the 'segmented' files (I made some modifications to prepro.py, because I had already obtained the 'prepro' files). All configurations, including the BPE vocabulary size, are unchanged. Then I change train1, train2, eval1, ... in hparams.py to the Chinese-English data paths and run train.py. After 20 epochs, the BLEU score is 16.86, and the translations look good.
Hope it will help.
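The path changes described above might look like the sketch below. Only the flag names train1/train2/eval1/... come from this thread; the casia2015 directory layout and file names are assumptions for illustration.

```python
import argparse

# Hypothetical excerpt of hparams.py with the data-path flags repointed
# at a Chinese-English corpus. The actual file names are an assumption.
parser = argparse.ArgumentParser()
parser.add_argument('--train1', default='casia2015/segmented/train.zh.bpe',
                    help='Chinese (source) training data')
parser.add_argument('--train2', default='casia2015/segmented/train.en.bpe',
                    help='English (target) training data')
parser.add_argument('--eval1', default='casia2015/segmented/eval.zh.bpe',
                    help='Chinese evaluation data')
parser.add_argument('--eval2', default='casia2015/segmented/eval.en.bpe',
                    help='English evaluation data')
hp = parser.parse_args([])  # [] so the sketch runs without CLI arguments
```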
The problem is... I use the news-commentary zh-en translation corpus as the train dataset, and other news as the test dataset. But the BLEU score is 26.13, even though the results include lots of . (I don't think it could be better than de-en translation.)
You mean there are lots of '.' in the result?
No, I mean '( u n k )'. I think the problem is the max_length for each sentence (after I reviewed the code today).
I wonder
up
I also ran into this trouble when using the code for dialogue generation. When evaluating, the model just generated many responses that were no different from each other. How can I solve it?
Do you mean the solution is to increase the value of max_length for each sentence? I am wondering how to measure the right value of max_length for my data. Right now the max_length in my code is 300, but the model still works really badly.
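One way to pick max_length empirically, rather than guessing, is to look at the token-length distribution of the training data and choose a cutoff that covers, say, 99% of sentences. This is a generic sketch, not code from this repo; the 99% threshold and whitespace tokenization are assumptions.

```python
def percentile_length(sentences, pct=0.99):
    """Token length at the given percentile of a corpus of sentences.
    Sentences longer than this cutoff would be truncated or dropped."""
    lengths = sorted(len(s.split()) for s in sentences)
    idx = min(int(pct * len(lengths)), len(lengths) - 1)
    return lengths[idx]

# Toy corpus; in practice, read the real training file line by line.
corpus = [
    "this is a short sentence",
    "a somewhat longer training sentence with more tokens in it",
    "tiny",
]
print(percentile_length(corpus))  # longest length needed for this toy data
```

If 99% of your sentences fit in far fewer than 300 tokens, a smaller max_length also speeds up training, since attention cost grows with sequence length.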
I also use this model to train an English-Chinese translation task, but the result is terrible. The eval result is the same sentence in every epoch. Have you ever encountered a similar problem?
I have the same issue. Have you fixed it?
@HAOHAOXUEXI5776 For your Chinese-English translation, how did you process the Chinese? I used the iwlst2015 data and segmented the Chinese with jieba; everything else is the same as the original code, but after the second epoch the BLEU is 0 and stays 0.
@duguiming111 I used the processing from the original code. As I remember, both the Chinese and English were processed with BPE; I did not use jieba segmentation.
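For context on why jieba is unnecessary here: BPE starts from individual characters and repeatedly merges the most frequent adjacent pair, so it works on Chinese without any word pre-segmentation. A toy sketch of the merge-learning loop (not the repo's sentencepiece pipeline; the corpus and merge count are made up):

```python
from collections import Counter

def learn_bpe(words, num_merges=10):
    """Learn BPE merge operations from a list of words.
    Starts from single characters, so it handles Chinese and English alike."""
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # apply the merge to every word in the vocabulary
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus mixing Chinese and English; frequent character pairs merge first.
corpus = ["翻译", "翻译", "翻译", "机器", "low", "low"]
print(learn_bpe(corpus, num_merges=2))
```

In practice sentencepiece (used by this repo) does this at scale directly on raw text, which is why the original preprocessing needs no separate Chinese word segmenter.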
OK, thanks, I'll try again.
I set the batch size to 32 and trained for 8 epochs, but the BLEU keeps oscillating around 0.7. What batch size did you use, and how much GPU memory does your machine have?
@duguiming111 As I remember it filled up one 12GB card.