facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Do fairseq-interactive and fairseq-generate produce different results? #568

Closed gxzks closed 5 years ago

gxzks commented 5 years ago

I used fairseq-interactive and fairseq-generate respectively to decode the same file, but the results were slightly different. The output from fairseq-generate outperformed the output from fairseq-interactive by about 0.3 BLEU. All other parameters were set to the same values.

myleott commented 5 years ago

Can you post the exact commands you ran? Also what model architecture is this?

gxzks commented 5 years ago

fairseq-interactive command:

```
CUDA_VISIBLE_DEVICES=0 fairseq-interactive $BIN_DATA \
    --path "${SAVE}/checkpoint_last.pt" \
    --buffer-size 64 --beam 4 --lenpen 0.6 --remove-bpe sentencepiece \
    --input $root_path/subword_data/newsdev2017/spm.newsdev2017.en \
    --output newsdev2017.en.decodes.avg \
    --user-dir $USR_DIR
```

fairseq-generate command:

```
CUDA_VISIBLE_DEVICES=0 fairseq-generate $BIN_DATA \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 128 --beam 4 --lenpen 0.6 --remove-bpe sentencepiece \
    --gen-subset valid \
    --user-dir $USR_DIR
```

@myleott, as for the model, I used my own architecture, which is slightly different from the original Transformer. I changed the source code of interactive.py to write the results into the file passed by --output, like this:
```python
if args.output != '':
    fout.write(hypo.split('\t')[-1] + '\n')
```

myleott commented 5 years ago

If you're using your own architecture, make sure that it's properly handling/ignoring padding. For example, make sure you're not attending over padding symbols [1]. Also make sure you've implemented reorder_encoder_out [2] and reorder_incremental_state [3] if needed.

One way to test if this is the problem is to change batch size to 1 and see if you still get different results between the two.

[1] https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py#L322-L325
[2] https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py#L339-L356
[3] https://github.com/pytorch/fairseq/blob/master/fairseq/models/fconv.py#L541-L546
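For context, here is a minimal sketch of the padding mask and reorder_encoder_out hook described above, assuming an encoder that returns a dict with 'encoder_out' (T x B x C) and 'encoder_padding_mask' (B x T) as in the linked transformer.py. The class name, embedding layer, and dimensions are illustrative, not the thread's actual model:

```python
import torch.nn as nn
from fairseq.models import FairseqEncoder

class MyCustomEncoder(FairseqEncoder):
    """Illustrative encoder; the embedding and layers stand in for a real model."""

    def __init__(self, dictionary, embed_dim=512):
        super().__init__(dictionary)
        self.embed_tokens = nn.Embedding(len(dictionary), embed_dim,
                                         padding_idx=dictionary.pad())

    def forward(self, src_tokens, src_lengths):
        # B x T mask that is True at padding positions; attention layers
        # should use this so they never attend over pad symbols.
        encoder_padding_mask = src_tokens.eq(self.dictionary.pad())
        x = self.embed_tokens(src_tokens).transpose(0, 1)  # T x B x C
        # ... real encoder layers would go here, using the padding mask ...
        return {
            'encoder_out': x,                              # T x B x C
            'encoder_padding_mask': encoder_padding_mask,  # B x T
        }

    def reorder_encoder_out(self, encoder_out, new_order):
        # Beam search reorders the (expanded) batch between steps, so the
        # cached encoder outputs must be reordered along their batch dim too.
        if encoder_out['encoder_out'] is not None:
            encoder_out['encoder_out'] = \
                encoder_out['encoder_out'].index_select(1, new_order)
        if encoder_out['encoder_padding_mask'] is not None:
            encoder_out['encoder_padding_mask'] = \
                encoder_out['encoder_padding_mask'].index_select(0, new_order)
        return encoder_out
```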

gxzks commented 5 years ago

Thanks for your quick reply, @myleott. I'll check my code again to see if there's anything in the implementation I missed.

gxzks commented 5 years ago

@myleott The results stay the same with batch-size=1 and batch-size=128, and my code matches the original fairseq Transformer for the padding and reorder_encoder_out parts. However, fairseq-generate still outperforms fairseq-interactive. Does generate have any optimization at inference time? Or is reading from the binarized files slightly different from reading from raw text?

mali-nuist commented 5 years ago

Did you tokenize the raw text just like the binarized file?

gxzks commented 5 years ago

The raw text I used for fairseq-interactive is the same file I used for fairseq-preprocess. I also checked the S- output of fairseq-generate, and it was the same as the raw text.

mali-nuist commented 5 years ago

@gxzks The file used for fairseq-preprocess would have been tokenized with BPE by the preprocessing script. However, did you perhaps skip tokenizing the raw text and pass it directly as the input to fairseq-interactive? ... maybe, I guess ...
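For reference, applying the same SentencePiece model to the raw text before piping it to fairseq-interactive might look like the sketch below; the spm.model and file paths are placeholders, not the thread's actual paths:

```python
import sentencepiece as spm

# Load the same SentencePiece model that was used before fairseq-preprocess.
sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path

# Re-segment the raw text so fairseq-interactive sees the same subword units
# that the binarized data was built from.
with open("newsdev2017.en") as fin, open("spm.newsdev2017.en", "w") as fout:
    for line in fin:
        pieces = sp.EncodeAsPieces(line.strip())
        fout.write(" ".join(pieces) + "\n")
```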

gxzks commented 5 years ago

Found the problem. The SentencePiece segmentation changes some special tokens in the reference file, so the reference output of fairseq-generate is slightly different from the original reference. Sorry for my silly mistake...