dirkiedai / sk-mt

This is the official code for our paper "Simple and Scalable Nearest Neighbor Machine Translation" (ICLR 2023).

Wrong binarized data for fairseq #1

Closed WJMacro closed 11 months ago

WJMacro commented 1 year ago

Thanks for your great code! However, I found something wrong with the binarized data you provided for fairseq. According to the preprocess.log, the binarized data appears to be a preprocessed wikitext-103 dataset for the LM task:

```
[None] Dictionary: 267743 types
[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103
```

I tried to preprocess the data myself but could not work out the required text data format from the code. Could you please re-upload the correct dataset or release the fairseq preprocessing script? Looking forward to your reply.

dirkiedai commented 1 year ago

Thanks for raising the issue! We have re-uploaded the binarized multi-domain dataset. Please check it out in the README file. If you need further help, please let me know.

WJMacro commented 1 year ago

Thanks for your timely and effective reply! Since I would like to try to replicate your work on other language pairs or domains, I wonder if you could provide the data preprocessing scripts.

dirkiedai commented 1 year ago

Hi, there!

We've updated the README file, where we provide instructions to retrieve reference samples and preprocess the data. Note that the scripts for the THUMT and fairseq frameworks are not exactly the same.
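For readers hitting the same question: a typical fairseq binarization step looks roughly like the sketch below. This is a generic example, not the repo's exact script; the file prefixes, language codes, and destination directory are placeholders, and the actual flags used for this paper may differ (see the README).

```shell
# Hypothetical example: binarize a tokenized de-en parallel corpus for fairseq.
# Expects plain-text files train.de/train.en, valid.de/valid.en, test.de/test.en
# under data/, one sentence per line, already tokenized (and BPE-encoded if used).
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/multi-domain-de-en \
    --joined-dictionary \
    --workers 8
```

`--joined-dictionary` builds a single shared vocabulary for source and target, which is common for related language pairs; drop it (or pass `--srcdict`/`--tgtdict` with existing dictionary files) if you need separate or pre-built vocabularies.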