Thanks for raising the issue! We have re-uploaded the binarized multi-domain dataset; please check it out in the README file. If you need further help, please let me know.
Thanks for your timely and effective reply! Since I would like to try replicating your work on other language pairs or domains, I wonder if you could provide the scripts for preprocessing the data.
Hi there!
We've updated the README file, where we provide instructions for retrieving the reference samples and preprocessing the data. Note that the scripts for the THUMT and Fairseq frameworks are not exactly the same.
Thanks for your great code! However, I found something wrong with the binarized data you provided for fairseq. According to the preprocess.log, the binarized data appears to be a wikitext-103 dataset preprocessed for a language-modeling task:
```
[None] Dictionary: 267743 types
[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103
```
I tried to run the preprocessing myself, but I could not work out the required text data format from the code. Could you please re-upload the correct dataset or release the script you used for fairseq preprocessing? Looking forward to your reply.
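For reference, this is roughly the `fairseq-preprocess` call I assumed for binarizing one translation domain; the file layout, language pair, and dictionary paths below are placeholders of my own, not taken from this repo:

```bash
# Assumed layout: data/{train,valid,test}.{de,en} are tokenized, BPE-applied parallel files.
# Reusing the pretrained model's dictionaries keeps the binarized data compatible with its
# vocabulary (the dictionary paths here are hypothetical).
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --srcdict pretrained/dict.de.txt --tgtdict pretrained/dict.en.txt \
    --destdir data-bin/multi-domain \
    --workers 8
```

If the expected input differs from plain parallel text like this, a short note in the README would already help a lot.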