luisarmandom opened this issue 2 years ago
The real value of fairseq/examples/translation is that you can actually see the raw data, so you can understand what structure fairseq-preprocess expects. Also, fairseq does not require a GPU: you can run it on CPU with a tiny model in order to step through fairseq in VS Code.
-- To produce fairseq data:
Assuming you are following translation: training a new model, running the provided bash script will give you the folder iwslt14.tokenized.de-en. Open it and you will see train/valid/test.en (the English data) and the matching .de files (the German data).
So if you have some knowledge of tokenizers/SentencePiece, this tells you the data format:
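Concretely (a made-up example; the sentences below are not from IWSLT), the input is plain parallel text, one tokenized sentence per line, where line i of the .en file is the translation of line i of the .de file. After BPE the tokens are subword pieces, but the layout is the same:

```text
# train.en                    # train.de
how are you ?                 wie geht es dir ?
i love my dog .               ich liebe meinen hund .
```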
fairseq-preprocess turns the tokenized data into binary files, which reduces file size and lets fairseq-train/fairseq-generate read the data correctly.
fairseq-preprocess also needs a dictionary for the source language and one for the target language to turn tokens into numbers (and then into binary data). The first example does not provide --srcdict, while the Multilingual Translation section shows that you can reuse an existing dict.txt by specifying --srcdict PATH_TO_TXT and --tgtdict PATH_TO_TXT. If you do not specify a dict, fairseq-preprocess will build one from the given data (not recommended if you have an actual big project to run, but fine if you just want to see the program make progress).
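A minimal sketch of the preprocessing call, assuming your tokenized files are named train/valid/test.{de,en} in the current directory (the destination path is just a placeholder):

```shell
# Binarize a de->en dataset; because --srcdict/--tgtdict are not given,
# fairseq-preprocess builds the dictionaries from the training data.
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref train --validpref valid --testpref test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 4

# To reuse existing dictionaries instead (as in the Multilingual
# Translation example), add:
#   --srcdict path/to/dict.de.txt --tgtdict path/to/dict.en.txt
```

The --trainpref/--validpref/--testpref values are prefixes: fairseq appends .de and .en itself.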
On Colab, if you have installed fairseq with !pip install fairseq
(I remember you have to prefix commands with ! to run them as in a command line),
you can run it with !fairseq-preprocess --trainpref .....
--
For people who want to know how fairseq-preprocess/train/generate work: use VS Code or any editor that lets you set breakpoints easily and inspect variable values at a break. Run it with a very small model (e.g. 2 encoder layers, 2 decoder layers, 64 embedding dimensions, 256 FFN dimensions, or even smaller).
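A sketch of such a debug run, assuming the binarized data sits in data-bin/iwslt14.tokenized.de-en; the hyperparameters are deliberately tiny and are not meant to produce a good model:

```shell
# Train a deliberately tiny transformer on CPU, just to step through
# the code with a debugger.
fairseq-train data-bin/iwslt14.tokenized.de-en \
    --cpu \
    --arch transformer_iwslt_de_en \
    --encoder-layers 2 --decoder-layers 2 \
    --encoder-embed-dim 64 --decoder-embed-dim 64 \
    --encoder-ffn-embed-dim 256 --decoder-ffn-embed-dim 256 \
    --optimizer adam --lr 5e-4 \
    --lr-scheduler inverse_sqrt --warmup-updates 100 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 1024 --max-epoch 1
```

With a model this small, an epoch finishes quickly even on CPU, which is all you need to hit a breakpoint inside the training loop.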
❓ Questions and Help
What is your question?
I have 2 separate files, each containing sentences in its respective language. Since there is no documentation explaining how, and in which format, I should pre-process the dataset in order to run training, I am struggling to figure it out.
I found this similar question: https://github.com/facebookresearch/fairseq/issues/411 and the answer was "We provide example commands: https://github.com/pytorch/fairseq/tree/master/examples/translation"
However, that's not helping me because those commands still rely on the pre-trained models. I have no idea how to:
I'm assuming this is to be run within the fairseq/ root directory, and I still have no idea what the dataset I give it should look like (right now I have either the original CSV file, the separate {train, test, dev}.{src, tgt} files, or the BPE files themselves).
What's your environment?
I'm running this on Google Colab because I don't personally have a GPU