facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

How to use a new dataset to train for a new language pair? #4437

Open · luisarmandom opened this issue 2 years ago

luisarmandom commented 2 years ago

❓ Questions and Help

What is your question?

I have 2 separate files, each containing sentences in its respective language. Since there is no documentation explaining how and in which format I should pre-process the dataset to run training, I am struggling to figure it out.

I found this similar question: https://github.com/facebookresearch/fairseq/issues/411 and the answer was "We provide example commands: https://github.com/pytorch/fairseq/tree/master/examples/translation"

However, that's not helping me because those examples still use pre-trained models. I have no idea how to:

  1. Give fairseq the data I have so it can be preprocessed. I assume it has something to do with this command:
DATASET=/path/to/dataset
fairseq-preprocess \
--only-source \
--trainpref $DATASET/train.txt \
--validpref $DATASET/valid.txt \
--testpref $DATASET/test.txt \
--destdir data-bin/summary \
--workers 20

I'm assuming this is to be run from the fairseq/ root directory, and I still have no idea what the dataset I give it should look like (right now I have either the original CSV file, the separate {train, test, dev}.{src, tgt} files, or the BPE files themselves).

What's your environment?

I'm running this on Google Colab because I don't personally have a GPU.

gmryu commented 2 years ago

The actual reason fairseq/examples/translation exists is that you can see the raw data there and thus understand what structure fairseq-preprocess expects. Also, fairseq does not require a GPU to run; you can run it on CPU with a tiny model in order to debug fairseq with vscode.

-- To produce fairseq data:

Assuming you are reading translation: training-a-new-model, running the bash commands there will give you the folder iwslt14.tokenized.de-en.
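
If you have cloned the repo, that folder comes from running the preparation script shipped in examples/translation (this is just the sequence of commands the README walks through):

git clone https://github.com/facebookresearch/fairseq
cd fairseq/examples/translation
bash prepare-iwslt14.sh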

Open it and you will see train/valid/test .en English data and .de German data. You will find that:

  1. every line in .en is one example corresponding to the same line in .de,
  2. there are " " (whitespaces) between tokens (words/chars).

You can also find the folder orig, which holds the original data.
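
As a made-up illustration (not the actual corpus contents), parallel files like train.de and train.en look like this, with line i of one file translating line i of the other and tokens separated by single spaces:

train.de:
wie geht es dir ?
das ist ein buch .

train.en:
how are you ?
this is a book .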

So if you have knowledge of tokenizers/sentencepiece, the above facts tell you the data format:

  1. raw data (sentences written in a human-readable way) must be tokenized first, either with an existing tokenizer or with your own rules/programs.
  2. each line must be one entry of data, and each line in the .{source_language} file must correspond to the same line in the .{target_language} file.
  3. tokens are separated and distinguished by " " (whitespace) for fairseq-preprocess.
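
Putting those rules together, a typical call for a translation pair looks like this sketch (paths, language codes, and the destdir name are placeholders to replace with your own; note that --only-source is dropped because translation needs both sides):

TEXT=/path/to/your/tokenized/data
fairseq-preprocess \
  --source-lang de --target-lang en \
  --trainpref $TEXT/train \
  --validpref $TEXT/valid \
  --testpref $TEXT/test \
  --destdir data-bin/my-de-en \
  --workers 20

With --source-lang de --target-lang en, fairseq-preprocess looks for $TEXT/train.de and $TEXT/train.en (and likewise for valid/test), so the prefixes must not include the language extension.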

fairseq-preprocess will turn tokenized data into binary files, which reduces file size and helps fairseq-train/generate read the data correctly. fairseq-preprocess also requires a {source_language} dict.txt and a {target_language} dict.txt to turn tokens into numbers (and then into binary data). The first example does not provide --srcdict, while the Multilingual Translation section shows that you can reuse a dict.txt by specifying --srcdict PATH_TO_TXT and --tgtdict PATH_TO_TXT. If you do not specify a dict, fairseq-preprocess will create one from the given data (not recommended if you have an actual big project to run, but okay if you just want the program to make progress).
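
In case you wonder what a dict.txt looks like: it is one token per line, each followed by its count, for example (made-up numbers; special symbols like <pad> and <unk> are added automatically and do not appear in the file):

the 123456
. 118903
hello 4201

fairseq-preprocess writes such files as dict.{source_language}.txt and dict.{target_language}.txt into --destdir, and those are the files you can later pass to --srcdict/--tgtdict.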

For colab, if you have installed fairseq via !pip install fairseq (I remember you have to add ! to run things like on a command line), you can run it with !fairseq-preprocess --trainpref .....

--

For people who want to know how fairseq-preprocess, fairseq-train, and fairseq-generate work: use vscode or any editor that lets you set breakpoints easily and inspect variables' values while paused. Run with a very small model (e.g. 2 encoder layers, 2 decoder layers, 64 embedding dimensions, 256 FFN dimensions, or even smaller).
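
As a sketch of such a debug run (all flags are standard fairseq-train options; the sizes are deliberately tiny, --cpu keeps it off the GPU, and data-bin/my-de-en is the placeholder destdir from the preprocessing sketch above):

fairseq-train data-bin/my-de-en \
  --cpu \
  --arch transformer \
  --encoder-layers 2 --decoder-layers 2 \
  --encoder-embed-dim 64 --decoder-embed-dim 64 \
  --encoder-ffn-embed-dim 256 --decoder-ffn-embed-dim 256 \
  --optimizer adam --lr 0.0005 \
  --max-tokens 1024 \
  --criterion label_smoothed_cross_entropy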