AssertionError: Source and target languages should be provided

facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

MIT License


AssertionError: Source and target languages should be provided #86

Closed by youshimanon 6 years ago

youshimanon commented 6 years ago

When I was training the model, I got the error:

Traceback (most recent call last):
  File "train.py", line 269, in <module>
    main()
  File "train.py", line 51, in main
    dataset = data.load_raw_text_dataset(args.data, splits, args.source_lang, args.target_lang)
  File "/scratch/jiajie.ding/module/fairseq-py/fairseq/data.py", line 103, in load_raw_text_dataset
    assert src is not None and dst is not None, 'Source and target languages should be provided'
AssertionError: Source and target languages should be provided

What does this mean? And how do I correct this? Thanks

myleott commented 6 years ago

This happens if the source and target language can't be inferred automatically. Typically the language direction is inferred based on the directory/naming structure. For example, if your data directory contains files: train.de-en.de.bin, train.de-en.de.idx, train.de-en.en.bin, train.de-en.en.idx, then we assume that the source language is "de" and the target language is "en".

Maybe you're using a different naming/directory structure than the default? You can specify the languages explicitly with the --source-lang and --target-lang options.
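
The filename-based inference described above can be illustrated with a minimal sketch. It assumes the `train.SRC-TGT.LANG.idx` naming from the example; `infer_language_pair` is a hypothetical helper written for illustration, not fairseq's actual implementation.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches names such as "train.de-en.de.idx".
# The pattern is an assumption based on the example file names in this thread.
_IDX_PATTERN = re.compile(
    r"^train\.(?P<src>[^.\-]+)-(?P<tgt>[^.\-]+)\.(?P<lang>[^.]+)\.idx$"
)


def infer_language_pair(filenames: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (source, target) from the first matching index file, else None."""
    for name in sorted(filenames):
        match = _IDX_PATTERN.match(name)
        if match:
            return match.group("src"), match.group("tgt")
    return None


preprocessed = [
    "train.de-en.de.bin", "train.de-en.de.idx",
    "train.de-en.en.bin", "train.de-en.en.idx",
]
print(infer_language_pair(preprocessed))            # ('de', 'en')
print(infer_language_pair(["train.de", "train.en"]))  # None
```

With only raw text files such as `train.de` and `train.en` in the directory, nothing matches and no language pair can be inferred this way.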

youshimanon commented 6 years ago

I'm using the same naming/directory structure as the default. And I have specified the languages with --source-lang de and --target-lang en. The commands are:

cd $PBS_O_WORKDIR
mkdir -p checkpoints/fconv
CUDA_VISIBLE_DEVICES=0 python train.py data/iwslt14.tokenized.de-en --source-lang de --target-lang en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

Is there anything wrong?

myleott commented 6 years ago

Ah, it seems you may not have preprocessed the dataset. Please run preprocess.py using the instructions in the README, and then rerun train.py with the path to the preprocessed directory (it should contain several .bin and .idx files).
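
Before rerunning train.py, one way to confirm that a directory has actually been preprocessed is to check for the expected .bin and .idx files. The sketch below assumes the `SPLIT.SRC-TGT.LANG.{bin,idx}` naming shown earlier in this thread; `missing_binary_files` is a hypothetical helper and is not part of fairseq.

```python
from pathlib import Path
from typing import List


def missing_binary_files(
    data_dir: str, src: str, tgt: str, split: str = "train"
) -> List[str]:
    """List the expected preprocessed files that are absent from data_dir."""
    root = Path(data_dir)
    # File naming is assumed from the example in this thread.
    expected = [
        f"{split}.{src}-{tgt}.{lang}.{ext}"
        for lang in (src, tgt)
        for ext in ("bin", "idx")
    ]
    return [name for name in expected if not (root / name).is_file()]


missing = missing_binary_files("data/iwslt14.tokenized.de-en", "de", "en")
if missing:
    print("Not preprocessed yet; missing:", ", ".join(missing))
else:
    print("Found all expected .bin and .idx files for the train split.")
```

If any files are reported missing, the directory most likely still holds raw text and needs to go through preprocess.py first.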