Closed: smpotdar closed this issue 2 years ago.
The error is actually "ValueError: cannot mmap an empty file". Are you sure you preprocessed the data correctly?
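For context, "cannot mmap an empty file" is raised by Python's `mmap` module whenever it is asked to map a zero-byte file; fairseq hits the same condition when a binarized split ends up empty. A minimal sketch of the underlying error (the file name here is hypothetical, unrelated to the fairseq data):

```shell
# Hypothetical reproduction: mmap() refuses zero-byte files.
touch /tmp/empty.bin
python3 - <<'EOF'
import mmap
with open('/tmp/empty.bin', 'rb') as f:
    try:
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except ValueError as e:
        print(e)  # -> cannot mmap an empty file
EOF
```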
Okay I realized there is something wrong with my pre-processing.
I am running the following commands:
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 2
I get the following output. What am I doing wrong?
Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/iwslt14.tokenized.de-en', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='de', srcdict=None, target_lang='en', task='translation', tensorboard_logdir='', testpref='examples/translation/iwslt14.tokenized.de-en/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='examples/translation/iwslt14.tokenized.de-en/train', user_dir=None, validpref='examples/translation/iwslt14.tokenized.de-en/valid', workers=2)
| [de] Dictionary: 39 types
| [de] examples/translation/iwslt14.tokenized.de-en/train.de: 1 sents, 48 tokens, 0.0% replaced by
Traceback (most recent call last):
  File "/home/smpotdar/anaconda3/envs/pytorch/bin/fairseq-preprocess", line 11, in <module>
    load_entry_point('fairseq==0.9.0', 'console_scripts', 'fairseq-preprocess')()
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 346, in cli_main
    main(args)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 245, in main
    make_all(args.source_lang, src_dict)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 231, in make_all
    make_dataset(vocab, validpref, outprefix, lang, num_workers=args.workers)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 223, in make_dataset
    make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 155, in make_binary_dataset
    100 * sum(replaced.values()) / n_seq_tok[1],
ZeroDivisionError: division by zero
It looks like examples/translation/iwslt14.tokenized.de-en/train.de is empty... why is that?
I don't know why that is happening. I ran bash prepare-iwslt14.sh, which should download and prepare the dataset, and then used the commands above to pre-process. These are the file sizes I get; apparently there is nothing in train.de and train.en, and I don't understand why:
tmp
code       69 B
test.de    1.76 kB
test.en    852 kB
train.de   180 B
train.en   145 B
valid.de   0 B
valid.en   0 B
Am I missing a step such as downloading the data separately and using the above scripts?
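One quick sanity check before re-running fairseq-preprocess is to verify that no split is empty. A sketch, assuming the script's default output directory iwslt14.tokenized.de-en (run it from inside that directory):

```shell
# Flags any split file that is empty or missing; a zero-byte file here
# is what later causes fairseq-preprocess to fail.
for f in train valid test; do
    for l in de en; do
        [ -s "$f.$l" ] || echo "WARNING: $f.$l is empty or missing"
    done
done
```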
bash prepare-iwslt14.sh should be enough. Can you please pipe the output of this command into a file and then copy it here? Are you behind a proxy by any chance?
Here you go. This is what I get as output:
Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 175, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 147470 (delta 109), reused 121 (delta 76), pack-reused 147295
Receiving objects: 100% (147470/147470), 129.73 MiB | 23.66 MiB/s, done.
Resolving deltas: 100% (113943/113943), done.
Checking connectivity... done.
Checking out files: 100% (3467/3467), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 538 (delta 10), reused 21 (delta 7), pack-reused 509
Receiving objects: 100% (538/538), 226.98 KiB | 0 bytes/s, done.
Resolving deltas: 100% (316/316), done.
Checking connectivity... done.
Downloading data from https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz...
--2019-12-13 15:56:00--  https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz
Resolving wit3.fbk.eu (wit3.fbk.eu)... 217.77.80.8
Connecting to wit3.fbk.eu (wit3.fbk.eu)|217.77.80.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19982877 (19M) [application/x-gzip]
Saving to: 'de-en.tgz'
de-en.tgz 100%[=================================>] 19.06M 4.51MB/s in 4.2s
2019-12-13 15:56:05 (4.51 MB/s) - 'de-en.tgz' saved [19982877/19982877]
Data successfully downloaded.
de-en/
de-en/IWSLT14.TED.dev2010.de-en.de.xml
de-en/IWSLT14.TED.dev2010.de-en.en.xml
de-en/IWSLT14.TED.tst2010.de-en.de.xml
de-en/IWSLT14.TED.tst2010.de-en.en.xml
de-en/IWSLT14.TED.tst2011.de-en.de.xml
de-en/IWSLT14.TED.tst2011.de-en.en.xml
de-en/IWSLT14.TED.tst2012.de-en.de.xml
de-en/IWSLT14.TED.tst2012.de-en.en.xml
de-en/IWSLT14.TEDX.dev2012.de-en.de.xml
de-en/IWSLT14.TEDX.dev2012.de-en.en.xml
de-en/README
de-en/train.en
de-en/train.tags.de-en.de
de-en/train.tags.de-en.en
pre-processing train data...
Tokenizer Version 1.1
Language: de
Number of threads: 8
Tokenizer Version 1.1
Language: en
Number of threads: 8
clean-corpus.perl: processing iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.de & .en to iwslt14.tokenized.de-en/tmp/train.tags.de-en.clean, cutoff 1-175, ratio 1.5
iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.
pre-processing valid/test data...
orig/de-en/IWSLT14.TED.dev2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2011.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TEDX.dev2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.dev2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2011.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TEDX.dev2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
creating train, valid, test...
learn_bpe.py on iwslt14.tokenized.de-en/tmp/train.en-de...
no pair has frequency >= 2. Stopping
apply_bpe.py to train.de...
apply_bpe.py to valid.de...
apply_bpe.py to test.de...
apply_bpe.py to train.en...
apply_bpe.py to valid.en...
apply_bpe.py to test.en...
I'm copying my log below. Compared to yours, you seem to have a Moses error:
iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.
Can you try cloning a new copy of Moses?
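One way to do that is to start from a clean state. A sketch, run from examples/translation; the directory names are the ones visible in the logs above (the two cloned repos plus the generated orig/ and iwslt14.tokenized.de-en/ directories):

```shell
# Remove the cloned repos and any partially generated data, then re-run
# the preparation script from scratch (it re-clones and re-downloads).
rm -rf mosesdecoder subword-nmt orig iwslt14.tokenized.de-en
bash prepare-iwslt14.sh
```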
Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (181/181), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 147476 (delta 114), reused 122 (delta 76), pack-reused 147295
Receiving objects: 100% (147476/147476), 129.73 MiB | 21.94 MiB/s, done.
Resolving deltas: 100% (113948/113948), done.
Checking out files: 100% (3467/3467), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 538 (delta 10), reused 21 (delta 7), pack-reused 509
Receiving objects: 100% (538/538), 226.98 KiB | 1.33 MiB/s, done.
Resolving deltas: 100% (316/316), done.
Downloading data from https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz...
--2019-12-16 09:33:43-- https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz
Resolving wit3.fbk.eu (wit3.fbk.eu)... 217.77.80.8
Connecting to wit3.fbk.eu (wit3.fbk.eu)|217.77.80.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19982877 (19M) [application/x-gzip]
Saving to: ‘de-en.tgz’
de-en.tgz 100%[=====================================================================================================================================================================================================================================================>] 19.06M 340KB/s in 34s
2019-12-16 09:34:18 (569 KB/s) - ‘de-en.tgz’ saved [19982877/19982877]
Data successfully downloaded.
de-en/
de-en/IWSLT14.TED.dev2010.de-en.de.xml
de-en/IWSLT14.TED.dev2010.de-en.en.xml
de-en/IWSLT14.TED.tst2010.de-en.de.xml
de-en/IWSLT14.TED.tst2010.de-en.en.xml
de-en/IWSLT14.TED.tst2011.de-en.de.xml
de-en/IWSLT14.TED.tst2011.de-en.en.xml
de-en/IWSLT14.TED.tst2012.de-en.de.xml
de-en/IWSLT14.TED.tst2012.de-en.en.xml
de-en/IWSLT14.TEDX.dev2012.de-en.de.xml
de-en/IWSLT14.TEDX.dev2012.de-en.en.xml
de-en/README
de-en/train.en
de-en/train.tags.de-en.de
de-en/train.tags.de-en.en
pre-processing train data...
Tokenizer Version 1.1
Language: de
Number of threads: 8
Tokenizer Version 1.1
Language: en
Number of threads: 8
clean-corpus.perl: processing iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.de & .en to iwslt14.tokenized.de-en/tmp/train.tags.de-en.clean, cutoff 1-175, ratio 1.5
..........(100000).......
Input sentences: 174443 Output sentences: 167522
pre-processing valid/test data...
orig/de-en/IWSLT14.TED.dev2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2011.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TEDX.dev2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8
orig/de-en/IWSLT14.TED.dev2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2011.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TED.tst2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
orig/de-en/IWSLT14.TEDX.dev2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8
creating train, valid, test...
learn_bpe.py on iwslt14.tokenized.de-en/tmp/train.en-de...
apply_bpe.py to train.de...
apply_bpe.py to valid.de...
apply_bpe.py to test.de...
apply_bpe.py to train.en...
apply_bpe.py to valid.en...
apply_bpe.py to test.en...
I tried that. I am still getting the same error:
iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.
This is a multiprocessing issue. Set the number of threads to 1 in the prepare-iwslt14.sh file.
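A sketch of that change, assuming the tokenizer calls in prepare-iwslt14.sh pass the literal flag `-threads 8` (the logs above print "Number of threads: 8"); verify the exact spelling in your copy of the script first:

```shell
# Check where the thread count is set, then drop it to a single thread.
grep -n -- '-threads' prepare-iwslt14.sh
sed -i 's/-threads 8/-threads 1/g' prepare-iwslt14.sh
```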
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
I am following the tutorial given on this link and running the following:
CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096
I am getting the following errors; can anyone help?