facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Trouble with prepare-iwslt14.sh #1493

Closed: smpotdar closed this issue 2 years ago

smpotdar commented 4 years ago

I am following the tutorial at this link and running the following:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096

I am getting the following errors. Can anyone help?

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_iwslt_de_en', attention_dropout=0.0, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/iwslt14.tokenized.de-en', dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, empty_cache_freq=0, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, layer_wise_attention=False, layernorm_embedding=False, lazy_load=False, left_pad_source='True', left_pad_target='False', load_alignments=False, log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_tokens_valid=4096, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_token_positional_embeddings=False, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=-1, warmup_updates=4000, weight_decay=0.0001)
| [de] dictionary: 40 types
| [en] dictionary: 40 types
Traceback (most recent call last):
  File "/home/smpotdar/anaconda3/envs/pytorch/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq==0.9.0', 'console_scripts', 'fairseq-train')()
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/train.py", line 333, in cli_main
    main(args)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/train.py", line 48, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=0)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/tasks/translation.py", line 219, in load_dataset
    truncate_source=self.args.truncate_source,
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/tasks/translation.py", line 54, in load_langpair_dataset
    src_dataset = data_utils.load_indexed_dataset(prefix + src, src_dict, dataset_impl)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/data/data_utils.py", line 77, in load_indexed_dataset
    dictionary=dictionary,
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/data/indexed_dataset.py", line 60, in make_dataset
    return MMapIndexedDataset(path)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/data/indexed_dataset.py", line 448, in __init__
    self._do_init(path)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/data/indexed_dataset.py", line 461, in _do_init
    self._bin_buffer_mmap = np.memmap(data_file_path(self._path), mode='r', order='C')
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: cannot mmap an empty file
Exception ignored in: <bound method MMapIndexedDataset.__del__ of <fairseq.data.indexed_dataset.MMapIndexedDataset object at 0x7fa53dd9ce10>>
Traceback (most recent call last):
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq/data/indexed_dataset.py", line 465, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'

myleott commented 4 years ago

The error is actually "ValueError: cannot mmap an empty file". Are you sure you preprocessed the data correctly?
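A quick way to check, as a sketch (assuming the data-bin/iwslt14.tokenized.de-en destdir from the tutorial): a zero-byte .bin file in the binarized data directory will produce exactly this mmap error.

ls -la data-bin/iwslt14.tokenized.de-en/
# print any empty binarized files
find data-bin/iwslt14.tokenized.de-en -name '*.bin' -size 0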

smpotdar commented 4 years ago

Okay, I realized there is something wrong with my pre-processing. I am running the following commands:

TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 2

I get the following output. What am I doing wrong?

Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/iwslt14.tokenized.de-en', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='de', srcdict=None, target_lang='en', task='translation', tensorboard_logdir='', testpref='examples/translation/iwslt14.tokenized.de-en/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='examples/translation/iwslt14.tokenized.de-en/train', user_dir=None, validpref='examples/translation/iwslt14.tokenized.de-en/valid', workers=2)
| [de] Dictionary: 39 types
| [de] examples/translation/iwslt14.tokenized.de-en/train.de: 1 sents, 48 tokens, 0.0% replaced by <unk>
| [de] Dictionary: 39 types
Traceback (most recent call last):
  File "/home/smpotdar/anaconda3/envs/pytorch/bin/fairseq-preprocess", line 11, in <module>
    load_entry_point('fairseq==0.9.0', 'console_scripts', 'fairseq-preprocess')()
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 346, in cli_main
    main(args)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 245, in main
    make_all(args.source_lang, src_dict)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 231, in make_all
    make_dataset(vocab, validpref, outprefix, lang, num_workers=args.workers)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 223, in make_dataset
    make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
  File "/home/smpotdar/anaconda3/envs/pytorch/lib/python3.6/site-packages/fairseq_cli/preprocess.py", line 155, in make_binary_dataset
    100 * sum(replaced.values()) / n_seq_tok[1],
ZeroDivisionError: division by zero

myleott commented 4 years ago

It looks like examples/translation/iwslt14.tokenized.de-en/train.de is empty... why is that?
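For example, something like this (paths taken from your fairseq-preprocess command) would show whether the prepared text files actually contain any sentences:

wc -l examples/translation/iwslt14.tokenized.de-en/train.de \
      examples/translation/iwslt14.tokenized.de-en/train.en \
      examples/translation/iwslt14.tokenized.de-en/valid.de \
      examples/translation/iwslt14.tokenized.de-en/valid.en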

smpotdar commented 4 years ago

I don't know why that is happening. I ran bash prepare-iwslt14.sh, which should download and prepare the dataset, and then used the above commands to pre-process. These are the file sizes I am getting. Apparently there is nothing in train.de and train.en, and I don't understand why:

tmp/
code      69 B
test.de   1.76 kB
test.en   852 kB
train.de  180 B
train.en  145 B
valid.de  0 B
valid.en  0 B

Am I missing a step, such as downloading the data separately before running the above commands?

edunov commented 4 years ago

bash prepare-iwslt14.sh should be enough. Can you please pipe the output of this command into a file and then copy it here? Are you behind a proxy by any chance?
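For example, one way to capture both stdout and stderr (the log file name is just a suggestion):

bash prepare-iwslt14.sh 2>&1 | tee prepare-iwslt14.log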

smpotdar commented 4 years ago

Here you go. This is what I get as output:

Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 175, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 147470 (delta 109), reused 121 (delta 76), pack-reused 147295
Receiving objects: 100% (147470/147470), 129.73 MiB | 23.66 MiB/s, done.
Resolving deltas: 100% (113943/113943), done.
Checking connectivity... done.
Checking out files: 100% (3467/3467), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 538 (delta 10), reused 21 (delta 7), pack-reused 509
Receiving objects: 100% (538/538), 226.98 KiB | 0 bytes/s, done.
Resolving deltas: 100% (316/316), done.
Checking connectivity... done.
Downloading data from https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz...
--2019-12-13 15:56:00--  https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz
Resolving wit3.fbk.eu (wit3.fbk.eu)... 217.77.80.8
Connecting to wit3.fbk.eu (wit3.fbk.eu)|217.77.80.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19982877 (19M) [application/x-gzip]
Saving to: 'de-en.tgz'

de-en.tgz 100%[=================================>]  19.06M  4.51MB/s    in 4.2s

2019-12-13 15:56:05 (4.51 MB/s) - 'de-en.tgz' saved [19982877/19982877]

Data successfully downloaded.
de-en/
de-en/IWSLT14.TED.dev2010.de-en.de.xml
de-en/IWSLT14.TED.dev2010.de-en.en.xml
de-en/IWSLT14.TED.tst2010.de-en.de.xml
de-en/IWSLT14.TED.tst2010.de-en.en.xml
de-en/IWSLT14.TED.tst2011.de-en.de.xml
de-en/IWSLT14.TED.tst2011.de-en.en.xml
de-en/IWSLT14.TED.tst2012.de-en.de.xml
de-en/IWSLT14.TED.tst2012.de-en.en.xml
de-en/IWSLT14.TEDX.dev2012.de-en.de.xml
de-en/IWSLT14.TEDX.dev2012.de-en.en.xml
de-en/README
de-en/train.en
de-en/train.tags.de-en.de
de-en/train.tags.de-en.en
pre-processing train data...
Tokenizer Version 1.1
Language: de
Number of threads: 8

Tokenizer Version 1.1
Language: en
Number of threads: 8

clean-corpus.perl: processing iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.de & .en to iwslt14.tokenized.de-en/tmp/train.tags.de-en.clean, cutoff 1-175, ratio 1.5
iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.
pre-processing valid/test data...
orig/de-en/IWSLT14.TED.dev2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2011.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TEDX.dev2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.dev2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2011.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TEDX.dev2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

creating train, valid, test...
learn_bpe.py on iwslt14.tokenized.de-en/tmp/train.en-de...
no pair has frequency >= 2. Stopping
apply_bpe.py to train.de...
apply_bpe.py to valid.de...
apply_bpe.py to test.de...
apply_bpe.py to train.en...
apply_bpe.py to valid.en...
apply_bpe.py to test.en...

myleott commented 4 years ago

I'm copying my log below. Compared to yours, you seem to have a Moses error:

iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.

Can you try cloning a new copy of moses?
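For example, a sketch (this assumes you run from the examples/translation directory, and that the prepare script re-clones anything that is missing on the next run):

# remove the possibly corrupted clone and the partial outputs, then re-run
rm -rf mosesdecoder subword-nmt orig iwslt14.tokenized.de-en
bash prepare-iwslt14.sh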

Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (181/181), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 147476 (delta 114), reused 122 (delta 76), pack-reused 147295
Receiving objects: 100% (147476/147476), 129.73 MiB | 21.94 MiB/s, done.
Resolving deltas: 100% (113948/113948), done.
Checking out files: 100% (3467/3467), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 538 (delta 10), reused 21 (delta 7), pack-reused 509
Receiving objects: 100% (538/538), 226.98 KiB | 1.33 MiB/s, done.
Resolving deltas: 100% (316/316), done.
Downloading data from https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz...
--2019-12-16 09:33:43--  https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz
Resolving wit3.fbk.eu (wit3.fbk.eu)... 217.77.80.8
Connecting to wit3.fbk.eu (wit3.fbk.eu)|217.77.80.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19982877 (19M) [application/x-gzip]
Saving to: ‘de-en.tgz’

de-en.tgz                                                                                       100%[=====================================================================================================================================================================================================================================================>]  19.06M   340KB/s    in 34s

2019-12-16 09:34:18 (569 KB/s) - ‘de-en.tgz’ saved [19982877/19982877]

Data successfully downloaded.
de-en/
de-en/IWSLT14.TED.dev2010.de-en.de.xml
de-en/IWSLT14.TED.dev2010.de-en.en.xml
de-en/IWSLT14.TED.tst2010.de-en.de.xml
de-en/IWSLT14.TED.tst2010.de-en.en.xml
de-en/IWSLT14.TED.tst2011.de-en.de.xml
de-en/IWSLT14.TED.tst2011.de-en.en.xml
de-en/IWSLT14.TED.tst2012.de-en.de.xml
de-en/IWSLT14.TED.tst2012.de-en.en.xml
de-en/IWSLT14.TEDX.dev2012.de-en.de.xml
de-en/IWSLT14.TEDX.dev2012.de-en.en.xml
de-en/README
de-en/train.en
de-en/train.tags.de-en.de
de-en/train.tags.de-en.en
pre-processing train data...
Tokenizer Version 1.1
Language: de
Number of threads: 8

Tokenizer Version 1.1
Language: en
Number of threads: 8

clean-corpus.perl: processing iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.de & .en to iwslt14.tokenized.de-en/tmp/train.tags.de-en.clean, cutoff 1-175, ratio 1.5
..........(100000).......
Input sentences: 174443  Output sentences:  167522
pre-processing valid/test data...
orig/de-en/IWSLT14.TED.dev2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2011.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TEDX.dev2012.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.de
Tokenizer Version 1.1
Language: de
Number of threads: 8

orig/de-en/IWSLT14.TED.dev2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2010.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2011.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2011.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TED.tst2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

orig/de-en/IWSLT14.TEDX.dev2012.de-en.en.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TEDX.dev2012.de-en.en
Tokenizer Version 1.1
Language: en
Number of threads: 8

creating train, valid, test...
learn_bpe.py on iwslt14.tokenized.de-en/tmp/train.en-de...
apply_bpe.py to train.de...
apply_bpe.py to valid.de...
apply_bpe.py to test.de...
apply_bpe.py to train.en...
apply_bpe.py to valid.en...
apply_bpe.py to test.en...
smpotdar commented 4 years ago

I tried that. I am still getting the same statement:

iwslt14.tokenized.de-en/tmp/train.tags.de-en.tok.en is too long! at mosesdecoder/scripts/training/clean-corpus-n.perl line 154, <E> line 4.

MovingKyu commented 3 years ago

It is a multiprocessing issue. Set the number of threads to 1 in the prepare-iwslt14.sh file.
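Concretely, a sketch (this assumes the tokenizer calls in examples/translation/prepare-iwslt14.sh pass -threads 8, which matches the "Number of threads: 8" lines in the logs above):

# switch the Moses tokenizer to a single thread, then re-run from scratch
sed -i 's/-threads 8/-threads 1/g' prepare-iwslt14.sh
rm -rf orig iwslt14.tokenized.de-en
bash prepare-iwslt14.sh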

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!