facebookresearch / access

Code to reproduce the experiments from the paper.

python scripts/train.py #11

Closed qiunlp closed 4 years ago

qiunlp commented 4 years ago

Sorry to ask for your help again.

1) python scripts/evaluate.py: OK

2) python scripts/generate.py < my_file.complex: OK

3) python scripts/train.py-------------error Training a model from scratch method_name='fairseq_train_and_evaluate' args=() kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}} Creating /home/qwh/桌面/access/resources/datasets/wikilarge/fairseq_preprocessed... usage: train.py [-h] [--no-progress-bar] [--log-interval N] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16] [--memory-efficient-fp16] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale D] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--criterion {sentence_prediction,binary_cross_entropy,cross_entropy,sentence_ranking,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,composite_loss,adaptive_loss,masked_lm,nat_loss}] [--tokenizer {moses,nltk,space}] [--bpe {gpt2,sentencepiece,bert,subword_nmt,fastbpe}] [--optimizer {nag,adam,adafactor,adamax,sgd,adadelta,adagrad}] [--lr-scheduler {cosine,polynomial_decay,triangular,inverse_sqrt,tri_stage,reduce_lr_on_plateau,fixed}] [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N] [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] 
[--joined-dictionary] [--only-source] [--padding-factor N] [--workers N] train.py: error: unrecognized arguments: --output-format raw Error: Rolling back creation of directory /home/qwh/桌面/access/resources/datasets/wikilarge/fairseq_preprocessed

4) I think the problem is in access/fairseq/base.py, so I deleted the '--output-format', 'raw' arguments from this function:

    def fairseq_preprocess(dataset):
        dataset_dir = get_dataset_dir(dataset)
        with lock_directory(dataset_dir):
            preprocessed_dir = dataset_dir / 'fairseq_preprocessed'
            with create_directory_or_skip(preprocessed_dir):
                preprocessing_parser = options.get_preprocessing_parser()
                preprocess_args = preprocessing_parser.parse_args([
                    '--source-lang', 'complex',
                    '--target-lang', 'simple',
                    '--trainpref', os.path.join(dataset_dir, f'{dataset}.train'),
                    '--validpref', os.path.join(dataset_dir, f'{dataset}.valid'),
                    '--testpref', os.path.join(dataset_dir, f'{dataset}.test'),
                    '--destdir', str(preprocessed_dir),
                    '--output-format', 'raw',
                ])
                preprocess.main(preprocess_args)
            return preprocessed_dir

5) then I python scripts/train.py Training a model from scratch method_name='fairseq_train_and_evaluate' args=() kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}} usage: train.py [-h] [--no-progress-bar] [--log-interval N] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16] [--memory-efficient-fp16] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale D] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--criterion {sentence_prediction,binary_cross_entropy,cross_entropy,sentence_ranking,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,composite_loss,adaptive_loss,masked_lm,nat_loss}] [--tokenizer {moses,nltk,space}] [--bpe {gpt2,sentencepiece,bert,subword_nmt,fastbpe}] [--optimizer {nag,adam,adafactor,adamax,sgd,adadelta,adagrad}] [--lr-scheduler {cosine,polynomial_decay,triangular,inverse_sqrt,tri_stage,reduce_lr_on_plateau,fixed}] [--task TASK] [--num-workers N] [--skip-invalid-size-inputs-valid-test] [--max-tokens N] [--max-sentences N] [--required-batch-size-multiple N] [--dataset-impl FORMAT] [--train-subset SPLIT] [--valid-subset SPLIT] [--validate-interval N] [--fixed-validation-seed N] [--disable-validation] [--max-tokens-valid N] [--max-sentences-valid N] [--curriculum N] 
[--distributed-world-size N] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] --arch ARCH [--max-epoch N] [--max-update N] [--clip-norm NORM] [--sentence-avg] [--update-freq N1,N2,...,N_K] [--lr LR_1,LR_2,...,LR_N] [--min-lr LR] [--use-bmuf] [--save-dir DIR] [--restore-file RESTORE_FILE] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides DICT] [--save-interval N] [--save-interval-updates N] [--keep-interval-updates N] [--keep-last-epochs N] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--no-cross-attention] [--cross-self-attention] [--layer-wise-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--layernorm-embedding] [--no-scale-embedding] [--label-smoothing D] [--adam-betas B] [--adam-eps D] [--weight-decay WD] [--force-anneal N] [--lr-shrink 
LS] [--warmup-updates N] [-s SRC] [-t TARGET] [--lazy-load] [--raw-text] [--load-alignments] [--left-pad-source BOOL] [--left-pad-target BOOL] [--max-source-positions N] [--max-target-positions N] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] data train.py: error: unrecognized arguments: --validations-before-sari-early-stopping 10

6) I debugged it and found that it stops at access/fairseq/base.py, line 172.

7) I ran pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification, but the problem is still there.

louismartin commented 4 years ago

Please make sure that you install the exact versions of the packages listed in the requirements.txt file. You can fix this by running pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification

qiunlp commented 4 years ago

1) Sorry, I have already run pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification:

qwh@qwh-Legion-Y7000-2019-PG0:~/桌面/access$ pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification Requirement already satisfied: fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification from git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification in /home/qwh/.local/lib/python3.6/site-packages (0.9.0) Requirement already satisfied: cffi in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (1.14.0) Requirement already satisfied: cython in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (0.29.15) Requirement already satisfied: numpy in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (1.17.2) Requirement already satisfied: regex in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (2020.2.20) Requirement already satisfied: sacrebleu in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (1.4.4) Requirement already satisfied: torch in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (1.4.0) Requirement already satisfied: tqdm in /home/qwh/.local/lib/python3.6/site-packages (from fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (4.36.1) Requirement already satisfied: pycparser in /home/qwh/.local/lib/python3.6/site-packages (from cffi->fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (2.20) Requirement 
already satisfied: typing in /home/qwh/.local/lib/python3.6/site-packages (from sacrebleu->fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (3.7.4.1) Requirement already satisfied: portalocker in /home/qwh/.local/lib/python3.6/site-packages (from sacrebleu->fairseq@ git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification) (1.5.2)

2)but: /usr/bin/python3.6 /home/qwh/桌面/access/scripts/train.py Training a model from scratch method_name='fairseq_train_and_evaluate' args=() kwargs={'arch': 'transformer', 'warmup_updates': 4000, 'parametrization_budget': 256, 'beam': 8, 'dataset': 'wikilarge', 'dropout': 0.2, 'fp16': False, 'label_smoothing': 0.54, 'lr': 0.00011, 'lr_scheduler': 'fixed', 'max_epoch': 100, 'max_tokens': 5000, 'metrics_coefs': [0, 1, 0], 'optimizer': 'adam', 'preprocessors_kwargs': {'LengthRatioPreprocessor': {'target_ratio': 0.8}, 'LevenshteinPreprocessor': {'target_ratio': 0.8}, 'WordRankRatioPreprocessor': {'target_ratio': 0.8}, 'DependencyTreeDepthRatioPreprocessor': {'target_ratio': 0.8}, 'SentencePiecePreprocessor': {'vocab_size': 10000}}} usage: train.py [-h] [--no-progress-bar] [--log-interval N] [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16] [--memory-efficient-fp16] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale D] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--criterion {sentence_prediction,binary_cross_entropy,cross_entropy,sentence_ranking,legacy_masked_lm_loss,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,composite_loss,adaptive_loss,masked_lm,nat_loss}] [--tokenizer {moses,nltk,space}] [--bpe {gpt2,sentencepiece,bert,subword_nmt,fastbpe}] [--optimizer {nag,adam,adafactor,adamax,sgd,adadelta,adagrad}] [--lr-scheduler {cosine,polynomial_decay,triangular,inverse_sqrt,tri_stage,reduce_lr_on_plateau,fixed}] [--task TASK] [--num-workers N] [--skip-invalid-size-inputs-valid-test] [--max-tokens N] [--max-sentences N] [--required-batch-size-multiple N] [--dataset-impl FORMAT] [--train-subset SPLIT] [--valid-subset SPLIT] [--validate-interval N] [--fixed-validation-seed N] [--disable-validation] [--max-tokens-valid N] [--max-sentences-valid N] 
[--curriculum N] [--distributed-world-size N] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] --arch ARCH [--max-epoch N] [--max-update N] [--clip-norm NORM] [--sentence-avg] [--update-freq N1,N2,...,N_K] [--lr LR_1,LR_2,...,LR_N] [--min-lr LR] [--use-bmuf] [--save-dir DIR] [--restore-file RESTORE_FILE] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides DICT] [--save-interval N] [--save-interval-updates N] [--keep-interval-updates N] [--keep-last-epochs N] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--no-cross-attention] [--cross-self-attention] [--layer-wise-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--layernorm-embedding] [--no-scale-embedding] [--label-smoothing D] [--adam-betas B] [--adam-eps D] [--weight-decay WD] [--force-anneal 
N] [--lr-shrink LS] [--warmup-updates N] [-s SRC] [-t TARGET] [--lazy-load] [--raw-text] [--load-alignments] [--left-pad-source BOOL] [--left-pad-target BOOL] [--max-source-positions N] [--max-target-positions N] [--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source] data train.py: error: unrecognized arguments: --validations-before-sari-early-stopping 10

Process finished with exit code 2

3) I debugged it and found that it stops at access/fairseq/base.py, line 174: conflict_handler='error',
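
For context, the "unrecognized arguments" message and the exit code 2 shown above are standard argparse behaviour when a parser receives a flag it does not define. The following is a minimal sketch with a hypothetical stand-in parser (`build_parser` and `try_parse` are illustrative names, not fairseq's actual code), showing how the same failure arises and how `parse_known_args` can be used to list the offending flags:

```python
import argparse


def build_parser():
    # Hypothetical stand-in for a training parser; it defines only one flag.
    parser = argparse.ArgumentParser(prog="train.py", conflict_handler="error")
    parser.add_argument("--arch", required=True)
    return parser


def try_parse(argv):
    """Return the exit code argparse would use, or 0 if parsing succeeds."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0


argv = ["--arch", "transformer", "--validations-before-sari-early-stopping", "10"]

# argparse prints the usage text and "unrecognized arguments" to stderr, then exits.
exit_code = try_parse(argv)

# parse_known_args collects unknown flags instead of exiting, which helps when inspecting.
_, unknown = build_parser().parse_known_args(argv)
print(exit_code, unknown)
```

If the installed package's parser does not define a flag that the calling code passes, the flag ends up in `unknown`, which points to a version mismatch rather than a bug in the calling script.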

louismartin commented 4 years ago

This option should be available in the given version: https://github.com/louismartin/fairseq/blob/1e30f27b159daced4a7340d0d0d2c590311a6061/fairseq/options.py#L350.

You can check which version of fairseq you have by running pip freeze | grep fairseq; it should be fairseq==0.6.2.

Multiple solutions to fix the problem:

1) Reinstall fairseq: pip uninstall fairseq && pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification

2) Install fairseq from source:

git clone -b controllable-sentence-simplification https://github.com/louismartin/fairseq
cd fairseq
python setup.py install

3) Investigate what the problem is: run python -c "import fairseq; print(fairseq.__file__)", then go to the printed path and look in the options.py file. You should find the "--validations-before-sari-early-stopping" parameter there; if it is not there, the wrong version is installed.
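
The manual check in point 3) can also be scripted. This is a rough sketch, not part of ACCESS or fairseq: `flag_defined_in_package` is a hypothetical helper that locates an installed package without importing it and searches one of its files for a given string.

```python
import importlib.util
from pathlib import Path


def flag_defined_in_package(package, relative_file, flag):
    """Return True/False if `flag` appears in the package's file, None if unavailable."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package is not importable in this environment
    source_file = Path(spec.origin).parent / relative_file
    if not source_file.is_file():
        return None  # the package exists but the file does not
    return flag in source_file.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    result = flag_defined_in_package(
        "fairseq", "options.py", "--validations-before-sari-early-stopping"
    )
    messages = {
        True: "flag found",
        False: "flag missing: probably the wrong fairseq version",
        None: "fairseq or options.py not found",
    }
    print(messages[result])
```

Run it with the same interpreter that runs scripts/train.py, since a different interpreter may see a different installation.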

qiunlp commented 4 years ago

1) pip uninstall fairseq && pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification

2) pip freeze | grep fairseq
WARNING: Could not generate requirement for distribution -ltk 3.4.3 (/usr/local/lib/python3.6/dist-packages): Parse error at "'-ltk==3.'": Expected W:(abcd...)
fairseq==0.9.0

louismartin commented 4 years ago

What is your version of pip? You can check it with pip --version

qiunlp commented 4 years ago

pip 20.0.2 from /home/qwh/.local/lib/python3.6/site-packages/pip (python 3.6)

louismartin commented 4 years ago

That's weird, I have pip 20.0.2 as well, and running:

pip uninstall fairseq
pip install fairseq@git+https://github.com/louismartin/fairseq.git@controllable-sentence-simplification
pip freeze | grep fairseq

outputs fairseq==0.6.2

louismartin commented 4 years ago

I guess you should install fairseq from source then, see point 2).

louismartin commented 4 years ago

This does not seem to be related to ACCESS, so I can't help you on this one; you'll have to dig to find the answer.

qiunlp commented 4 years ago

It's training............

| epoch 001: 1000 / 2567 loss=12.331, nll_loss=10.524, ppl=1472.39, wps=777, ups=0, wpb=2980.673, bsz=116.965, num_updates=1001, lr=2.75275e-05, gnorm=1.434, clip=1.000, oom=0.000, wall=3845, train_wall=3830
| epoch 001: 2000 / 2567 loss=12.008, nll_loss=9.795, ppl=888.43, wps=780, ups=0, wpb=2958.093, bsz=114.896, num_updates=2001, lr=5.50275e-05, gnorm=1.267, clip=1.000, oom=0.000, wall=7591, train_wall=7568
| epoch 001 | loss 11.874 | nll_loss 9.481 | ppl 714.67 | wps 783 | ups 0 | wpb 2959.828 | bsz 115.466 | num_updates 2567 | lr 7.05925e-05 | gnorm 1.197 | clip 1.000 | oom 0.000 | wall 9712 | train_wall 9685
num_updates=2567 ts_scores={'BLEU': 1.44, 'SARI': 20.41, 'FKGL': 0, 'Compression ratio': 0.7, 'Sentence splits': 1.1, 'Levenshtein similarity': 0.36, 'Exact matches': 0.0, 'Additions proportion': 0.5, 'Deletions proportion': 0.68, 'Lexical complexity score': 5.61}

It seems to work! If I want to train on only part of WikiLarge, how should I do it? Thank you!

angelo-megna94 commented 4 years ago

Hi! Thanks for the advice, it helped me! But I have a "new" problem. After that ->

I get fairseq==0.6.2 as output, and that's OK.

But when I run train.py, I get this error:

Traceback (most recent call last):
  File "/Users/angelo/PycharmProjects/Prova/access/scripts/train.py", line 49, in <module>
    fairseq_train_and_evaluate(**kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 18, in wrapped_func
    return func(*args, **kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 29, in wrapped_func
    return func(*args, **kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 38, in wrapped_func
    result = func(*args, **kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 50, in wrapped_func
    result = func(*args, **kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/fairseq/main.py", line 121, in fairseq_train_and_evaluate
    fairseq_train(preprocessed_dir, exp_dir=exp_dir, **train_kwargs)
  File "/Users/angelo/PycharmProjects/Prova/access/access/fairseq/base.py", line 175, in fairseq_train
    train.main(train_args)
  File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 42, in main
    load_dataset_splits(task, ['train', 'valid'])
  File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 479, in load_dataset_splits
    task.load_dataset(split, combine=True)
  File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 166, in load_dataset
    raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path))
FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed)

I can't figure out how to fix this. Can you or someone else help me?

In case it helps, here is what the _2912c535c2343258d2e6375bca3e3a3d folder contains:

Thanks

louismartin commented 4 years ago

Hi, thanks for the detailed description. Could you please give me the names of the .bin and .idx files in the fairseq_preprocessed dir?

angelo-megna94 commented 4 years ago

Yes, of course.

test.complex-simple.complex.bin
test.complex-simple.complex.idx
test.complex-simple.simple.bin
test.complex-simple.simple.idx
train.complex-simple.complex.bin
train.complex-simple.complex.idx
train.complex-simple.simple.bin
train.complex-simple.simple.idx
valid.complex-simple.complex.bin
valid.complex-simple.complex.idx
valid.complex-simple.simple.bin
valid.complex-simple.simple.idx

louismartin commented 4 years ago

Thanks. That's a very weird error. You should try to debug this by using pdb. Please add import pdb; pdb.set_trace() just before the exception: https://github.com/louismartin/fairseq/blob/1e30f27b159daced4a7340d0d0d2c590311a6061/fairseq/tasks/translation.py#L157 Or for you in this file: /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py

Then you should check whether the files that fairseq tries to access actually exist.

angelo-megna94 commented 4 years ago

Thanks for the advice, and sorry for answering only now; strangely, I did not receive any notification. Anyway, I did as you said, but the problem persists. Now I'm repeating the entire process to see if anything changes; I'll update you as soon as possible!

angelo-megna94 commented 4 years ago

UPDATE: repeating the whole process was useless :) It always gives me the same error, and despite your advice (import pdb; pdb.set_trace()) I can't solve it. I don't know if it helps, but I'll paste the script output up to the error here; maybe you can find some useful information:

Namespace(alignfile=None, cpu=False, destdir='/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, output_format='binary', padding_factor=8, seed=1, source_lang='complex', srcdict=None, target_lang='simple', task='translation', tensorboard_logdir='', testpref='/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, trainpref='/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.train', user_dir=None, validpref='/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.valid', workers=1) | [complex] Dictionary: 10175 types | [complex] /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.train.complex: 296402 sents, 11586925 tokens, 0.0% replaced by | [complex] Dictionary: 10175 types | [complex] /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.valid.complex: 992 sents, 38746 tokens, 0.0% replaced by | [complex] Dictionary: 10175 types | [complex] /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.test.complex: 359 sents, 12673 tokens, 0.0% replaced by | [simple] Dictionary: 10047 types | [simple] 
/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.train.simple: 296402 sents, 7597878 tokens, 0.0% replaced by | [simple] Dictionary: 10047 types | [simple] /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.valid.simple: 992 sents, 25483 tokens, 0.0% replaced by | [simple] Dictionary: 10047 types | [simple] /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/_2912c535c2343258d2e6375bca3e3a3d.test.simple: 359 sents, 10579 tokens, 0.0% replaced by | Wrote preprocessed data to /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.0, bucket_cap_mb=25, clip_norm=0.1, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed'], ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.54, lazy_load=False, 
left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.00011], lr_scheduler='fixed', lr_shrink=0.5, max_epoch=100, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=True, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=True, relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='/Users/angelo/PycharmProjects/Prova/access/experiments/fairseq/local_1586443380422/checkpoints', save_interval=1, save_interval_updates=5000, seed=445, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='complex', target_lang='simple', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, validations_before_sari_early_stopping=10.0, warmup_updates=4000, weight_decay=0.0001) | [complex] dictionary: 10176 types | [simple] dictionary: 10048 types Traceback (most recent call last): File "/Users/angelo/PycharmProjects/Prova/access/scripts/train.py", line 49, in fairseq_train_and_evaluate(kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 18, in wrapped_func return func(*args, *kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 29, in wrapped_func return func(args, kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 38, in wrapped_func result = func(*args, *kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/utils/training.py", line 50, in wrapped_func 
result = func(args, kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/fairseq/main.py", line 121, in fairseq_train_and_evaluate fairseq_train(preprocessed_dir, exp_dir=exp_dir, train_kwargs) File "/Users/angelo/PycharmProjects/Prova/access/access/fairseq/base.py", line 175, in fairseq_train train.main(train_args) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 42, in main load_dataset_splits(task, ['train', 'valid']) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 479, in load_dataset_splits task.load_dataset(split, combine=True) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 166, in load_dataset raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)) FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed)

P.S. Sorry if I'm wasting your time.

angelo-megna94 commented 4 years ago

UPDATE 2: Using pdb, I saw that the function split_exists(split, src, tgt, lang, data_path) (in translation.py) returns False, so the if/elif chain in these lines:

            if split_exists(split_k, src, tgt, src, data_path):
                pdb.set_trace()
                prefix = os.path.join(data_path, '{}.{}-{}.'.format(split_k, src, tgt))
                pdb.set_trace()
            elif split_exists(split_k, tgt, src, src, data_path):
                prefix = os.path.join(data_path, '{}.{}-{}.'.format(split_k, tgt, src))
            else:
                if k > 0 or dk > 0:
                    break
                else:
                    raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path))

ends in raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)), even though the /_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed dir contains all the files.

I am going crazy!

louismartin commented 4 years ago

Thanks for all the details. Then we need to understand why split_exists returns False. Can you please "step into" the split_exists function using the pdb command s? That will take you inside the split_exists function so you can see why it returns False.

louismartin commented 4 years ago

You will probably also have to step in subsequent function calls such as IndexedRawTextDataset.exists()

        def split_exists(split, src, tgt, lang, data_path):
            filename = os.path.join(data_path, '{}.{}-{}.{}'.format(split, src, tgt, lang))
            if self.args.raw_text and IndexedRawTextDataset.exists(filename):
                return True
            elif not self.args.raw_text and IndexedDataset.exists(filename):
                return True
            return False
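
The same lookup can be checked outside fairseq, without stepping through pdb. This is a rough sketch, not fairseq's code: `check_split_files` is a hypothetical helper, the file-name pattern is copied from split_exists above, and the `.idx`/`.bin` suffix check for the binary case is an assumption about what IndexedDataset.exists looks for.

```python
import os
import tempfile


def check_split_files(data_path, split, src, tgt, lang):
    """Report which on-disk forms of one split file exist."""
    # Same naming pattern as split_exists in the snippet above.
    prefix = os.path.join(data_path, '{}.{}-{}.{}'.format(split, src, tgt, lang))
    return {
        # Raw-text case: the file named exactly like the prefix must exist.
        'raw_text': os.path.exists(prefix),
        # Assumption: the binary case needs both an index and a data file.
        'binary': os.path.exists(prefix + '.idx') and os.path.exists(prefix + '.bin'),
    }


# Demo on a temporary directory that contains only .bin/.idx files.
with tempfile.TemporaryDirectory() as tmp:
    for ext in ('.bin', '.idx'):
        open(os.path.join(tmp, 'train.complex-simple.complex' + ext), 'w').close()
    demo = check_split_files(tmp, 'train', 'complex', 'simple', 'complex')
    print(demo)
```

Calling the helper on the fairseq_preprocessed directory from the traceback would show whether the files on disk match the branch selected by self.args.raw_text.
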
angelo-megna94 commented 4 years ago

I did as you recommended and checked the various steps in detail; the relevant parts should be in bold:

> /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(158)load_dataset()
-> src, tgt = self.args.source_lang, self.args.target_lang
(Pdb) s

/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(159)load_dataset() -> if split_exists(split_k, src, tgt, src, datapath): (Pdb) s_ --Call-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(129)split_exists() -> def split_exists(split, src, tgt, lang, datapath): (Pdb) s_ /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(130)split_exists() -> filename = os.path.join(datapath, '{}.{}-{}.{}'.format(split, src, tgt, lang)) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(75)join() -> def join(a, *p): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(80)join() -> a = os.fspath(a) (Pdb) s_ /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(81)join() -> sep = _getsep(a) (Pdb) s_ --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(41)_get_sep() -> def _get_sep(path): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(42)_get_sep() -> if isinstance(path, bytes): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(45)_get_sep() -> return '/' (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(45)_get_sep()->'/' -> return '/' (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(82)join() -> path = a (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(83)join() -> try: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(84)join() -> if not p: (Pdb) s 
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(86)join() -> for b in map(os.fspath, p): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(87)join() -> if b.startswith(sep): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(89)join() -> elif not path or path.endswith(sep): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(92)join() -> path += sep + b (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(86)join() -> for b in map(os.fspath, p): (Pdb) s > /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(96)join() -> return path (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(96)join()->'/Users/angel...imple.complex' -> return path (Pdb) p path '/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed/train.complex-simple.complex' (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(131)split_exists() -> if self.args.raw_text and IndexedRawTextDataset.exists(filename): (Pdb) s --Call-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(198)exists() -> @staticmethod (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(200)exists() -> return os.path.exists(path) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(16)exists() -> def exists(path): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(18)exists() -> try: (Pdb) s 
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(19)exists() -> os.stat(path) (Pdb) s FileNotFoundError: [Errno 2] No such file or directory: '/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed/train.complex-simple.complex' /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(19)exists() -> os.stat(path) (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(20)exists() -> except OSError: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(21)exists() -> return False (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(21)exists()->False -> return False (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(200)exists()->False -> return os.path.exists(path) (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(133)split_exists() -> elif not self.args.raw_text and IndexedDataset.exists(filename): (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(135)split_exists() -> return False (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(135)split_exists()->False -> return False (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(161)load_dataset() -> elif split_exists(split_k, tgt, src, src, data_path): (Pdb) s --Call-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(129)split_exists() -> def split_exists(split, src, tgt, lang, data_path): (Pdb) s 
/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(130)split_exists() -> filename = os.path.join(data_path, '{}.{}-{}.{}'.format(split, src, tgt, lang)) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(75)join() -> def join(a, *p): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(80)join() -> a = os.fspath(a) (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(81)join() -> sep = _get_sep(a) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(41)_get_sep() -> def _get_sep(path): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(42)_get_sep() -> if isinstance(path, bytes): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(45)_get_sep() -> return '/' (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(45)_get_sep()->'/' -> return '/' (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(82)join() -> path = a (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(83)join() -> try: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(84)join() -> if not p: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(86)join() -> for b in map(os.fspath, p): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(87)join() -> if b.startswith(sep): (Pdb) s 
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(89)join() -> elif not path or path.endswith(sep): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(92)join() -> path += sep + b (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(86)join() -> for b in map(os.fspath, p): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(96)join() -> return path (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/posixpath.py(96)join()->'/Users/angel...mplex.complex' -> return path (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(131)split_exists() -> if self.args.raw_text and IndexedRawTextDataset.exists(filename): (Pdb) s --Call-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(198)exists() -> @staticmethod (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(200)exists() -> return os.path.exists(path) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(16)exists() -> def exists(path): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(18)exists() -> try: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(19)exists() -> os.stat(path) (Pdb) s FileNotFoundError: [Errno 2] No such file or directory: '/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed/train.simple-complex.complex' /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(19)exists() 
-> os.stat(path) (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(20)exists() -> except OSError: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(21)exists() -> return False (Pdb) s --Return-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/genericpath.py(21)exists()->False -> return False (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/data/indexed_dataset.py(200)exists()->False -> return os.path.exists(path) (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(133)split_exists() -> elif not self.args.raw_text and IndexedDataset.exists(filename): (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(135)split_exists() -> return False (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(135)split_exists()->False -> return False (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(164)load_dataset() -> if k > 0 or dk > 0: (Pdb) s /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(167)load_dataset() -> raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)) (Pdb) s FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed) /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(167)load_dataset() -> raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)) (Pdb) s --Return-- 
/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py(167)load_dataset()->None -> raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)) (Pdb) s FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed) /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py(479)load_dataset_splits() -> task.load_dataset(split, combine=True) (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py(479)load_dataset_splits()->None -> task.load_dataset(split, combine=True) (Pdb) s FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed) /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py(42)main() -> load_dataset_splits(task, ['train', 'valid']) (Pdb) s --Return-- /Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py(42)main()->None -> load_dataset_splits(task, ['train', 'valid']) (Pdb) s FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed) /Users/angelo/PycharmProjects/Prova/access/access/fairseq/base.py(175)fairseq_train() -> train.main(train_args) (Pdb) s --Call-- /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py(116)exit() -> def exit(self, type, value, traceback): (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py(117)exit() -> if type is None: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py(125)exit() -> if value is None: (Pdb) s 
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py(129)exit() -> try: (Pdb) s /usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py(130)exit() -> self.gen.throw(type, value, traceback) (Pdb) s --Call-- /Users/angelo/PycharmProjects/Prova/access/access/utils/helpers.py(177)log_stdout() -> yield (Pdb) s FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed) /Users/angelo/PycharmProjects/Prova/access/access/utils/helpers.py(155)log_stdout() -> @contextmanager (Pdb) s /Users/angelo/PycharmProjects/Prova/access/access/utils/helpers.py(179)log_stdout() -> sys.stdout = save_stdout (Pdb) s /Users/angelo/PycharmProjects/Prova/access/access/utils/helpers.py(180)log_stdout() -> log_file.close() (Pdb) s** --Return--Traceback (most recent call last): File "/Users/angelo/PycharmProjects/Prova/access/access/utils/helpers.py", line 155, in log_stdout @contextmanager File "/Users/angelo/PycharmProjects/Prova/access/access/fairseq/base.py", line 175, in fairseq_train train.main(train_args) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 42, in main load_dataset_splits(task, ['train', 'valid']) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq_cli/train.py", line 479, in load_dataset_splits task.load_dataset(split, combine=True) File "/Users/angelo/PycharmProjects/Prova/access/venv/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 167, in load_dataset raise FileNotFoundError('Dataset not found: {} ({})'.format(split, data_path)) FileNotFoundError: Dataset not found: train (/Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed)

I still don't understand what causes the error; the files are there, but it is as if fairseq did not see them.

If it is not a problem, I have another question: since I would like to try training on Italian, do you know how to adapt another dataset to this code? I would like to use PaCCSS-IT, but I can't find a way to make the PaCCSS-IT structure compatible with WikiLarge. It would be of great help to me.

Thanks again for your help.

louismartin commented 4 years ago

Can you please print, in pdb, the filename that fairseq tries to access? It is built as: filename = os.path.join(data_path, '{}.{}-{}.{}'.format(split, src, tgt, lang))

louismartin commented 4 years ago

Or print the filename as it is defined inside IndexedRawTextDataset.exists(). You can run a Python command directly in pdb, for example print(filename).
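The same check can also be reproduced outside pdb. The following is only a minimal sketch, not fairseq code: `candidate_filenames` and `report` are hypothetical helpers that mirror the '{}.{}-{}.{}' pattern shown above and try both language orders, as the trace shows load_dataset() doing. It assumes the raw-text case seen in the trace, where the path is checked directly with os.path.exists().

```python
import os


def candidate_filenames(data_path, split, src, tgt):
    """Hypothetical helper: build the two paths that the trace shows being probed.

    Mirrors the '{}.{}-{}.{}' pattern from split_exists(); this is not fairseq API.
    """
    pattern = '{}.{}-{}.{}'
    return [
        # First attempt: src-tgt order, e.g. train.complex-simple.complex
        os.path.join(data_path, pattern.format(split, src, tgt, src)),
        # Second attempt: tgt-src order, e.g. train.simple-complex.complex
        os.path.join(data_path, pattern.format(split, tgt, src, src)),
    ]


def report(data_path, split='train', src='complex', tgt='simple'):
    """Print each candidate path together with the result of os.path.exists()."""
    results = []
    for filename in candidate_filenames(data_path, split, src, tgt):
        found = os.path.exists(filename)
        print('{} -> {}'.format(filename, found))
        results.append((filename, found))
    return results
```

Calling report() with the fairseq_preprocessed directory from the error message prints both probed paths and whether Python can see them.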

louismartin commented 4 years ago

As for Italian, I think ACCESS won't adapt very well, because with the current state of the code some of the features can be computed for English only (DepTreeDepth and WordRank are available only for English).

Out of curiosity, can I ask what your end purpose is for Italian?

angelo-megna94 commented 4 years ago

This is what comes out: (Pdb) print (filename) /Users/angelo/PycharmProjects/Prova/access/resources/datasets/_2912c535c2343258d2e6375bca3e3a3d/fairseq_preprocessed/train.complex-simple.complex

I am working on my master's thesis. I was studying the current state of the art in TS and found your work. I find it really interesting, and it could be a useful point of reference for my own work.

louismartin commented 4 years ago

And when you run os.path.exists(filename) inside pdb, it returns False, but you are 100% sure the file exists on your disk?

Ok good luck, happy to help :)

angelo-megna94 commented 4 years ago

Yes, it is in the PycharmProjects folder. I also read this:

If you read the Python documentation of os.path.exists(), it says that there are specific cases in which a file or folder exists but os.path.exists() returns False:

Return True if path refers to an existing path or an open file descriptor. Returns False for broken symbolic links. On some platforms, this function may return False if permission is not granted to execute os.stat() on the requested file, even if the path physically exists.
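Because os.path.exists() swallows the underlying OSError, calling os.stat() directly shows which of these cases applies. A minimal sketch follows; `explain_missing` is a hypothetical helper name and its messages are illustrative only.

```python
import os


def explain_missing(path):
    """Describe why os.path.exists(path) might return False.

    Hypothetical helper: calls os.stat() directly so the exception that
    os.path.exists() would hide becomes visible.
    """
    try:
        os.stat(path)
    except FileNotFoundError:
        # os.path.islink() uses lstat(), so it is True for a dangling link.
        if os.path.islink(path):
            return 'broken symbolic link'
        return 'no such file or directory'
    except PermissionError:
        return 'permission denied while calling os.stat()'
    except OSError as error:
        return 'other OS error: {}'.format(error)
    return 'path exists'
```

Running this on the filename printed in pdb distinguishes a file that is really missing from a broken link or a permission problem.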

Any idea?

angelo-megna94 commented 4 years ago

I ran some tests with a quick script, and I see that calling os.path.exists(filename) on any file of the access project always returns False. I did the same thing from another quick project with a file that I have on my PC, and it returns True. So it's as if the whole project wasn't physically saved on the PC, and I don't know why, hahaha.
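When every file of a project appears to be missing, one way to narrow it down is to find how far along the path Python and the filesystem still agree. The sketch below is only an illustration under that assumption, with `first_missing_component` as a hypothetical helper name; it does not identify the cause in this thread.

```python
from pathlib import Path


def first_missing_component(path):
    """Return (deepest existing ancestor, first missing path) for `path`.

    Returns (path, None) when the whole path exists. Hypothetical helper
    for diagnosing which directory level is not visible to Python.
    """
    target = Path(path)
    if target.exists():
        return target, None
    missing = target
    # Walk upwards: parents are ordered from the nearest to the root.
    for parent in target.parents:
        if parent.exists():
            return parent, missing
        missing = parent
    return None, missing
```

If the first missing component is the project directory itself, the path being checked likely differs from where the project is actually stored.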

louismartin commented 4 years ago

As this problem is not related to ACCESS, I'll let you sort this out yourself, good luck!

angelo-megna94 commented 4 years ago

Solved all the problems! Thanks again!

louismartin commented 4 years ago

Great, closing the issue then.