EdinburghNLP / wmt17-scripts

Using these scripts with a PyTorch RNN for language translation #1

Closed liperrino closed 5 years ago

liperrino commented 5 years ago

Please, I would like to know how to use these scripts, or the results of these scripts, with a PyTorch RNN language translation model when training and evaluating.

pjwilliams commented 5 years ago

These scripts were written for use with the Nematus toolkit. Some could be used without change (or with minimal changes), such as the preprocessing and postprocessing scripts, but others would have to be rewritten.

liperrino commented 5 years ago

Thank you for the answer. I already knew that some of them were built for the Nematus toolkit. It is my fault for not being more specific: I am new to language translation and I have started with PyTorch. My problem is that I would like to know how to use the dictionary obtained after preprocessing with PyTorch, the way it is used for the Nematus toolkit by the train.sh and evaluate.sh scripts. The other problem is how to train my PyTorch encoder-decoder with attention so that it will not translate words that are tagged with a token, and then pass the resulting translation to Moses SMT so that only the tagged words are processed, for a better result. It would be great if you could help or guide me.

pjwilliams commented 5 years ago

I think you are referring to the JSON vocabulary dictionaries produced by Nematus' data/build_dictionary.py. Those are specific to Nematus, so if your toolkit requires an equivalent vocabulary dictionary then you will have to convert them to the appropriate format (or construct new ones from scratch based on the vocabulary contained in the preprocessed training data).
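
For illustration, a minimal sketch of such a conversion in Python, assuming the Nematus JSON file simply maps token strings to integer IDs (the file name and the UNK convention below are assumptions, not taken from these scripts):

import json

# Load a Nematus-style vocabulary dictionary (token -> integer ID);
# "corpus.bpe.en.json" is a hypothetical file name.
with open("corpus.bpe.en.json") as f:
    token_to_id = {tok: int(i) for tok, i in json.load(f).items()}

# Invert it so model output IDs can be mapped back to tokens.
id_to_token = {i: tok for tok, i in token_to_id.items()}

# A PyTorch pipeline can then turn a tokenized sentence into IDs,
# falling back to an unknown-word ID (convention assumed here).
unk_id = token_to_id.get("<UNK>", 1)
ids = [token_to_id.get(tok, unk_id) for tok in "ein Beispiel Satz".split()]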

Re your second question, I think this is somewhat tricky. I'm not aware of a straightforward general solution (though I could easily have missed one). Depending on what you're trying to do, it may be sufficient to edit your training data so that the tokens you want to leave untranslated are substituted with placeholder tokens (on both the source and target side) that can be restored after translation and then further processed with Moses. But generally there are no guarantees that the NMT decoder will produce the desired placeholder tokens. For that, I think you would need to do something like this paper proposes: http://aclweb.org/anthology/P17-1141
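
As a rough illustration of the placeholder idea (the tag format and placeholder tokens below are invented for the example; they are not part of these scripts):

import re

# Hypothetical convention: spans to leave untranslated are tagged <keep>...</keep>.
TAG = re.compile(r"<keep>(.*?)</keep>")

def mask(sentence, store):
    # Replace each tagged span with a numbered placeholder token.
    def repl(match):
        store.append(match.group(1))
        return "@ph%d@" % (len(store) - 1)
    return TAG.sub(repl, sentence)

def unmask(translation, store):
    # Restore the spans (or hand them to Moses for separate processing).
    for i, text in enumerate(store):
        translation = translation.replace("@ph%d@" % i, text)
    return translation

store = []
masked = mask("schicke es heute an <keep>liperrino</keep>", store)
# ... translate `masked` with the NMT model; as noted above, the model
# only copies the placeholder through reliably if it saw such tokens
# during training ...
print(unmask(masked, store))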

liperrino commented 5 years ago

Thanks a lot

liperrino commented 5 years ago

I have got this error when running evaluate.sh to evaluate the code from this repository. What is going on? This is the output:

usage: translate.py [-h] [-v] -m PATH [PATH ...] [-b INT] [-i PATH] [-o PATH] [-k INT] [-n [ALPHA]] [--n_best] [--maxibatch_size INT]
translate.py: error: unrecognized arguments: -p 1
Detokenizer Version $Revision: 4134 $
Language: en
Use of uninitialized value $length_reference in numeric eq (==) at /home/beyala/nematus//data/multi-bleu-detok.perl line 155.
BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)

liperrino commented 5 years ago

I have corrected the error, but here is the new output:

2018-11-23 09:39:56.009942: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Error: config file ./..//model/model.best-valid-script.json is missing
Detokenizer Version $Revision: 4134 $
Language: en
Use of uninitialized value $length_reference in numeric eq (==) at /home/beyala/nematus//data/multi-bleu-detok.perl line 155.
BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)

liperrino commented 5 years ago

Here is the new output error that I got after correcting some bugs:

INFO: Loading model parameters from file /home/beyala/wmt17-scripts/training/model/model
Traceback (most recent call last):
  File "/home/beyala/nematus//nematus/translate.py", line 69, in <module>
    main(settings)
  File "/home/beyala/nematus//nematus/translate.py", line 49, in main
    ensemble_scope=scope)
  File "/home/beyala/nematus/nematus/model_loader.py", line 93, in init_or_restore_variables
    saver.restore(sess, os.path.abspath(reload_filename))
  File "/home/beyala/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1538, in restore

rsennrich commented 5 years ago

You first need to train a model and let training run at least until the first checkpoint is saved (see --saveFreq) or, to use model.best-valid-script, until the first model has been validated (see --validFreq).

liperrino commented 5 years ago

I have trained it, but in the model folder I have only one file, named model.json. I was surprised that the training did not take more than 10 minutes. I am using an Intel Core i7. I do not know what is going on.

rsennrich commented 5 years ago

It's unlikely that you can train a model on a CPU in a reasonable time, except perhaps on tiny amounts of data. What is more likely is that your training crashed after 10 minutes for some reason; this should have produced some error message.

liperrino commented 5 years ago

That's true, but no error was displayed; instead it generated a model.json file containing the configuration for the training step. I have set the device variable to CPU; could that be the cause of such a short training run?

pjwilliams commented 5 years ago

Could you rerun the train.sh script? The output should contain some information about what went wrong.

liperrino commented 5 years ago

I have rerun the train.sh script and this is the end of the output:

INFO: Starting epoch 4960
INFO: Starting epoch 4961
INFO: Starting epoch 4962
[... one "INFO: Starting epoch N" line for every epoch in between ...]
INFO: Starting epoch 4998
INFO: Starting epoch 4999

pjwilliams commented 5 years ago

It looks like you are trying to train using empty data files, which suggests the previous preprocessing step went wrong. Could you try rerunning preprocess.sh and recording the output?
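
One quick way to confirm the empty-data suspicion (a minimal sketch; the paths assume the corpus.bpe files that this repository's preprocess.sh writes under data/):

import os

# Epochs that fly by in seconds usually mean the training files are empty.
for path in ["./../data/corpus.bpe.de", "./../data/corpus.bpe.en"]:
    with open(path) as f:
        n_lines = sum(1 for _ in f)
    print(path, os.path.getsize(path), "bytes,", n_lines, "lines")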

liperrino commented 5 years ago

OK, I will do it. Thanks.

liperrino commented 5 years ago

This is the output after running preprocess.sh:

Number of threads: 1
Tokenizer Version 1.1
Language: en
Number of threads: 1
clean-corpus.perl: processing ./../data/corpus.tok.de & .en to ./../data/corpus.tok.clean, cutoff 1-80, ratio 9
[... progress markers up to (5900000) omitted ...]
Input sentences: 5919142
Output sentences: 5852457
usage: learn_joint_bpe_and_vocab.py [-h] --input PATH [PATH ...] --output PATH [--symbols SYMBOLS] [--separator STR] --write-vocabulary PATH [PATH ...] [--min-frequency FREQ] [--total-symbols] [--verbose]
learn_joint_bpe_and_vocab.py: error: argument --input/-i is required
usage: apply_bpe.py [-h] [--input PATH] --codes PATH [--merges INT] [--output PATH] [--separator STR] [--vocabulary PATH] [--vocabulary-threshold INT] [--glossaries STR [STR ...]]
apply_bpe.py: error: argument --codes/-c is required
[... the same apply_bpe.py usage/error pair is repeated for each remaining apply_bpe.py invocation ...]
Processing ./../data/corpus.bpe.de
Done
Processing ./../data/corpus.bpe.en
Done

pjwilliams commented 5 years ago

Hmmm, that's strange. Looking at the first error:

learn_joint_bpe_and_vocab.py: error: argument --input/-i is required

I'm not sure how that can happen. The corresponding command in preprocess.sh should look like this:

$bpe_scripts/learn_joint_bpe_and_vocab.py -i $data_dir/corpus.tc.$src $data_dir/corpus.tc.$trg --write-vocabulary $data_dir/vocab.$src $data_dir/vocab.$trg -s $bpe_operations -o $model_dir/$src$trg.bpe

And clearly the -i argument is there. Have you edited the script in any way?

liperrino commented 5 years ago

No, of course not; I have not changed anything yet.

liperrino commented 5 years ago

I do not know what is going wrong with the code. I have cloned the repository again and made the changes in the vars file, and I get the same errors when running the preprocess.sh script.

liperrino commented 5 years ago

Please clone the repository and run it yourself to see what is going wrong.

liperrino commented 5 years ago

I have cloned it again, downloaded the data, preprocessed it, and started training. This is the new error coming from training:

INFO: Building model...
INFO: Initializing model parameters from scratch...
INFO: Done
INFO: Reading data...
INFO: Done
INFO: Initial uidx=0
INFO: Starting epoch 0
2018-11-30 07:59:48.108066: W tensorflow/core/framework/allocator.cc:122] Allocation of 1209883200 exceeds 10% of system memory.

rsennrich commented 5 years ago

This is not an error, just a warning, and it is quite normal to see it on a machine that does not have much memory. You can tell whether training is running successfully by whether you get occasional updates about training loss and speed.

liperrino commented 5 years ago

OK. Thanks very much.

abduljamil commented 5 years ago

@rsennrich, which version of TensorFlow should be used to convert the trained model to a frozen graph?

rsennrich commented 5 years ago

Do you refer to this functionality for merging the computation graph and model parameters into one file? If so, I have never tested it; the translation and scoring scripts work with the separate checkpoint files.
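
For reference, a generic TF 1.x freezing sketch, untested with Nematus as noted above; the checkpoint prefix and the output node name below are hypothetical, and you would need to find the real output node in the Nematus graph:

import tensorflow as tf

checkpoint = "model/model.best-valid-script"   # hypothetical prefix
output_nodes = ["decoder/output"]              # hypothetical node name

with tf.Session() as sess:
    # Rebuild the graph from the .meta file and restore the parameters.
    saver = tf.train.import_meta_graph(checkpoint + ".meta")
    saver.restore(sess, checkpoint)
    # Bake the variables into constants and write a single GraphDef file.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_nodes)
    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())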

abduljamil commented 5 years ago

@rsennrich, thank you for the response. How can I find out the architecture of the model (input shape, output shape)? If I select one language pair, for example en-de, there are so many files in the folder. Which files are actually required to convert the model to a frozen graph, or, to be precise, to Core ML?

rsennrich commented 5 years ago

The number of files is indeed large; this is partly because each directory contains multiple models, and partly because we release models for both the Theano and the TensorFlow branches of Nematus.

Each directory will contain several models (for ensembling).

Each model will also have its parameters in two formats.

Shared between them is a human-readable file that Nematus uses to read some architecture settings, and that points to the vocabulary files.

To do translation of raw text, you will also want to download the pre/postprocessing scripts and the BPE file.
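
To inspect parameter names and shapes without knowing the architecture in advance, one option is TensorFlow's checkpoint reader (a minimal sketch; the checkpoint prefix is hypothetical):

import tensorflow as tf

# List every variable stored in the checkpoint, with its shape.
for name, shape in tf.train.list_variables("model/model.best-valid-script"):
    print(name, shape)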

abduljamil commented 5 years ago

What version of TensorFlow was used for training these models? I am getting this error when converting them to Core ML:

ValueError: NodeDef mentions attr 'validate_shape' not in Op<name=Identity; signature=input:T -> output:T; attr=T:type>; NodeDef: {{node encoder/embedding/embeddings/Assign}}. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.)

rsennrich commented 5 years ago

I think 1.13.1 was installed at the time I created these.