Friendly ping to @d-ataman in a separate comment to make sure there is a notification. :)
Hi J0MA,
Thanks for your interest in the code! I updated the example scripts so you can use the same arguments for training/translation. No, you don't need any dependency on OpenNMT-py or the CharNMT repo (that was the version that implements the character LSTM in the encoder); the lmm repository should be sufficient to run the code. Please check the requirements file to make sure you have all the libraries installed.
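A minimal setup would be something like the following; the repository URL is a guess based on the usernames in this thread, and the exact name of the requirements file is an assumption:

```bash
# clone the code (URL assumed: the user is @d-ataman and the repo is called lmm)
git clone https://github.com/d-ataman/lmm.git
cd lmm

# install the dependencies listed in the requirements file mentioned above
pip install -r requirements.txt
```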
Note that the translation script (in translate/Translator.py) uses the hierarchical beam search algorithm, which does not support batch translation, so use the arguments given in the examples to run the translation. I have to warn you that this is quite slow.
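The shape of the (non-batched) translation command would be roughly as follows, assuming OpenNMT-py-style options; the model and file names are placeholders, and the actual flags should be taken from the example scripts:

```bash
# hypothetical invocation; check the example scripts for the real arguments
python translate.py -model model_checkpoint.pt \
                    -src test.bpe.en \
                    -output pred.txt \
                    -beam_size 5
```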
This is where you can find the data: https://wit3.fbk.eu/ — of course, you would need to convert it to txt. You should tokenize/lowercase/truecase both sides of the corpora, then apply BPE with 16000 merge rules to the source (EN) side; you can leave the target-language files in their original word format. Then run preprocess.sh in the examples directory: the code will automatically load the data, make subword batches for the source, and separate the target sentences into word/character-level batches.
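In shell terms, the overall flow is roughly the sketch below; file names are placeholders, and whatever paths or arguments preprocess.sh expects should be checked in the script itself:

```bash
# 1. tokenize / lowercase / truecase both sides (Moses scripts; concrete commands further down in this thread)
# 2. learn + apply BPE with 16000 merge rules on the English source only;
#    leave the target side as plain (tokenized/truecased) words
# 3. build the subword-level source batches and word/character-level target batches:
cd examples
bash preprocess.sh   # input/output paths are assumed to be configured inside the script
```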
Let me know if you encounter any problems!
Hi again, and thank you for your response!
A few follow-up questions:
Which tool should be used to learn and apply the BPE segmentation — subword-nmt, SentencePiece, etc.? I did notice that onmt.io.Wordbatch contains a split_bpe() method, but that doesn't seem like the right function. OpenNMT-py currently has an implementation of learning BPE, but it seems like it's not included in the lmm repository.
Thanks so much again for your help! Looking forward to getting this reproduction going. :)
Hi again @d-ataman !
Regarding the IWSLT datasets, are they also accessible from the WIT3 website? I am only able to see data going back to 2011 when I visit https://wit3.fbk.eu, and not all of those even have dev/test sets available (based on the description).
Thanks!
Hi @j0ma
Yes, you can download the data from WIT3. For preprocessing, you can use the Moses scripts:
Tokenization and Lowercasing: https://github.com/moses-smt/mosesdecoder/tree/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/tokenizer
Truecasing: https://github.com/moses-smt/mosesdecoder/tree/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/recaser
For subword segmentation I use the subword-nmt scripts: https://github.com/rsennrich/subword-nmt. For the English side you can use 16k merge rules; you don't need to segment the target side. (32k was used for the larger training data.)
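Put together, the commands with the Moses and subword-nmt tools above would look roughly like this; language codes, file names and the truecasing-model path are placeholders:

```bash
# tokenize the English source (repeat with the appropriate -l code for the target language)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en

# truecase (train a truecasing model first); lowercase.perl in the tokenizer
# directory can be used instead if the setup lowercases rather than truecases
perl mosesdecoder/scripts/recaser/train-truecaser.perl --model tc.model.en --corpus train.tok.en
perl mosesdecoder/scripts/recaser/truecase.perl --model tc.model.en < train.tok.en > train.tok.tc.en

# learn and apply BPE with 16k merge operations on the English side only
subword-nmt learn-bpe -s 16000 < train.tok.tc.en > bpe.codes.en
subword-nmt apply-bpe -c bpe.codes.en < train.tok.tc.en > train.bpe.en
subword-nmt apply-bpe -c bpe.codes.en < dev.tok.tc.en   > dev.bpe.en
```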
I also updated the example scripts for you.
Best wishes, Duygu
Thanks again for your help @d-ataman !
I've now got all the preprocessing done, and preprocess.sh runs successfully. However, I get several weird errors about tensor shapes / missing attributes (for details, see here).
Currently I'm using torchtext 0.2.1 (based on this) and have tried pytorch 0.3.1, 0.4.0 and 1.4.0, without any luck.
Therefore, I was wondering which pytorch (and torchtext) versions the codebase is based on?
Hi @j0ma ,
I think last time I ran the code it was with torch 0.4.1. Hope it works!
Duygu
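For reference, pinning the environment to those versions would be something like the line below; torch 0.4.1 comes from the comment above, while torchtext 0.2.1 is taken from the earlier question and was not confirmed here:

```bash
# versions per this thread; the torchtext pin is an assumption
pip install torch==0.4.1 torchtext==0.2.1
```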
Hi there,
I recently started going through the code in this repository after having read your paper, which I found very fascinating.
I would be very interested in trying to reproduce the results of the paper using the code from the repository. With that in mind, I had the following questions:
1. What version of the IWSLT evaluation data does the paper use? For instance, IWSLT 2019 seems to have used English -> Czech, but I'm not sure where to obtain the data for the other language pairs. (Found the answer to this on p.13 of the paper.)
2. Are the scripts in ./examples the intended way to run training and translation, i.e. can I use the arguments given there?
3. The onmt folder is very similar to OpenNMT-py version 0.1. Do I need to separately install OpenNMT-py to use this repository, or is this a fork + extension of OpenNMT-py? If I do, is v1.0 compatible?