MorinoseiMorizo / jparacrawl-finetune

An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/

Questions regarding fine-tuning JESC dataset #7

Closed. leminhyen2 closed this issue 3 years ago.

leminhyen2 commented 3 years ago

Hi, I have a few points from the research paper that I want to confirm, and also a few questions about the fine-tuning procedure with the JESC dataset.

From what I read:

If any of the above is wrong, I hope you can correct me. Also, some questions about the procedure:

MorinoseiMorizo commented 3 years ago

Thank you for the question.

Yes, your understanding is correct.

When fine-tuning, we checked validation perplexity every 100 iterations, and we stopped fine-tuning at 2000 iterations because we found that the validation perplexity had already converged (stabilized). After fine-tuning with JESC, the validation perplexity was 11.26 for En-Ja and 7.51 for Ja-En.

We did not do any special preprocessing for the JESC corpus. We just tokenized it into subwords and removed overly long sentences (more than 250 subwords), as described in the paper.
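
For illustration, a minimal Python sketch of that length check (not the exact script used here; it assumes the `sentencepiece` Python package and a trained SentencePiece model file, whose path below is a placeholder):

```python
import sentencepiece as spm

# Placeholder path: use the SentencePiece model that matches the pre-trained NMT model.
sp = spm.SentencePieceProcessor(model_file="spm.model")

MAX_SUBWORDS = 250

def too_long(sentence: str) -> bool:
    """Return True if the sentence tokenizes into more than MAX_SUBWORDS subwords."""
    pieces = sp.encode(sentence, out_type=str)  # list of subword strings
    return len(pieces) > MAX_SUBWORDS

print(too_long("I have an apple."))  # a short sentence like this would be kept
```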

leminhyen2 commented 3 years ago

Oh, the 250-subword filter is new to me. I probably missed that part in the research paper.

So if a tokenized sentence is _I_have_an_apple, then this sentence has 4 subwords, right? Is this 250-subword filter implemented in the "jparacrawl-finetune" repo, for example for the KFTT dataset? And do you remove long sentences for every other dataset mentioned in the paper too, like ASPEC and IWSLT? Last question: what happens if sentences longer than 250 subwords are not filtered out? Will the model still fine-tune, or will it get stuck?

MorinoseiMorizo commented 3 years ago

Yes, the example has 4 subwords.

I'm sorry, but this filtering is not implemented in this repo. I sometimes use this script (or, of course, you can implement it yourself): https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl We applied this cleaning to all datasets, including ASPEC and IWSLT.
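
For reference, a hedged example of how that Moses script is typically invoked from Python (the corpus prefixes are placeholders; it keeps only pairs whose length is between 1 and 250 tokens on both sides, so please double-check the arguments against the script itself):

```python
import subprocess

# Placeholder corpus prefix: expects train.ja and train.en (already subword-tokenized)
# and writes the kept pairs to train.clean.ja and train.clean.en.
subprocess.run(
    [
        "perl", "mosesdecoder/scripts/training/clean-corpus-n.perl",
        "train",         # input corpus prefix
        "ja", "en",      # language suffixes of the two sides
        "train.clean",   # output corpus prefix
        "1",             # minimum length (tokens)
        "250",           # maximum length (tokens)
    ],
    check=True,
)
```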

The motivation for removing longer sentences is twofold. First, longer sentences can be noisy or too difficult to learn. Second, there is a GPU memory issue: training on long sentences can cause an out-of-GPU-memory error. Thus, we removed overly long sentences, but I think skipping this step does not hurt training much, unless training gets stuck due to memory errors.

leminhyen2 commented 3 years ago

Hmm, I want to ask a bit more if you don't mind. So let's say I want to fine-tune the model for Japanese to English. In the dataset, do I apply the 250-subword filter on the Japanese side, the English (translation) side, or both?

Also, in the above comment, you mentioned that you check validation perplexity every 100 iterations. If possible, can you elaborate a bit more on how you did this? For example, do you compare some metrics and see if there is a good balance between them, do you run the checkpoint model on the validation dataset and look at the result, or something else?

MorinoseiMorizo commented 3 years ago

I removed a sentence pair if either the Japanese or the English sentence had more than 250 subwords.
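
If you implement the filter yourself instead of using the Moses script, a minimal sketch might look like this (it assumes both files are already tokenized into whitespace-separated subwords; the file names are placeholders):

```python
MAX_SUBWORDS = 250  # same limit on both sides

# Placeholder file names for the subword-tokenized parallel corpus.
with open("train.ja", encoding="utf-8") as f_ja, \
     open("train.en", encoding="utf-8") as f_en, \
     open("train.clean.ja", "w", encoding="utf-8") as out_ja, \
     open("train.clean.en", "w", encoding="utf-8") as out_en:
    for ja, en in zip(f_ja, f_en):
        # Keep the pair only if both sides are within the subword limit.
        if len(ja.split()) <= MAX_SUBWORDS and len(en.split()) <= MAX_SUBWORDS:
            out_ja.write(ja)
            out_en.write(en)
```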

For our training, we used the fairseq toolkit, which has an option to set the validation interval (--validate-interval-updates): https://fairseq.readthedocs.io/en/latest/command_line_tools.html I only used validation perplexity to select the best model (does this answer your question?).
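
For reference, a rough sketch of how those options could be passed to fairseq-train from Python. The data directory, architecture, checkpoint path, and other settings below are placeholders, not the exact configuration used for the paper; only the update and validation interval flags illustrate the point discussed above, so please follow the fine-tuning scripts in this repo for the real hyperparameters.

```python
import subprocess

# Placeholder paths and settings; see the repo's fine-tuning scripts for real values.
subprocess.run(
    [
        "fairseq-train", "data-bin/jesc",           # binarized JESC data (placeholder)
        "--arch", "transformer",                    # placeholder architecture
        "--restore-file", "pretrained_model.pt",    # JParaCrawl pre-trained checkpoint (placeholder)
        "--reset-optimizer", "--reset-dataloader", "--reset-meters",  # start fresh from the weights
        "--max-update", "2000",                     # stop after about 2000 updates
        "--validate-interval-updates", "100",       # run validation every 100 updates
        "--save-interval-updates", "100",           # save a checkpoint at each validation
        "--best-checkpoint-metric", "loss",         # lowest validation loss = lowest perplexity
    ],
    check=True,
)
```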

I'm happy to answer your question, and if you still have one, please let me know.

leminhyen2 commented 3 years ago

Ah, thank you so much. If I have any more questions, I'll ask them in this repo.