JoeTseHot opened this issue 2 years ago
That looks OK to me. You don't need to use the same sentence splitting and data pre-processing pipeline, except for the subword segmentation with the SentencePiece models (as you do above). It could happen that the data includes some characters that are not part of the vocabulary. That could be a slight problem, but I am not sure whether it will really happen. Make sure that you do not fine-tune for too long; otherwise you overfit to the fine-tuning data and start forgetting the pre-trained information.
In any case, don't retrain the SentencePiece models. If you do, the segmented data will most probably not match the vocab file and you will get unknowns.
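If you want to check this up front, one rough way to do it (just a sketch, not part of the OPUS-MT scripts; the model and data paths are placeholders) is to encode your fine-tuning data with the supplied SentencePiece model and count how many tokens come out as `<unk>`:

```python
import sentencepiece as spm

# Placeholder paths: the supplied source.spm and the raw fine-tuning data
sp = spm.SentencePieceProcessor(model_file="source.spm")
unk_id = sp.unk_id()

total = unknown = 0
with open("finetune.en", encoding="utf-8") as f:
    for line in f:
        ids = sp.encode(line.strip(), out_type=int)
        total += len(ids)
        unknown += sum(1 for i in ids if i == unk_id)

print(f"{unknown} of {total} tokens map to <unk> ({100 * unknown / max(total, 1):.3f}%)")
```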
@JoeTseHot Were you able to get the fine-tuning working? If so, do you have any advice/scripts I could use? I am hoping to fine-tune some models too and am having trouble with the fine-tuning process outlined in OPUS-MT-train. Thank you!
Hi, thank you for the great work and for your continued involvement with the community over the years.
I am preparing fine-tuning data for the en-zh and zh-en Helsinki models for use in Marian NMT.
I took a look at the scripts here and struggle to understand most of them; the fault is mine for skipping Perl.
I am thinking of preparing the fine-tune data in this way:
File 1:
English sentence 1, processed by the supplied spm (the source spm for the en-zh model) into something like this: ▁We ▁are ▁pre pro ▁cessed
English sentence 2 (processed in Python with spm.encode(sentence2, out_type=str))
English sentence 3
File 2:
zh sentence 1, processed by the target spm for the en-zh model
zh sentence 2
zh sentence 3
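Concretely, I am planning to produce the two files with something like this (a rough sketch; the .spm and data file names are placeholders for the models shipped with the en-zh release and my own corpus):

```python
import sentencepiece as spm

# Placeholder paths for the spm models shipped with the en-zh model and my raw data
sp_src = spm.SentencePieceProcessor(model_file="source.spm")
sp_tgt = spm.SentencePieceProcessor(model_file="target.spm")

def encode_file(sp, infile, outfile):
    # One space-joined sequence of subword pieces per input sentence
    with open(infile, encoding="utf-8") as fin, open(outfile, "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = sp.encode(line.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")

encode_file(sp_src, "raw.en", "file1.txt")  # English side
encode_file(sp_tgt, "raw.zh", "file2.txt")  # Chinese side
```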
I know that I am skipping the punctuation removal from the Perl script; if this is important I can do it in Python. I notice that this was (seemingly) not done at inference time in your excellent web server in the opus-mt repo.
I also know that I am skipping sentence normalization and punctuation normalization; I wonder if this is important.
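If it does matter, my plan would be to use the sacremoses port of the Moses normalization script, roughly like this (a sketch, not the exact pipeline from OPUS-MT-train):

```python
from sacremoses import MosesPunctNormalizer

# Normalize punctuation (quotes, dashes, extra spaces) before subword segmentation;
# the lang code would be set per side (en / zh)
mpn = MosesPunctNormalizer(lang="en")
print(mpn.normalize("“Curly quotes”  and  doubled  spaces get normalized"))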
I am probably skipping guided alignment, but I don't know how to do that without an .align file
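If guided alignment does matter, my understanding is that an external word aligner (e.g. fast_align or eflomal) can produce the .align file from the segmented parallel data; a rough sketch of preparing fast_align-style input (file names are placeholders):

```python
# Build the "source ||| target" input that fast_align expects; its Pharaoh-format
# output (lines like "0-0 1-2 ...") could then be passed to Marian's --guided-alignment.
with open("file1.txt", encoding="utf-8") as src, open("file2.txt", encoding="utf-8") as tgt, \
     open("corpus.en-zh", "w", encoding="utf-8") as out:
    for s, t in zip(src, tgt):
        out.write(f"{s.strip()} ||| {t.strip()}\n")
```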
Since the data have already been split into sentences, I don't need MosesSentenceSplitter (which produces some weird results for my corpus). For sentences that are too long (more than 128 English words, or Chinese words?) I will just drop them.
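For the length filter I plan on something as simple as this (a sketch; counting subword pieces in the already-encoded files, which is an assumption on my part):

```python
MAX_LEN = 128  # assumption: limit counted in subword pieces, not raw words

with open("file1.txt", encoding="utf-8") as f1, open("file2.txt", encoding="utf-8") as f2, \
     open("file1.filtered.txt", "w", encoding="utf-8") as o1, \
     open("file2.filtered.txt", "w", encoding="utf-8") as o2:
    for src, tgt in zip(f1, f2):
        # Drop the pair if either side is too long, to keep the files parallel
        if len(src.split()) <= MAX_LEN and len(tgt.split()) <= MAX_LEN:
            o1.write(src)
            o2.write(tgt)
```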
My questions are:
Does this setup look reasonable to you? Any comment before I commit to the lengthy training process is highly appreciated.
Should I re-train the supplied spm files? If so, should this be done before training in Marian? Could you please point me to one of the Perl training scripts so I can better understand how this is done?
Does this Marian command look reasonable to you?

```
marian \
  --train-sets file1.txt file2.txt \
  --vocabs opus+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.yml \
  --valid-sets valid-set-from-file1.txt valid-set-from-file2.txt \
  --valid-metrics cross-entropy translation \
  --pretrained-model enzh/opus+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
```
Thanks! Joe