Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Korean Finetuning #78

Open hdeval1 opened 2 years ago

hdeval1 commented 2 years ago

I was able to fine-tune the base Korean model using TMX data by editing the finetune recipe, but now I am having issues with the model. While changing the recipe, I found the filter-korean.sh script and replaced these steps:

python3 ../scripts/filter/bitext-match-lang.py -s $$s -t $$t | \
    grep --invert-match '[<>{}]' | \
    $(TOKENIZER)/replace-unicode-punctuation.perl |\
    perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
    sed 's/  */ /g;s/^ *//g;s/ *$$//g' |\
    shuf > ${TMX_DEV_BASE}.$$s-$$t.shuffled; \
    mkdir -p $$s-$$t/${TMXBASE}/dev; \

with the following:

/bin/bash ../scripts/filter/filter-korean.sh ${SRC} ${TRG} $$d > ${TMXBASE}.clean; \
    cat ${TMXBASE}.clean | \
    $(TOKENIZER)/replace-unicode-punctuation.perl |\
    perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
    sed 's/  */ /g;s/^ *//g;s/ *$$//g' |\
    shuf > ${TMXBASE}.$$s-$$t.shuffled; \
    mkdir -p $$s-$$t/${TMXBASE}/train; \
    mkdir -p $$s-$$t/${BASEMODELNAME}; \
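
For readers less familiar with the perl and sed one-liners in the recipe above, here is a minimal Python sketch of what those two steps appear to do to each line: delete every character outside the listed ranges, then collapse runs of spaces and trim leading and trailing spaces. The function name `clean_line` is hypothetical and not part of OPUS-MT-train, and the `replace-unicode-punctuation.perl` step is deliberately not reproduced here.

```python
import re

# Complement of the character set kept by the perl tr[...][]cd step:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_DISALLOWED = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_line(line: str) -> str:
    """Approximate the perl + sed cleanup for one line (no trailing newline)."""
    # perl -CS -pe 'tr[...][]cd;' : delete characters outside the kept ranges
    line = _DISALLOWED.sub("", line)
    # sed 's/  */ /g' : collapse runs of spaces into a single space
    line = re.sub(r" +", " ", line)
    # sed 's/^ *//g;s/ *$//g' : strip leading and trailing spaces only
    return line.strip(" ")
```

Note that, like the shell pipeline, this works on one line at a time, so the cleaned training data contains no line breaks inside a segment.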

That seemed to do the trick to kick off the tuning. However, with the new tuned model I am having a punctuation issue. If you send something like this (it would be in Korean, but for the sake of explanation I wrote it in English):

hello my name is heather. 
-heather is here to say hello,
*how are you today?

When the last character before a newline is punctuation and the next line starts with a punctuation mark followed directly by a character, the translation comes out incorrect and the punctuation and line breaks are off. I did notice that if you send each line individually, the translations come out correctly and no punctuation issues are present. It seems as though the spaces and punctuation cause the text to be interpreted as a single sentence, which affects the translation. I looked through the backlog and noticed there were some initial issues with Korean, so I figured I would ask whether you have any insight into what the issue may be.
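
The line-by-line workaround described above can be sketched roughly as follows. This is only an illustration: `translate_by_line` is a hypothetical helper, and the `translate` argument stands in for whatever call sends a single segment to the tuned model, which this issue does not specify.

```python
from typing import Callable, List


def translate_by_line(text: str, translate: Callable[[str], str]) -> str:
    """Translate each non-empty line separately and keep the line structure.

    `translate` is an assumed callable that translates one segment; it is a
    placeholder for the real model call.
    """
    out: List[str] = []
    for line in text.split("\n"):
        if line.strip():
            # Send one line at a time so punctuation at a line boundary
            # cannot merge with the neighbouring line.
            out.append(translate(line))
        else:
            # Preserve blank lines as they are.
            out.append(line)
    return "\n".join(out)
```

For example, `translate_by_line(text, my_model_call)` would call `my_model_call` once per non-empty line, where `my_model_call` is again a placeholder name.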

Thank you!