Closed avostryakov closed 1 year ago
Yes, I messed up the yaml-files somehow. I'll try to solve the issue ...
I tried to fix the vocab files. Could you please try again and let me know whether this now works as expected?
Now conversion works without errors. Thank you!
But when it translates, it continues to generate long rubbish sequences.
This is strange! I just downloaded the MarianNMT model and that behaves as expected:
echo "Onko se kohteliaisuus vai loukkaus?" | ./preprocess.sh fin source.spm | marian-decoder -c decoder.yml
...
[2021-11-06 22:19:33] Best translation 0 : ▁I s ▁it ▁a ▁compliment ▁or ▁an ▁insult ?
Could you verify that this also works for you? I haven't checked a converted model yet. It may be that the conversion script does not work well with the new version of the vocab file.
I think the problem is with the generation of stop tokens. The translation is correct at the beginning, but it doesn't stop.
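To compare the marian-decoder log line above with what a converted model returns, the SentencePiece pieces have to be joined back into plain text. A minimal sketch in plain Python, assuming the usual SentencePiece convention that "▁" marks a word boundary; the function name detokenize is made up for illustration and is not part of Marian or transformers:

```python
SPM_SPACE = "\u2581"  # "▁", the SentencePiece word-boundary marker


def detokenize(pieces_line: str) -> str:
    """Join space-separated SentencePiece pieces back into plain text.

    Hypothetical helper for illustration only.
    """
    pieces = pieces_line.split()          # split the log line into pieces
    text = "".join(pieces)                # pieces are concatenated without spaces
    return text.replace(SPM_SPACE, " ").strip()  # "▁" becomes a real space


print(detokenize("▁I s ▁it ▁a ▁compliment ▁or ▁an ▁insult ?"))
```

This only post-processes the decoder output; it does not touch the model or the vocab files.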
Together with the guys from huggingface we found the problem. The fin-eng model should now work correctly. Could you please double-check? For the new conversion you need to pull the latest version of the transformers library.
Wow, what a surprise that you fixed it! I see it's this fix in transformers: https://github.com/huggingface/transformers/commit/b48faae364d4eeac56c25c2fa9abb60599b96933
I tried to convert the latest fin-eng model (opusTCv20210807+bt-2021-08-25.zip) to PyTorch with your script:
python convert_marian_tatoeba_to_pytorch.py -m fin-eng
Everything works until the sentencepiece files are converted, which fails with the following error:
The problem is with these tokens in opusTCv20210807+bt.spm32k-spm32k.vocab.yml: "": 243 "": 244 "": 245 "": 246 "": 247 " ": 248 "": 249 "": 250 "": 251 "": 252 "": 253 "": 254 "": 255 "": 256 "": 257 "": 258 "": 259 "": 260 "": 261 "": 262 "": 263 "": 264 "": 265 "": 266 "": 267 "": 268 "": 269 "": 270 "": 271 "": 272 "": 273 "": 274
I can replace these tokens with token1, token2, token3, ... and the model then converts without issues, but later, when I translate something from Finnish to English with model.generate(**batch, num_beams=5), the output is rather strange:
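The placeholder workaround described here can be sketched in plain Python. This is only an illustration, assuming the vocab entries have already been read as (token, id) pairs; a list of pairs is used because repeated "" keys would collapse in a dict. The helper names is_unusable and patch_vocab are hypothetical, and, as this thread shows, the workaround only gets the conversion through and does not cure the runaway generation:

```python
import unicodedata


def is_unusable(token: str) -> bool:
    """True if the token is empty or has only whitespace/control characters."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("C")
        for ch in token
    )


def patch_vocab(entries):
    """Replace unusable tokens with token1, token2, ... and keep the ids.

    entries: list of (token, id) pairs. Hypothetical helper; it does not
    check whether a placeholder collides with an existing vocab entry.
    """
    patched, counter = [], 0
    for token, idx in entries:
        if is_unusable(token):
            counter += 1
            token = f"token{counter}"
        patched.append((token, idx))
    return patched


sample = [("▁the", 242), ("", 243), ("", 244), (" ", 248)]
print(patch_vocab(sample))
```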
For: "Onko se kohteliaisuus vai loukkaus?" translation will be: "Is it a compliment or an insult?ment...?.::..?, is that a compliment, or is it an insult, or a compliment?.........?s.........that's not a compliment., that's an insult....and it's a compliment!....?.?..??,??.., or an offense?......... and...?............?.... it'? [..] ? -, ow. - ow ow... owt. [.. ow?]..... a compliment...... a compliments?? ow........ an insult......?!....-... ow-....??.??. ow,?..?-w.--. [? -?...??,-? a...,... [??? [.??: owing?-- ows??....?.,? "? is? [?...-?...?...?]?...? [ ow... is ow:? the...--? ]? a...? ou.?]? a. "? "? :? is? " ".:?)? -? ] :, "..., is? -??]? ;? and? (? )? ]? *? is?? "?,, that?? }? #? in? :? ) - }.? ". -, oh? ; . ou? '?"
The correct translation is "Is it a compliment or an insult?" (as produced by the previous fin-eng model). So it looks like the model starts translating correctly but cannot stop generating output. It's a real problem.
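One way to confirm that the output never terminates is to check whether the end-of-sequence id ever shows up in the generated ids. A rough sketch in plain Python, working on a list of ids; the function name and the id values in the example are invented, and in transformers the real id would presumably come from model.config.eos_token_id:

```python
def truncate_at_eos(token_ids, eos_id):
    """Return (ids before the first EOS, whether an EOS was found).

    Hypothetical diagnostic helper, not part of any library.
    """
    if eos_id in token_ids:
        cut = token_ids.index(eos_id)     # position of the first EOS
        return list(token_ids[:cut]), True
    return list(token_ids), False         # no EOS: generation never stopped


# Made-up ids: 0 stands in for the EOS id here.
ids, stopped = truncate_at_eos([5, 6, 7, 0, 8, 8], eos_id=0)
print(ids, stopped)
```

If the flag comes back False for every sentence, generation is running to the length limit, which would match the behaviour described above.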
Mac OS 10.14, python 3.8.9, pytorch 1.8.1, transformers 4.8.2