Special characters are not allowed when converting the last fin-eng model to pytorch

avostryakov commented 3 years ago

I am trying to convert the latest fin-eng model (opusTCv20210807+bt-2021-08-25.zip) to PyTorch with your script:

python convert_marian_tatoeba_to_pytorch.py -m fin-eng

Everything works until the SentencePiece file conversion, which fails with the following error:

....
 File "/Users/vostryakov/projects/nmt/convert_marian_tatoeba_to_pytorch.py", line 1283, in <module>
    resolver.convert_models(args.models[0])
  File "/Users/vostryakov/projects/nmt/convert_marian_tatoeba_to_pytorch.py", line 128, in convert_models
    converted_paths = convert_all_sentencepiece_models(entries_to_convert, dest_dir=self.model_card_dir)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/transformers/models/marian/convert_marian_to_pytorch.py", line 306, in convert_all_sentencepiece_models
    convert(save_dir / k, dest_dir / f"opus-mt-{pair_name}")
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/transformers/models/marian/convert_marian_to_pytorch.py", line 586, in convert
    add_special_tokens_to_vocab(source_dir)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/transformers/models/marian/convert_marian_to_pytorch.py", line 384, in add_special_tokens_to_vocab
    vocab = load_yaml(find_vocab_file(model_dir))
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/transformers/models/marian/convert_marian_to_pytorch.py", line 607, in load_yaml
    return yaml.load(f, Loader=yaml.BaseLoader)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/loader.py", line 14, in __init__
    Reader.__init__(self, stream)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/reader.py", line 135, in determine_encoding
    self.update(1)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/reader.py", line 169, in update
    self.check_printable(data)
  File "/Users/vostryakov/projects/env3.7/ds-vqCpqeZ3-py3.8/lib/python3.8/site-packages/yaml/reader.py", line 143, in check_printable
    raise ReaderError(self.name, position, ord(character),
yaml.reader.ReaderError: unacceptable character #x0080: special characters are not allowed
  in "marian_ckpt/fin-eng/opusTCv20210807+bt.spm32k-spm32k.vocab.yml", position 2357

The problem is with these tokens in opusTCv20210807+bt.spm32k-spm32k.vocab.yml: "": 243 "": 244 "": 245 "": 246 "": 247 "": 248 "": 249 "": 250 "": 251 "": 252 "": 253 "": 254 "": 255 "": 256 "": 257 "": 258 "": 259 "": 260 "": 261 "": 262 "": 263 "": 264 "": 265 "": 266 "": 267 "": 268 "": 269 "": 270 "": 271 "": 272 "": 273 "": 274
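To locate such characters before running the conversion, something like the following sketch could help. It scans text for characters outside the range a YAML reader treats as printable. The regular expression is my approximation of that check, not code taken from the conversion script, and `find_unprintable` and `report` are hypothetical helper names.

```python
import re

# Approximation of the "printable" character set a YAML reader accepts
# (an assumption, modelled on the YAML 1.1 definition). Anything outside
# this set, such as #x0080, is reported.
NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_unprintable(text):
    """Return (character offset, code point) for every rejected character."""
    return [(m.start(), ord(m.group())) for m in NON_PRINTABLE.finditer(text)]


def report(path):
    """Hypothetical helper: print the offending characters in a vocab file."""
    with open(path, encoding="utf-8") as f:
        for pos, cp in find_unprintable(f.read()):
            print(f"offset {pos}: #x{cp:04x}")


# Small in-memory check: one control character between two letters.
print(find_unprintable("a\x80b"))
```

Calling `report(...)` on the vocab file path from the traceback should list every offending entry, assuming the file is UTF-8 encoded.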

I can replace these tokens with token1, token2, token3, ... and then the model converts without issues. But later, when I translate something from Finnish to English with model.generate(**batch, num_beams=5), the output is rather strange:

For: "Onko se kohteliaisuus vai loukkaus?" translation will be: "Is it a compliment or an insult?ment...?.::..?, is that a compliment, or is it an insult, or a compliment?.........?s.........that's not a compliment., that's an insult....and it's a compliment!....?.?..??,??.., or an offense?......... and...?............?.... it'? [..] ? -, ow. - ow ow... owt. [.. ow?]..... a compliment...... a compliments?? ow........ an insult......?!....-... ow-....??.??. ow,?..?-w.--. [? -?...??,-? a...,... [??? [.??: owing?-- ows??....?.,? "? is? [?...-?...?...?]?...? [ ow... is ow:? the...--? ]? a...? ou.?]? a. "? "? :? is? " ".:?)? -? ] :, "..., is? -??]? ;? and? (? )? ]? *? is?? "?,, that?? }? #? in? :? ) - }.? ". -, oh? ; . ou? '?"

The correct translation is "Is it a compliment or an insult?" (as produced by the previous fin-eng model). So it looks like the model starts translating correctly but cannot stop generating output. It's a real problem.

macOS 10.14, Python 3.8.9, PyTorch 1.8.1, transformers 4.8.2

jorgtied commented 3 years ago

Yes, I messed up the yaml-files somehow. I'll try to solve the issue ...

jorgtied commented 3 years ago

I tried to fix the vocab files. Could you, please, try again and let me know whether this now works as expected?

avostryakov commented 3 years ago

> I tried to fix the vocab files. Could you, please, try again and let me know whether this now works as expected?

Now conversion works without errors. Thank you!

But when it translates, it continues to generate long rubbish sequences.

jorgtied commented 3 years ago

This is strange! I just downloaded the MarianNMT model and that behaves as expected:

echo  "Onko se kohteliaisuus vai loukkaus?" | ./preprocess.sh fin source.spm | marian-decoder -c decoder.yml
...
[2021-11-06 22:19:33] Best translation 0 : ▁I s ▁it ▁a ▁compliment ▁or ▁an ▁insult ?

Could you verify that this also works for you? I didn't check a converted model yet. It may be something with the conversion script that does not work well together with the new version of the vocab file ....
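For comparing this decoder output with the output of a converted model, the pieces can be joined back into plain text. This is a minimal sketch that assumes the usual SentencePiece convention where "▁" (U+2581) marks the start of a word; `detokenize` is a hypothetical helper name, not part of Marian or transformers.

```python
def detokenize(pieces_line):
    """Join space-separated SentencePiece pieces back into plain text."""
    pieces = pieces_line.split()
    # Concatenate the pieces, then turn each word-boundary marker into a space.
    return "".join(pieces).replace("\u2581", " ").strip()


print(detokenize("▁I s ▁it ▁a ▁compliment ▁or ▁an ▁insult ?"))
# prints: Is it a compliment or an insult?
```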

avostryakov commented 3 years ago

I think the problem is with the generation of stop tokens. The translation is correct at the beginning, but generation doesn't stop.
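A quick way to test this hypothesis is to check whether the end-of-sequence id ever appears in the generated ids. The sketch below works on plain lists of ints. The ids in the example are made up, and `truncate_at_eos` is a hypothetical helper, not a transformers function.

```python
def truncate_at_eos(token_ids, eos_token_id):
    """Return (ids before the first EOS, whether an EOS was found)."""
    ids = list(token_ids)
    if eos_token_id in ids:
        return ids[: ids.index(eos_token_id)], True
    return ids, False


# Made-up ids: 0 plays the role of the EOS id here.
print(truncate_at_eos([5, 9, 0, 7, 7], eos_token_id=0))
print(truncate_at_eos([5, 9, 7, 7, 7], eos_token_id=0))
```

With a converted model, the ids would come from one row of the model.generate(...) output and the EOS id from model.config.eos_token_id (assuming the converted config sets it). If no EOS is ever found, the sequence most likely just runs until the length limit.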

jorgtied commented 3 years ago

Together with the guys from Hugging Face we found the problem. The fin-eng model should now work correctly. Could you, please, double check? For the new conversion you need to pull the latest version of the transformers library.

avostryakov commented 3 years ago

Wow, what a surprise that you fixed it! I see it's this fix in transformers: https://github.com/huggingface/transformers/commit/b48faae364d4eeac56c25c2fa9abb60599b96933

Helsinki-NLP / Tatoeba-Challenge

Special characters are not allowed when converting the last fin-eng model to pytorch #16