Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

How to get vocab.yml file when doing train->eval->dist #27

Closed. orendar closed this issue 3 years ago

orendar commented 3 years ago

Hey,

First of all I wanted to thank you for this amazing project.

I followed the instructions in the repo, set up the environment, and can run train and eval as instructed without any problems. release does not work for me, so I tried dist instead; that worked and packaged the model into a zip. However, the Hugging Face script for converting Marian models to PyTorch requires a vocab.yml file. That file is present in all the pretrained OPUS-MT models but not in my zip, which only contains src.vocab and trg.vocab.

Could you please explain to me how to get the vocab.yml file, and whether it is done using any make commands or manually?

Thanks, Best, Oren

orendar commented 3 years ago

Never mind, I saw someone else's comment about marian-vocab. It might be useful to add this to the documentation somewhere. Thanks!

orendar commented 3 years ago

Actually, I'm going to reopen this as I'm still confused. I successfully created a vocab.yml file by concatenating the source and target vocabs and passing the result to marian-vocab, but when I try to convert my packaged model to Hugging Face I get an error:

Original vocab size {opus_state.cfg['vocab_size']} and new vocab size {len(tokenizer.encoder)} mismatched
AssertionError: Original vocab size 32001 and new vocab size 61724 mismatched

Is there something I'm missing here? Should I ask this question over at Hugging Face?
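One plausible reading of the numbers above is that concatenating src.vocab and trg.vocab yields roughly the union of two separate vocabularies, which is larger than the single vocabulary size the model was trained with. The following is a minimal sketch for checking that before running a conversion. It assumes each vocab file holds one token per line, optionally followed by a tab and a score; that layout, and the helper names `read_vocab_tokens`, `merged_size`, and `sizes_match`, are illustrative assumptions and not part of OPUS-MT-train, Marian, or the Hugging Face converter.

```python
def read_vocab_tokens(lines):
    """Return the token (first tab-separated column) of each non-empty line.

    Assumption: one token per line, optionally followed by a tab and a score.
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        tokens.append(line.split("\t")[0])
    return tokens


def merged_size(src_lines, trg_lines):
    """Number of distinct tokens after merging source and target vocabs."""
    return len(set(read_vocab_tokens(src_lines)) | set(read_vocab_tokens(trg_lines)))


def sizes_match(src_lines, trg_lines, model_vocab_size):
    """Compare the merged vocab size with the size the model expects."""
    merged = merged_size(src_lines, trg_lines)
    return merged == model_vocab_size, merged


# Toy data standing in for src.vocab and trg.vocab (hypothetical contents).
src = ["▁the\t-2.1", "▁cat\t-3.0", "</s>\t0"]
trg = ["▁die\t-2.0", "▁cat\t-3.2", "</s>\t0"]

ok, merged = sizes_match(src, trg, model_vocab_size=3)
print(ok, merged)  # False 4: the merged vocab has more entries than the model expects
```

With real files, the lines could be read with `open(path, encoding="utf-8")` and the expected size taken from the model configuration. A merged size well above the model's size would suggest the same kind of mismatch as the 32001 versus 61724 reported in the error.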

orendar commented 3 years ago

I looked at the commits and figured out that the old YAML vocab can be produced by setting the USE_SPM_VOCAB=0 flag when creating the data. Does that mean that, for now, I should train all models with that flag if I want to port them over to Hugging Face?

jorgtied commented 3 years ago

Yes, sorry, this is what you need for now, but I will talk to the people at Hugging Face about also supporting the plain-text vocab files that are taken from the SentencePiece models. It's a bit of a moving target.
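As a rough illustration of the workaround discussed here, the sketch below builds the make invocations with USE_SPM_VOCAB=0 set for each step. Only the USE_SPM_VOCAB flag and the train, eval, and dist targets are taken from this thread. The `data` target name, the `SRCLANGS` and `TRGLANGS` variables, and the en/de language pair are assumptions made for the example, so check the repository's documentation for the exact names. The helper `build_make_command` is hypothetical and only assembles the command; it does not run anything.

```python
import shlex


def build_make_command(target, src="en", trg="de", use_spm_vocab=False):
    """Assemble a make command line as a list of arguments.

    Assumption: SRCLANGS/TRGLANGS select the language pair; verify these
    variable names against the OPUS-MT-train documentation before use.
    """
    return [
        "make",
        f"SRCLANGS={src}",
        f"TRGLANGS={trg}",
        f"USE_SPM_VOCAB={int(use_spm_vocab)}",  # 0 requests the old YAML vocab
        target,
    ]


# Print the commands for each step so they can be reviewed before running.
for target in ("data", "train", "eval", "dist"):
    print(shlex.join(build_make_command(target)))
```

Passing the same flag to every step keeps the data preparation and the packaging consistent. To actually run a step, the argument list could be handed to `subprocess.run(..., check=True)` from the repository root.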

orendar commented 3 years ago

Great, thanks - happy to follow up with them or help with the porting if I can. Feel free to leave this issue up or close it if you think it's not directly relevant for this project, up to you.

jorgtied commented 3 years ago

Changed it now to have USE_SPM_VOCAB=0 as default. Seems more backward compatible with everything ...