Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

SentencePiece models training #22

Open avostryakov opened 3 years ago

avostryakov commented 3 years ago

Hi!

Can you be so kind as to answer a question about SentencePiece (spm) model training? You train separate models for the source and target languages (on the source and target sentences, respectively), but you have only one vocabulary file, and I don't understand that part. Here I see a recommendation to train one spm model and one vocabulary file: https://github.com/google/sentencepiece#vocabulary-restriction. In any case, how do you create one vocabulary file for two spm models?

avostryakov commented 3 years ago

OK, I found that you create two SentencePiece models, one for the source language and one for the target language, and that the vocab file is created with the marian-vocab utility on a concatenation of the source and target texts. So the only remaining question is: did you try to create a combined SentencePiece model? That is what both SentencePiece and subword-nmt recommend for MT systems.
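For reference, a rough sketch of that setup using the SentencePiece Python API — file names, vocab sizes, and paths are placeholders rather than the values actually used in OPUS-MT-train, and it assumes marian-vocab is on the PATH:

```python
import subprocess
import sentencepiece as spm

# Two language-specific SentencePiece models, trained independently
# (corpus names and vocab sizes are illustrative placeholders).
spm.SentencePieceTrainer.train(
    input="train.src", model_prefix="source", vocab_size=32000)
spm.SentencePieceTrainer.train(
    input="train.trg", model_prefix="target", vocab_size=32000)

# One shared Marian vocabulary, built by piping the concatenated
# (SentencePiece-segmented) source and target training text through
# marian-vocab, which reads plain text on stdin and writes YAML on stdout.
with open("train.concat.sp", "rb") as inp, open("vocab.yml", "wb") as out:
    subprocess.run(["marian-vocab"], stdin=inp, stdout=out, check=True)
```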

jorgtied commented 3 years ago

Yes, I did some experiments with joint SentencePiece models, and the main reason I moved away from that is that I didn't want to throw very different languages and writing systems into one model. For me, it makes intuitively more sense to give each language the same capacity in terms of vocabulary items rather than risking that statistics from one language dominate when creating the models. I don't really have empirical proof, and in the end it may not make such a big difference anyway.

But also consider that for some languages, like Chinese, the situation is quite different than for languages with alphabetic writing systems, and the SentencePiece settings may differ slightly to better fit the properties of the character set. I do have a heuristic that slightly adjusts the SentencePiece training parameters depending on the size of the observed character set in the training data. Does that explain the decision well enough? I'd be interested in more systematic studies showing the advantages and disadvantages of the different approaches ...
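As an illustration of that kind of heuristic (the actual thresholds and parameters in OPUS-MT-train may well differ), one could count the distinct characters in the training data and relax character_coverage for large character sets, roughly like this:

```python
import sentencepiece as spm

def train_spm(corpus_path: str, model_prefix: str, vocab_size: int = 32000) -> None:
    """Train a SentencePiece model, loosening character_coverage for corpora
    with large character sets (e.g. Chinese). Thresholds are illustrative only."""
    charset = set()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            charset.update(line.rstrip("\n"))
    # SentencePiece's own docs suggest ~0.9995 for languages with rich
    # character sets and 1.0 for small alphabets.
    coverage = 0.9995 if len(charset) > 1000 else 1.0
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        character_coverage=coverage,
    )

# One model per language side, each with the same vocabulary capacity.
train_spm("train.src", "source")
train_spm("train.trg", "target")
```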

avostryakov commented 3 years ago

Yes, that's clear enough, especially for really different languages. I was just thinking about one glossary use case where a combined SentencePiece model could be helpful. Imagine that I want to correct the translation of certain phrases, so I replace those phrases with the correct target-language translations (from the glossary) right before they reach the translation model. The model then sees a mixture of languages as input. fairseq models are completely fine with this and don't touch the "translated" phrases (they just copy them to the output). Maybe I'm wrong, but a combined SentencePiece model might work better in these situations.
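A minimal sketch of that pre-substitution idea — the glossary entries, sentence, and helper function are hypothetical, and the MT model itself is left out:

```python
# Hypothetical glossary pre-substitution: source phrases are replaced with
# fixed target-language translations before the sentence reaches the MT
# model, which is then expected to copy them through unchanged.
GLOSSARY = {
    "machine translation": "maschinelle Übersetzung",  # en -> de, made-up entry
}

def apply_glossary(sentence: str) -> str:
    for src_phrase, trg_phrase in GLOSSARY.items():
        sentence = sentence.replace(src_phrase, trg_phrase)
    return sentence

mixed_input = apply_glossary("OPUS-MT provides open machine translation models.")
# mixed_input now mixes English and German; the question above is whether a
# joint SentencePiece model would segment (and copy) such mixed input more
# reliably than two separate per-language models.
```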

avostryakov commented 3 years ago

Sometimes an input sentence naturally contains two languages.