Sentence splitter non-breaking prefixes file

browsermt / bergamot-translator

Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.

http://browser.mt

Mozilla Public License 2.0


Sentence splitter non-breaking prefixes file #104

Open kpu opened 3 years ago

kpu commented 3 years ago

The non-breaking prefixes file for the sentence splitter depends on the source language. We should bind it to the model somehow (e.g. by knowing which language the model translates from). Otherwise the splitter will produce sentence boundaries that differ from the ones used in training, and the model will be confused by input that doesn't match what it was trained on.
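To illustrate why the prefix list has to match the source language, here is a minimal sketch of a prefix-aware splitter. It is not the splitter used in bergamot-translator: `prefixTable`, `splitSentences`, and the tiny prefix lists are hypothetical and only meant to show that the same text splits differently depending on which language's prefixes are loaded.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using PrefixSet = std::set<std::string>;

// Hypothetical, heavily abbreviated per-language prefix lists.
// Real non-breaking prefix files contain many more entries.
inline const std::map<std::string, PrefixSet>& prefixTable() {
  static const std::map<std::string, PrefixSet> table = {
      {"en", {"Mr", "Dr", "etc"}},
      {"de", {"Hr", "Dr", "bzw"}},
  };
  return table;
}

// Naive splitter (an assumption for illustration): break after ". " unless
// the word right before the period is a non-breaking prefix.
std::vector<std::string> splitSentences(const std::string& text,
                                        const PrefixSet& prefixes) {
  std::vector<std::string> sentences;
  std::size_t start = 0;
  for (std::size_t i = 0; i + 1 < text.size(); ++i) {
    if (text[i] != '.' || text[i + 1] != ' ') continue;
    // Locate the word immediately before the period, within this sentence.
    std::size_t wordBegin = text.rfind(' ', i);
    wordBegin = (wordBegin == std::string::npos || wordBegin < start)
                    ? start
                    : wordBegin + 1;
    const std::string word = text.substr(wordBegin, i - wordBegin);
    if (prefixes.count(word) > 0) continue;  // e.g. "Mr." is not a boundary
    sentences.push_back(text.substr(start, i + 1 - start));
    start = i + 2;  // skip the period and the following space
  }
  if (start < text.size()) sentences.push_back(text.substr(start));
  return sentences;
}
```

With the text `"Mr. Smith arrived. He sat down."`, the hypothetical `en` list keeps `"Mr. Smith arrived."` together as one sentence, while the `de` list breaks right after `"Mr."`. That extra fragment is the kind of train/test mismatch described above.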

I'm beginning to think we should have a unified binary file like @XapaJIaMnu was suggesting.

kpu commented 3 years ago

This is a problem. We're not consistent between training and test. We're also giving Mozilla the impression that this file doesn't exist when it needs to, which will bite us later.

kpu commented 3 years ago

The cleanest solution is probably to ship the file with the MT models. Or (and this is crazy) stuff it in the YAML somehow.
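Either way, the prefix list ends up as text that travels with the model, so the loader only needs to parse it from a stream or a string. The following is a rough sketch under two assumptions: the file uses the Moses-style format (one prefix per line, `#` comment lines, an optional `#NUMERIC_ONLY#` marker), and the names `PrefixList`, `parsePrefixStream`, and `parsePrefixText` are made up for this example rather than taken from the project.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

struct PrefixList {
  std::set<std::string> always;       // never break after these
  std::set<std::string> numericOnly;  // non-breaking only before a number
};

// Parse Moses-style prefix data from any stream (a file shipped next to the
// model, or text embedded in a config). Leading whitespace is not trimmed.
PrefixList parsePrefixStream(std::istream& in) {
  static const std::string kNumeric = "#NUMERIC_ONLY#";
  PrefixList list;
  std::string line;
  while (std::getline(in, line)) {
    // Drop trailing whitespace, including '\r' from Windows line endings.
    const std::size_t end = line.find_last_not_of(" \t\r");
    if (end == std::string::npos) continue;  // blank line
    line.erase(end + 1);
    if (line[0] == '#') continue;  // comment line
    const std::size_t marker = line.find(kNumeric);
    if (marker == std::string::npos) {
      list.always.insert(line);
      continue;
    }
    // Entry such as "No #NUMERIC_ONLY#": keep only the prefix itself.
    const std::string prefix = line.substr(0, marker);
    const std::size_t prefixEnd = prefix.find_last_not_of(" \t");
    if (prefixEnd != std::string::npos) {
      list.numericOnly.insert(prefix.substr(0, prefixEnd + 1));
    }
  }
  return list;
}

// Convenience wrapper for prefix data that was embedded as a string.
PrefixList parsePrefixText(const std::string& text) {
  std::istringstream in(text);
  return parsePrefixStream(in);
}
```

Because the parser takes a `std::istream`, the same code could serve a file shipped alongside the model (via `std::ifstream`) or a block of text pulled out of the config, so the choice between the two options would not have to affect the splitter itself.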