Open · kpu opened this issue 3 years ago
This is a problem. We're not consistent between training and test. We're also giving Mozilla the impression that this file doesn't exist when in fact it is required, which will bite us later.
The cleanest solution is probably to ship the file with the MT models. Or (and this is crazy) stuff it in the YAML somehow.
The non-breaking prefixes file for the sentence splitter depends on the source language. We should bind it to the model somehow (i.e. by knowing what language the model translates from). Otherwise the splitter will cut sentences differently than it did in training, and the model will be confused by input segmented in a way it never saw.
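To make the mismatch concrete, here is a rough sketch, not the actual splitter. The prefix lists, the `MODEL_CONFIG` dict, and the `source_lang` key are all made up for illustration. It only shows that picking the prefix list from the model's source language changes where sentences get cut.

```python
# Hypothetical, tiny prefix lists; real non-breaking prefix files are much longer.
NONBREAKING_PREFIXES = {
    "en": {"Mr", "Mrs", "Dr"},
    "de": {"Hr", "Fr", "Dr"},
}

# Hypothetical model metadata; the "source_lang" key is an assumption.
MODEL_CONFIG = {"source_lang": "en"}


def split_sentences(text, source_lang):
    """Naive whitespace-based splitter that respects non-breaking prefixes."""
    prefixes = NONBREAKING_PREFIXES[source_lang]
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")):
            # A period after a known prefix (e.g. "Mr.") does not end a sentence.
            if token.endswith(".") and token[:-1] in prefixes:
                continue
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def split_for_model(text, model_config):
    """Pick the prefix list from the model's source language, not from a global default."""
    return split_sentences(text, model_config["source_lang"])


text = "Mr. Smith arrived. He sat down."
print(split_for_model(text, MODEL_CONFIG))  # split with the English prefixes
print(split_sentences(text, "de"))          # same text, wrong prefix list
```

With the English list, "Mr." stays attached to its sentence. With the German list, "Mr." becomes a sentence of its own, which is exactly the kind of input the model never saw in training.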
I'm beginning to think we should have a unified binary file like @XapaJIaMnu was suggesting.