browsermt / students

Efficient teacher-student models and scripts to make them

Include non-breaking prefixes file for source language #35

Open kpu opened 3 years ago

kpu commented 3 years ago

Currently bergamot-translator is just not loading non-breaking prefixes https://github.com/browsermt/bergamot-translator/issues/104 . This is bad and should be fixed. I think the clean way to do this is to ship the file for the source language. They're small enough that some copying is probably ok.

jerinphilip commented 3 years ago

Can you bring the relevant nonbreaking_prefixes.xx into the archive, @XapaJIaMnu? I'll pick this up at BRT to include tests for https://github.com/browsermt/bergamot-translator/pull/172.

XapaJIaMnu commented 3 years ago

Where exactly do we get those from? Is that part of ssplit, @ugermann?

kpu commented 3 years ago

They come from Moses: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes

ugermann commented 3 years ago

They actually ship with the sentence splitter and may diverge from Moses over time, as we add additional prefixes.
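
For readers unfamiliar with these files, here is a minimal sketch of how a prefix file in the Moses format could be read. It assumes the usual layout: one prefix per line, `#` starting a comment line, and a trailing `#NUMERIC_ONLY#` marking prefixes that are non-breaking only before a number. The function name `parse_nonbreaking_prefixes` and the sample contents are hypothetical and for illustration only; this is not the loader used by bergamot-translator or ssplit.

```python
def parse_nonbreaking_prefixes(text):
    """Parse the contents of a Moses-style non-breaking prefix file.

    Returns two sets: prefixes that never end a sentence, and prefixes
    that are non-breaking only when followed by a number.
    (Hypothetical helper, not part of any project mentioned in this thread.)
    """
    marker = "#NUMERIC_ONLY#"
    always, numeric_only = set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.endswith(marker):
            # e.g. "No #NUMERIC_ONLY#": non-breaking only before digits
            numeric_only.add(line[: -len(marker)].strip())
        else:
            always.add(line)
    return always, numeric_only


# Illustrative sample, not a real shipped file.
sample = """# Sample in the Moses format
Mr
Dr
No #NUMERIC_ONLY#
"""
always, numeric_only = parse_nonbreaking_prefixes(sample)
```

Shipping the source-language file alongside the model would let a loader along these lines run at startup without depending on a Moses checkout.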