rule-based cleaning before lang id

browsermt / students

Efficient teacher-student models and scripts to make them


rule-based cleaning before lang id #26

Closed PinzhenChen closed 3 years ago

PinzhenChen commented 3 years ago

Remove empty lines before running language ID to prevent tools/langid-fasttext.py from complaining about empty lines.

XapaJIaMnu commented 3 years ago

I am concerned that this might be very good for WMT but bad for non-WMT use cases.

Frequently in the wild, the user might input sentences that contain some foreign-language words (like the original Chinese/Cyrillic spelling of names). Often, a model can learn to just copy these symbols to the target; however, if the symbols are missing entirely from the vocab, it can never perform a copy.

PinzhenChen commented 3 years ago

I agree with XapaJIaMnu. The code is now commented out. It could be made available again through some args.
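A rough sketch of what gating the rule behind an argument could look like. The flag name `--remove-non-latin`, the helper `keep_pair`, and the character range are all hypothetical, and the sketch assumes the rule drops a whole sentence pair; the actual rule in clean-parallel.py may work differently.

```python
import argparse
import re

# Assumption: "Latin" is approximated as code points up to U+024F
# (Basic Latin, Latin-1 Supplement, Latin Extended-A and Extended-B).
# Anything above that range, including some punctuation, counts as non-Latin.
NON_LATIN = re.compile(r"[^\u0000-\u024F]")


def build_parser():
    """Build a parser with a hypothetical opt-in flag for the rule."""
    parser = argparse.ArgumentParser(description="Rule-based parallel corpus cleaning (sketch)")
    parser.add_argument(
        "--remove-non-latin",
        action="store_true",
        help="drop sentence pairs that contain non-Latin characters (off by default)",
    )
    return parser


def keep_pair(src, trg, remove_non_latin):
    """Return True if the pair should stay in the corpus."""
    if not remove_non_latin:
        # Default behaviour: keep foreign-script tokens so the model
        # still sees them and can learn to copy them.
        return True
    return not (NON_LATIN.search(src) or NON_LATIN.search(trg))
```

With the flag off by default, non-WMT users keep foreign-script names in the training data, while a WMT-style setup could opt in explicitly.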

XapaJIaMnu commented 3 years ago

Maybe you forgot to push?

PinzhenChen commented 3 years ago

The code for removing non-Latin characters is in train-student/clean/tools/clean-parallel.py, where it is commented out.

The change made to train-student/clean/clean-corpus.sh is to remove empty lines before predicting the language.
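For illustration only, the empty-line filter could be sketched in Python as below. This is not the code in clean-corpus.sh; it assumes the corpus is tab-separated source/target pairs, one pair per line, and `drop_empty_pairs` is a hypothetical helper name.

```python
def drop_empty_pairs(lines):
    """Yield tab-separated sentence pairs in which every field is non-empty.

    Assumption: each input line holds one pair, with fields separated by a tab.
    Fully empty lines and pairs with a blank side are both skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A whitespace-only field counts as empty.
        if all(field.strip() for field in fields):
            yield "\t".join(fields)
```

Feeding the output of such a filter into the language ID step means tools/langid-fasttext.py never sees an empty line.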