Closed PinzhenChen closed 3 years ago
I am concerned that this might be very good for WMT but bad for non-WMT use cases.
Frequently in the wild, users input sentences that contain foreign-language words (such as the original Chinese/Cyrillic spelling of names). Often a model can learn to simply copy these symbols to the target, but if the symbols are missing entirely from the vocab, it can never perform a copy.
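To illustrate the concern, here is a toy sketch (not code from this repo) showing why a vocab built only from Latin text makes copying impossible: any character absent from the vocab simply cannot be represented, so a Cyrillic name can never survive translation by copying.

```python
# Toy character-level vocab containing only Latin letters and spaces.
# A real subword vocab is larger, but the failure mode is the same.
vocab = set("abcdefghijklmnopqrstuvwxyz .")

def can_copy(sentence, vocab):
    """Return True only if every character can be represented by the vocab,
    i.e. a copy to the target side is even possible."""
    return all(ch.lower() in vocab for ch in sentence)

print(can_copy("the name is smith", vocab))  # True
print(can_copy("the name is Смит", vocab))   # False: Cyrillic chars are OOV
```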
I agree with XapaJIaMnu. The code is now commented out and can be enabled with some args.
Maybe you forgot to push?
The code for removing non-Latin characters is in train-student/clean/tools/clean-parallel.py and is commented out.
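For context, a non-Latin filter of this kind typically drops lines whose alphabetic characters fall mostly outside the Latin Unicode blocks. The sketch below is a hypothetical illustration of that idea; the actual commented-out logic in clean-parallel.py may differ (the `threshold` parameter and `mostly_latin` name are assumptions for this example).

```python
import re

# Matches any character outside the Basic Latin / Latin Extended blocks.
NON_LATIN = re.compile(r"[^\u0000-\u024F\u1E00-\u1EFF]")

def mostly_latin(line, threshold=0.5):
    """Keep a line only if at least `threshold` of its alphabetic
    characters are in the Latin Unicode blocks."""
    letters = [ch for ch in line if ch.isalpha()]
    if not letters:
        return True  # nothing to judge; keep the line
    latin = [ch for ch in letters if not NON_LATIN.search(ch)]
    return len(latin) / len(letters) >= threshold

print(mostly_latin("hello world"))  # True
print(mostly_latin("Привет мир"))   # False: all letters are Cyrillic
```

This is exactly the behavior the concern above is about: a line like "Привет мир" would be filtered out of the training data, so the model never learns to copy Cyrillic.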
The change made to train-student/clean/clean-corpus.sh is to remove empty lines before predicting the language.
Remove empty lines before running language ID so that tools/langid-fasttext.py does not complain about them.
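The fix amounts to a simple pre-filter before the corpus reaches language ID. A minimal sketch, assuming the input is an iterable of text lines (the `drop_empty_lines` helper is hypothetical, not the actual code in clean-corpus.sh):

```python
def drop_empty_lines(lines):
    """Yield only lines with non-whitespace content, so downstream
    language identification never receives an empty input."""
    for line in lines:
        if line.strip():
            yield line

sample = ["Hello world\n", "\n", "   \n", "Bonjour\n"]
print(list(drop_empty_lines(sample)))  # ['Hello world\n', 'Bonjour\n']
```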