Open PJ-Finlay opened 3 years ago
Creating a new issue to use LASER to improve data quality. LASER generates sentence embeddings that are semantically consistent across languages, so a sentence and its translation should map to nearby vectors. That makes it possible to score how similar the two sides of a parallel pair are and remove the bad pairs.
I use this [SentenceTransformers] to do the same thing with cosine similarity. It'll increase pre-processing time a lot (it's slow: even with an optimized cosine calculation I get about 1000 examples/s), but it provides a decent filter.
You would use the multilingual models which are listed here
Mentioned it here in the forum
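For anyone who wants to try it, here is a rough sketch of the filtering step, not the exact code I run. Assumptions: the model name `paraphrase-multilingual-MiniLM-L12-v2` is just one of the multilingual SentenceTransformers models, the `0.7` threshold is an arbitrary starting point you would tune on your own data, and the helper names (`load_encoder`, `pair_similarities`, `filter_parallel`) are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# Anything that maps a list of sentences to a 2D array (one row per sentence).
Encoder = Callable[[Sequence[str]], np.ndarray]


def load_encoder(model_name: str = "paraphrase-multilingual-MiniLM-L12-v2") -> Encoder:
    """Wrap a multilingual SentenceTransformers model as an Encoder.

    The model name is an assumption; swap in whichever multilingual model you prefer.
    Imported lazily so the rest of the sketch works without the package installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: model.encode(
        list(sentences), batch_size=64, convert_to_numpy=True
    )


def pair_similarities(
    src: Sequence[str], tgt: Sequence[str], encode: Encoder
) -> np.ndarray:
    """Cosine similarity between each source sentence and its aligned target."""
    if len(src) != len(tgt):
        raise ValueError("src and tgt must have the same number of sentences")
    a = np.asarray(encode(src), dtype=np.float32)
    b = np.asarray(encode(tgt), dtype=np.float32)
    # Normalize rows so the row-wise dot product equals cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return np.sum(a * b, axis=1)


def filter_parallel(
    src: Sequence[str],
    tgt: Sequence[str],
    encode: Encoder,
    threshold: float = 0.7,  # hypothetical cutoff, tune per language pair
) -> List[Tuple[str, str]]:
    """Keep only the pairs whose similarity is at least `threshold`."""
    sims = pair_similarities(src, tgt, encode)
    return [(s, t) for s, t, sim in zip(src, tgt, sims) if sim >= threshold]
```

Usage would be something like `filter_parallel(en_lines, es_lines, load_encoder())`. Encoding in batches and computing only the row-wise dot product (rather than a full similarity matrix) keeps the cosine step cheap; the encoder forward pass is what dominates the runtime.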