Use LASER to improve data quality

argosopentech / argos-translate

Open-source offline translation library written in Python

https://www.argosopentech.com

MIT License


Use LASER to improve data quality #119

Open PJ-Finlay opened 3 years ago

PJ-Finlay commented 3 years ago

https://github.com/facebookresearch/LASER

PJ-Finlay commented 3 years ago

Creating a new issue to track using LASER to improve data quality. LASER generates sentence embeddings that are semantically consistent across languages, so a sentence and its translation should map to nearby vectors. This makes it possible to measure how similar the two sides of a parallel pair are and to remove the bad pairs.

ArtanisTheOne commented 1 year ago

I use SentenceTransformers to do the same thing with cosine similarity. It increases pre-processing time a lot (it is slow: even with an optimized cosine calculation it runs at about 1000 examples/s), but it provides a decent filter.

You would use one of the multilingual models, which are listed here

Mentioned it here in the forum
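
For anyone who wants to try the filtering discussed in this thread, here is a minimal sketch of the idea, not the code anyone here actually runs. It embeds both sides of each parallel pair and keeps the pair only if the cosine similarity of the two embeddings clears a threshold. The names `filter_parallel_pairs` and `embed`, and the `0.7` default threshold, are assumptions made up for illustration; `embed` stands in for whatever multilingual sentence encoder you pick (LASER or a SentenceTransformers model).

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical interface: takes a list of sentences, returns one vector each.
Embedder = Callable[[List[str]], Sequence[Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    embed: Embedder,
    threshold: float = 0.7,  # assumed value; tune it on your own data
) -> List[Tuple[str, str]]:
    """Keep only (source, target) pairs whose embeddings are similar enough."""
    pairs = list(pairs)
    if not pairs:
        return []
    # Embed each side in one batch so the encoder is called only twice.
    src_vecs = embed([src for src, _ in pairs])
    tgt_vecs = embed([tgt for _, tgt in pairs])
    return [
        pair
        for pair, src_vec, tgt_vec in zip(pairs, src_vecs, tgt_vecs)
        if cosine_similarity(src_vec, tgt_vec) >= threshold
    ]
```

With SentenceTransformers, `embed` could be the `encode` method of a `SentenceTransformer` loaded with one of the multilingual models. The pure-Python cosine here is only for readability; at the data sizes involved you would want a vectorized version.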