artitw / text2text

Text2Text Language Modeling Toolkit

Cross-lingual semantic retrieval #33

Open artitw opened 2 years ago

artitw commented 2 years ago

Perform a study similar to https://arxiv.org/pdf/1907.04307.pdf, but expand it to support 100 languages using the embeddings from the translator.

Possibly start with the paper's code sample.

lere01 commented 2 years ago

@artitw

This looks interesting. Can I begin to look into this?

artitw commented 2 years ago

@lere01 thanks for your interest. I would recommend the following steps:

  1. Try out the code sample mentioned above to ensure that results from the paper are reproducible.
  2. Run the same process but use Text2Text embeddings for 100 languages.
  3. Try different types of Text2Text embeddings: (a) neural, (b) TF-IDF, and (c) BM25. We can also try an ensemble of all three.
  4. Share your findings; report on any improvements and other things you learned.
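
The steps above could be prototyped roughly as follows. This is only a minimal sketch, not the paper's code or the Text2Text API: the vectors are toy placeholders standing in for real embeddings, `lexical_scores` is a made-up matrix standing in for TF-IDF or BM25 scores, and every helper name (`score_matrix`, `precision_at_1`, `ensemble`) is hypothetical. It assumes query `i` and candidate `i` are translations of each other, so retrieval is correct when the top-scoring candidate has the same index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_matrix(queries, candidates):
    """scores[i][j] = similarity of query i to candidate j."""
    return [[cosine(q, c) for c in candidates] for q in queries]


def precision_at_1(scores):
    """Fraction of queries whose best candidate shares the query's index."""
    hits = 0
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(scores)


def normalize_row(row):
    """Min-max scale one row to [0, 1] so different scorers are comparable."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [0.0 for _ in row]
    return [(s - lo) / (hi - lo) for s in row]


def ensemble(matrices, weights=None):
    """Weighted average of row-normalized score matrices."""
    weights = weights or [1.0 / len(matrices)] * len(matrices)
    normed = [[normalize_row(row) for row in m] for m in matrices]
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, normed)) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy placeholder vectors (assumption: real ones would come from the encoder).
source_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_vecs = [[0.9, 0.1, 0.0], [0.0, 0.6, 0.8], [0.6, 0.0, 0.5]]

neural_scores = score_matrix(source_vecs, target_vecs)

# Made-up lexical scores standing in for a TF-IDF or BM25 scorer.
lexical_scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.2, 1.8]]

combined_scores = ensemble([neural_scores, lexical_scores])

print("neural P@1:  ", precision_at_1(neural_scores))
print("ensemble P@1:", precision_at_1(combined_scores))
```

Row-wise min-max scaling is just one way to put cosine similarities and unbounded lexical scores on a common scale before averaging; rank-based fusion would be another option worth comparing.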

Let us know what you think and whether you have other ideas.