ec-doris / kohesio-backend

APIs serving Kohesio's frontend
https://kohesio.ec.europa.eu

Multilingual semantic search? #163

Closed madewild closed 1 year ago

madewild commented 2 years ago

Fair point from Polish translator today: "Will semantic search also work in other languages? If yes, should I translate the examples? Even if semantic search works in all languages, I can only guess whether the example will still be valid. If semantic search does not work in translation, what should we do with the sentence which describes it? Should we leave the English words?"

For now we will say it supports only English, but we should investigate the possibility of offering semantic search in all 24 languages... @faustusdotbe @svili any ideas? @raphdom suggested translating the whole H2020 corpus with eTranslation and training 24 w2v models. :)

It will also be harder to evaluate the quality...

svili commented 2 years ago

Never thought I'd use it again, but a long time ago my master's thesis (I can try to dig up the full text somewhere) was about learning a mapping between the embedding spaces of different languages using a parallel corpus... We already have the translations of the projects to use as a parallel corpus, so it's just a matter of finding a good objective for learning the mapping. Evaluating the quality of the different language models is not much of a problem in my mind, if we take the English recommendations as the gold standard. But this is just at first glance; it might not be applicable here.
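For illustration, the kind of mapping described above is often learned as an orthogonal Procrustes problem: given vectors for known translation pairs, find the rotation that best aligns one embedding space with the other. A minimal sketch with invented random data standing in for real embeddings:

```python
import numpy as np

# Toy stand-ins for aligned word vectors in two languages; in practice
# X and Y would be the embeddings of translation pairs taken from two
# monolingual word2vec models plus a seed dictionary.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))            # source-language vectors
true_Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
Y = X @ true_Q                                # target-language vectors

# Orthogonal Procrustes: find the rotation W minimising ||XW - Y||_F.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

alignment_error = np.linalg.norm(X @ W - Y)
```

On this toy data the learned `W` recovers the true rotation exactly; with real embeddings the residual error is what the evaluation metric would have to judge.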

drvenabili commented 2 years ago

Right out of the gate I would say that the easiest approach would be to simply translate the words we already have in our EN model and attach the vectors to their translations. Of course it's not ideal, because that's not how language works, and we'd probably have issues with Finnish and its 15 cases, Hungarian, etc.
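The "translate the vocabulary, keep the vectors" idea amounts to relabelling the English entries. The words, vectors and EN→PL dictionary below are invented purely for illustration:

```python
# Minimal sketch: reuse each English word's vector for its translation.
en_vectors = {
    "road": [0.1, 0.9],
    "bridge": [0.8, 0.2],
}
en_to_pl = {"road": "droga", "bridge": "most"}

# Build a Polish "model" by relabelling the English vectors.
pl_vectors = {en_to_pl[word]: vec for word, vec in en_vectors.items()}
```

The weakness shows up immediately for morphologically rich languages: one English word corresponds to many inflected surface forms in Finnish or Hungarian, so a single transferred vector cannot cover them all.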

@svili Why do we need to map embeddings cross-lingually? I assume that we will have 24 independent indices, and that a model will only be used for its own language, right? Not linking them will result in difficult evaluation and different results across languages, though.
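The "24 independent indices" setup can be sketched as simple per-language routing, with no cross-lingual mapping involved. All names and data here are invented for illustration:

```python
# Each language gets its own index; a query is only ever answered by
# the index for its own language.
indices = {
    "en": {"solar": ["project-42"]},
    "pl": {"sloneczny": ["project-42"]},
}

def search(lang: str, term: str) -> list:
    """Look the term up only in the index for the query's language."""
    return indices.get(lang, {}).get(term, [])
```

The trade-off noted above is visible here: nothing forces `search("en", ...)` and `search("pl", ...)` to rank the same projects the same way, so results and evaluation can diverge across languages.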

On a larger note: given the poor quality and skewed balance of the EN text descriptions, I doubt their translations will give good models. I would be more comfortable with Raph's idea of translating H2020, tbh.

svili commented 2 years ago

> @svili Why do we need to map embeddings cross-lingually? I assume that we will have 24 independent indices, and that a model will only be used for its own language, right? Not linking them will result in difficult evaluation and different results across languages, though.

Yeah, you're probably right: simply translating H2020 might be good enough, and in the end we would still have to maintain the same number of models, so the mapping has no advantage in that regard.

D063520 commented 1 year ago

@faustusdotbe prepared this, @DiaZork can you please integrate the new multilingual API?

drvenabili commented 1 year ago

🥳