ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment

Concepts missing in embedding space #71

Open fedenanni opened 2 years ago

fedenanni commented 2 years ago

This is the list of words used as queries that did not appear in the embedding space, so we weren't able to perform a search for them. Almost all of them look like mistakes, but we could run an additional entity search whenever concept search finds nothing. To be discussed.

רעוואלוציע he
Napolean* en
transpor fr
Fluchtlinge pl
Fluchtlinge pl
CatholicismANDheresy en
eresia en
catolico en
catolicismo en
1910 he
Notariat* en
Marsaillaise en
Reichskolonialamt%3AF1538359 de
greppi de
greppi es
greppi fi
greppi fr
greppi he
25-Apr en
Familienwahlrecht de
rolls-royce en
mercedarian* en
rolls-royce de
rolls-royce fr
päpstlich en
churchil en
dalmine en
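
The fallback idea mentioned above (try concept search first, then entity search if nothing comes back) could look roughly like the sketch below. This is only an illustration under assumptions: `search_with_fallback`, `concept_search` and `entity_search` are hypothetical names, and the embedding space is modelled as a plain dictionary rather than the project's actual model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def search_with_fallback(
    query: str,
    embeddings: Dict[str, Vector],              # hypothetical: word -> vector
    concept_search: Callable[[str], List[str]],  # hypothetical search function
    entity_search: Callable[[str], List[str]],   # hypothetical search function
) -> Tuple[str, List[str]]:
    """Return (search type used, results) for a single-word query."""
    # Concept search is only possible if the query has an embedding.
    if query in embeddings:
        results = concept_search(query)
        if results:
            return "concept", results
    # Either the word is missing from the embedding space or concept
    # search found nothing: fall back to entity search.
    return "entity", entity_search(query)


# Toy usage with made-up data.
toy_embeddings = {"revolution": [0.1, 0.2]}
print(search_with_fallback(
    "greppi",
    toy_embeddings,
    concept_search=lambda q: ["doc-1"],
    entity_search=lambda q: ["entity:" + q],
))
```

With this shape, a query such as `greppi` would never reach concept search and would go straight to the entity lookup, while typos would still return nothing from either search.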
fedenanni commented 2 years ago

Add a message to the user: "Please make sure the search parameters are correct."

kerstarno commented 2 years ago

To check: Familienwahlrecht would be a correct term in German, and there is a Wikidata entry for it: https://www.wikidata.org/wiki/Q1364216

fedenanni commented 2 years ago

@kerstarno @Beacannelli Hi both! Now that the dev interface is up, could you give these queries a final test? Just let me know whether they are:

  1. Typos, for which we shouldn't retrieve results, as is currently the case. See for instance the following one, which combines a typo in Italian (it should be cattolico) with the wrong selected language (en): (Screenshot 2022-05-02 at 11 21 44)

  2. Errors of the system, i.e. words that should return results but for which we don't have a word embedding. For instance: (Screenshot 2022-05-02 at 11 24 00), which returns results only if split: (Screenshot 2022-05-02 at 11 25 17)
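
For the second case, one possible way to handle queries that only work when split is to break them into parts and keep the parts that do have an embedding. The sketch below is an assumption, not what the system currently does: `split_query` and `known_parts` are hypothetical helpers, the vocabulary is a plain set of lowercase words, and the splitting rules (hyphens, whitespace, and an `AND` glued between two lowercase letters) are guesses based on entries such as `rolls-royce` and `CatholicismANDheresy` in the list above.

```python
import re
from typing import List, Set

# Split on hyphens/whitespace, or on an "AND" glued between lowercase letters.
_SPLIT_PATTERN = re.compile(r"(?<=[a-z])AND(?=[a-z])|[-\s]+")


def split_query(query: str) -> List[str]:
    """Break a query into candidate words, dropping empty pieces."""
    return [part for part in _SPLIT_PATTERN.split(query) if part]


def known_parts(query: str, vocabulary: Set[str]) -> List[str]:
    """Return the lowercased query, or its parts, found in the vocabulary."""
    if query.lower() in vocabulary:
        return [query.lower()]
    return [p.lower() for p in split_query(query) if p.lower() in vocabulary]


# Toy usage with a made-up vocabulary.
toy_vocabulary = {"rolls", "royce", "catholicism", "heresy"}
print(known_parts("rolls-royce", toy_vocabulary))
print(known_parts("CatholicismANDheresy", toy_vocabulary))
```

An empty list would then signal that no part of the query is in the embedding space, which is where the message about checking the search parameters could be shown.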