ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment

Entities not appearing in corpus - errors? #72

Open fedenanni opened 2 years ago

fedenanni commented 2 years ago

The following entities do not seem to appear in the dataset. Is that possible, or is this still a bug, i.e. do we know that they occur in the corpus?

Benedetto Croce
Transport*
Inquisición
underground AND london
Inquisition
bretton woods
Петр I
Russian literature' 
Russian literature
Russian AND literature
dalmine
rocca AND agostino 
rocca agostino
tenaris
steinthal
heritage
techint
Napoleon I AND World War I
heritage and transport
Kodály Zoltán
Kodály
Beethoven music
Ford AND Motor
Ford AND Motor AND Company
Beethoven AND zene
Beethoven AND Musik
LRS
Beethoven UND Musik
Slovenian
Pope Paul
Pope and Paul
Pope AND Paul
Россия AND армия
eresia AND chiesa
Enver*
transport corridor Germany
Hospital Germany
German AND army
Ministerium* für Staatssicherheit
Keyn*
keynes
notaires
eric hobsbawm
eric and hobsbawm
eric AND hobsbawm
Beethoven AND family
1900 AND Paris
Beethoven* AND family
Dalmine
KARTE AND Napoleon
paris conference
1900 AND avril
Liquidation AND Deutschland
battle AND somme
Napoleon*
napoleon AND maps
Prešern
Samuelson
franz ferdinand
cathedral AND window
25 AND avril
napole*
Pope AND Francis
Marsaillaise
Schlacht AND somme
greppi
Wilhelm AND Kaiser
German Democratic Republic
Arhiv Republike Slovenije
First AND May
May AND First
May OR First
Krankenhaus AND Germany
First OR May
Développement et Afrique et europe
Portugal AND Lisbon
Autres instruments de recherche
Développement and Afrique and Europe 
Développement and Afrique 
Familienwahlrecht
Kaiser AND empereur
Russia AND Napoleon
Hortense de Beauharnais
mercedarian*
winston churchill
Greppi

fedenanni commented 2 years ago

Each of these should additionally be run with broad search (see the example in #70, where we actually did have mentions of Churchill). The ones that still return nothing need a manual check to finally rule out bugs.
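A minimal sketch of that two-pass check, to make the triage concrete. Everything here is an assumption for illustration: `normal_search` and `broad_search` are hypothetical stand-ins (contiguous phrase match vs. all tokens anywhere in the document), not the project's actual functions, and the two documents are made up. Boolean operators such as `AND` are not parsed and would be treated as ordinary tokens.

```python
import re

def tokens(text):
    # Lowercased word tokens; \w is Unicode-aware in Python 3.
    return re.findall(r"\w+", text.casefold())

def normal_search(query, docs):
    # Hypothetical stand-in for normal search:
    # the query tokens must occur as a contiguous phrase.
    q = tokens(query)
    hits = []
    for i, doc in enumerate(docs):
        d = tokens(doc)
        if q and any(d[j:j + len(q)] == q for j in range(len(d) - len(q) + 1)):
            hits.append(i)
    return hits

def broad_search(query, docs):
    # Hypothetical stand-in for broad search:
    # every query token occurs somewhere in the document, in any order.
    q = set(tokens(query))
    return [i for i, doc in enumerate(docs) if q and q <= set(tokens(doc))]

def triage(queries, docs):
    # Sort each query into: found by normal search, found only by
    # broad search, or found by neither (needs a manual check).
    report = {"normal": [], "broad_only": [], "manual_check": []}
    for query in queries:
        if normal_search(query, docs):
            report["normal"].append(query)
        elif broad_search(query, docs):
            report["broad_only"].append(query)
        else:
            report["manual_check"].append(query)
    return report

# Made-up documents, only to exercise the three outcomes.
docs = [
    "Letters of Churchill, Winston, 1940-1945",
    "Correspondence of Benedetto Croce",
]
report = triage(["winston churchill", "Benedetto Croce", "tenaris"], docs)
print(report)
```

With these toy documents, "winston churchill" is only found by the broad pass because the name appears in inverted order, which is the kind of case the extra broad-search run is meant to catch.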

fedenanni commented 2 years ago

It seems a few of them might lead to results in other settings.

fedenanni commented 2 years ago

Current status:

Next step:

fedenanni commented 2 years ago

Final summary of this issue:

Петр I

Russian literature

Kodály Zoltán

Kodály

LRS


For the moment we handle this situation by hardcoding a cutoff on the number of aliases ([d3db4c3](https://github.com/ArchivesPortalEuropeFoundation/Topic-Detection/commit/d3db4c39f141d5f778513acca9f5592a6c50b3e9))
- The rest are missing with both normal and broad search.
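As a rough illustration of what a hardcoded alias cutoff could look like: this is one possible reading, not the code from the linked commit. The name `limit_aliases`, the constant `MAX_ALIASES` and its value, and the example aliases are all hypothetical.

```python
MAX_ALIASES = 50  # hypothetical value; the actual cutoff is in the linked commit

def limit_aliases(aliases, max_aliases=MAX_ALIASES):
    # Drop empty entries and case-insensitive duplicates (keeping the
    # first occurrence), then keep at most max_aliases entries.
    seen = set()
    unique = []
    for alias in aliases:
        key = alias.casefold().strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(alias.strip())
    return unique[:max_aliases]

# Illustrative aliases only, not taken from the project's data.
aliases = ["Kodály Zoltán", "Zoltán Kodály", "kodály zoltán ", "Kodály"]
limited = limit_aliases(aliases, max_aliases=2)
print(limited)
```

Capping the alias list bounds how many search variants a single entity can generate. The trade-off is that aliases past the cutoff are never searched, so an entity may be reported as missing even though one of its dropped aliases occurs in the corpus.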

fedenanni commented 2 years ago

These should be double-checked to make sure they are not present in the full collection on APE:

Autres instruments de recherche
Familienwahlrecht
heritage and transport
Développement and Afrique and Europe 
Samuelson
Beethoven music
steinthal
Pope Paul
greppi
Arhiv Republike Slovenije
Pope and Paul
heritage
Inquisición
Dalmine
First OR May
notaires
Slovenian
transport corridor Germany
Développement et Afrique et europe
Benedetto Croce
tenaris
Inquisition
franz ferdinand
paris conference
Kodály Zoltán
Kodály
Russian literature
rocca agostino
eric hobsbawm
German Democratic Republic
Greppi
Développement and Afrique 
techint
keynes
Beethoven UND Musik
Петр I
Hortense de Beauharnais
eric and hobsbawm
LRS
Prešern
bretton woods
dalmine
Hospital Germany
Russian literature' 
Marsaillaise

kerstarno commented 2 years ago

Thank you, @fedenanni. I probably won't get around to any testing in this regard during this week, but might be able to make some time for it next week.

fedenanni commented 2 years ago

@kerstarno no problem - this is just a final check before considering all tests completed. I believe the main issue with broad_search is the timeout we get when too many candidates are retrieved, which is why the method becomes so slow in certain settings.