ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment

Select one initial collection to add #9

Open fedenanni opened 3 years ago

fedenanni commented 3 years ago

maybe Genealogies in Latvian and French

martamu commented 3 years ago

full topics list.xlsx

Hey guys, here is the file with the full topics list and the languages. The 9 new topics overall could be:

GENEALOGY (again, but adding Latvian to the languages)

Because they are in both French and German: ARCHITECTURE, CHARTERS, CONCENTRATION CAMPS, DEMOCRACY

Because it is only in German: GDR PARTIES AND TRADE UNIONS

And the largest topics of those available in French only: EDUCATION, PHOTOGRAPHY, TRANSPORT

Looks like a good combination!

fedenanni commented 3 years ago

@martamu nice, thanks! Maybe we could start with Genealogy + one in both languages, one in German and one in French. Let @stefanpapp-ape know which ones and we can create .json outputs, like we did last time.

kerstarno commented 3 years ago

@martamu and @fedenanni -

thanks for the list and the input with regard to how to narrow it down for starters. May I suggest that we go for the following:

I would leave out the remaining German-only topic (GDR Parties and Trade Unions). I don't think it gives us much more to work with, in terms of either topic or language, than what we already had in the first round, which included the topic GDR (German Democratic Republic).

Instead, I'd suggest adding the topic Health. While it again only covers French, it would be a good one to work with for a workshop or a research project, given the current global situation that will certainly stay with us for some time.

@stefanpapp-ape - could you please generate an up-to-date JSON export for "Genealogy", "Democracy", "Transport", and "Health" and save these in our shared folder on Google under "WorkingFiles > JSON_Exports"? Thanks.

stefanpapp-ape commented 3 years ago

@fedenanni Files are already available in said folder. Feel free to shout at any time if more are needed.

fedenanni commented 3 years ago

@stefanpapp-ape ah wonderful thanks!

fedenanni commented 3 years ago

Ok - fully processed all our JSONs (old and new). These are the languages, taken either from the metadata or detected with a tool:

[('ger', 60279), ('fr', 56014), ('de', 14187), ('fre', 7521), ('fi', 2705), ('it', 303), ('pl', 140), ('sv', 53), ('en', 36), ('lav', 28), ('lv', 21), ('heb', 14), ('es', 14), ('rus', 13), ('ca', 11), ('pol', 11)]

Note that some of them are duplicates (e.g., 'ger' and 'de'), which I'll aggregate in a following step.
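For illustration, the aggregation step mentioned above could look roughly like the sketch below. It is not the code actually used in this project. The mapping from three-letter to two-letter codes and the names `TO_TWO_LETTER` and `aggregate_language_counts` are assumptions made for this example; the counts are copied from the list above.

```python
from collections import Counter

# Counts as reported in the comment above.
raw_counts = [
    ('ger', 60279), ('fr', 56014), ('de', 14187), ('fre', 7521),
    ('fi', 2705), ('it', 303), ('pl', 140), ('sv', 53), ('en', 36),
    ('lav', 28), ('lv', 21), ('heb', 14), ('es', 14), ('rus', 13),
    ('ca', 11), ('pol', 11),
]

# Assumed normalisation: map the three-letter codes seen in the data
# to their two-letter equivalents (hypothetical table for this sketch).
TO_TWO_LETTER = {
    'ger': 'de',
    'fre': 'fr',
    'lav': 'lv',
    'pol': 'pl',
    'heb': 'he',
    'rus': 'ru',
}


def aggregate_language_counts(counts, mapping=TO_TWO_LETTER):
    """Sum the counts of codes that refer to the same language."""
    merged = Counter()
    for code, n in counts:
        # Codes without an entry in the mapping are kept unchanged.
        merged[mapping.get(code, code)] += n
    return merged


merged = aggregate_language_counts(raw_counts)
for code, n in merged.most_common():
    print(code, n)
```

Under this mapping, 'ger' and 'de' merge into one German bucket and 'fre' and 'fr' into one French bucket, and the total number of records is unchanged.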

fedenanni commented 3 years ago

Acquired cross-lingual word embeddings for Hebrew, Swedish, Spanish, Russian.

Missing: Latvian, Catalan.

fedenanni commented 3 years ago

Next step here is to discuss how to handle missing languages, see #12