Open fedenanni opened 3 years ago
Hey guys, here are the files with the full topics list and the languages. The 9 new topics overall could be:
GENEALOGY (again, but adding Latvian to the languages)
Because they are in both French and German: ARCHITECTURE CHARTERS CONCENTRATION CAMPS DEMOCRACY
Because it is only in German: GDR PARTIES AND TRADE UNIONS
And the largest topics of those available in French only: EDUCATION PHOTOGRAPHY TRANSPORT
Looks like a good combination!
@martamu nice, thanks! Maybe we could start with Genealogy + one in both languages, one in German and one in French. Let @stefanpapp-ape know which ones and we can create .json outputs, like we did last time
@martamu and @fedenanni -
thanks for the list and the input with regard to how to narrow it down for starters. May I suggest that we go for the following:
I would leave out the one remaining German-only topic (GDR Parties and Trade Unions), as I don't think it gives us much more to work with, in terms of either topic or language, than the GDR (German Democratic Republic) topic we already included in the first round.
Instead, I'd suggest adding the topic Health, which, while again only covering French, would also be a good one to work with for a workshop or a research project, given the current global situation that will certainly stay with us for a while.
@stefanpapp-ape - could you please generate an up-to-date JSON export for "Genealogy", "Democracy", "Transport", and "Health" and save these in our shared folder on Google under "WorkingFiles > JSON_Exports"? Thanks.
@fedenanni Files are already available in said folder. Feel free to shout at any time if more are needed.
@stefanpapp-ape ah wonderful thanks!
Ok - fully processed all our JSONs (old and new). These are the languages, taken either from the metadata or detected with a tool:
[('ger', 60279), ('fr', 56014), ('de', 14187), ('fre', 7521), ('fi', 2705), ('it', 303), ('pl', 140), ('sv', 53), ('en', 36), ('lav', 28), ('lv', 21), ('heb', 14), ('es', 14), ('rus', 13), ('ca', 11), ('pol', 11)]
note that some of them are duplicates (e.g., 'ger' and 'de') that I'll aggregate in a following step.
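For reference, a minimal sketch of how that aggregation could look, assuming we normalise everything to two-letter ISO 639-1 codes. The mapping table and the function name `aggregate_language_counts` are just illustrative, not what is in the repo:

```python
from collections import Counter

# Counts exactly as reported above (metadata value or detected language).
raw_counts = [
    ('ger', 60279), ('fr', 56014), ('de', 14187), ('fre', 7521),
    ('fi', 2705), ('it', 303), ('pl', 140), ('sv', 53), ('en', 36),
    ('lav', 28), ('lv', 21), ('heb', 14), ('es', 14), ('rus', 13),
    ('ca', 11), ('pol', 11),
]

# Assumed mapping from the three-letter codes seen above to two-letter codes.
# Codes that are not listed here are kept unchanged.
TO_TWO_LETTER = {
    'ger': 'de', 'fre': 'fr', 'lav': 'lv',
    'heb': 'he', 'rus': 'ru', 'pol': 'pl',
}

def aggregate_language_counts(counts):
    """Merge counts whose codes refer to the same language."""
    merged = Counter()
    for code, n in counts:
        merged[TO_TWO_LETTER.get(code, code)] += n
    return merged

merged = aggregate_language_counts(raw_counts)
print(merged.most_common())
```

With the numbers above, this merges 'ger' and 'de' into 74466 German records and 'fr' and 'fre' into 63535 French records, leaving 12 distinct languages.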
Acquired cross-lingual word embeddings for Hebrew, Swedish, Spanish, Russian. Missing: Latvian, Catalan.
Next step here is to discuss how to handle missing languages, see #12
maybe Genealogies in Latvian and French