ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment
Read json exports #1

Closed fedenanni closed 3 years ago

fedenanni commented 4 years ago

Quick overview of each json file to understand basic structure

fedenanni commented 4 years ago

For each document we have the following fields:

id : C122306260
parentId : C122306134
unitTitle : Affaire Gigante : propriété, quartier du Défends à Aix.
spell : ['Affaire Gigante : propriété, quartier du Défends à Aix.', 'Gigante']
levelName : clevel
startDate : 1960-01-01T00:00:01Z
endDate : 1980-12-31T23:59:59Z
alternateUnitdate : s.d.
dateType : normal
country : FRANCE:G:2
repositoryCode : FR-FRAD013
langMaterial : fre
ai : Archives départementales des Bouches-du-Rhône:1879
recordId : FRAD013_151J_01
dao : False
numberOfDao : 0
numberOfDaoBelow : 0
numberOfDescendents : 0
numberOfAncestors : 3
other : Gigante
titleProper : 151 J - Pourtal, expertises comptables (1815-1983):F684632
recordType : fa
topic : ['economics']
F0_s : 151 J - Pourtal, expertises comptables (1815-1983):G:F684632
FID0_s : F684632
F1_s : 00000001:Affaires traitées par M. Pourtal:G:C122305060
FID1_s : C122305060
F2_s : 00000010:Dossiers n° 1300 à 1469.:G:C122306134
FID2_s : C122306134
AID1_s : A1879
AID0_s : A407
A1_s : Archives départementales des Bouches-du-Rhône:L:A1879
A0_s : Archives départementales:G:A407
orderId : 125
openData : False
timestamp : 2019-06-23T21:51:22.918Z
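
For anyone wanting to repeat this quick overview, here is a minimal sketch of how one record could be inspected in Python. It assumes each export is a JSON array of flat objects, which this thread does not confirm. The inline sample copies only a few of the fields listed above, and `summarize_record` is a hypothetical helper name.

```python
import json

# Assumption: an export is a JSON array of flat records. The sample copies
# a few fields from the record shown above; the real file layout may differ.
SAMPLE_EXPORT = """
[
  {
    "id": "C122306260",
    "parentId": "C122306134",
    "unitTitle": "Affaire Gigante : propriété, quartier du Défends à Aix.",
    "langMaterial": "fre",
    "topic": ["economics"]
  }
]
"""


def summarize_record(record):
    """Return (field name, value type name) pairs, sorted by field name."""
    return sorted((key, type(value).__name__) for key, value in record.items())


records = json.loads(SAMPLE_EXPORT)
for name, type_name in summarize_record(records[0]):
    print(f"{name} : {type_name}")
```

For a real export, `json.load` on an open file handle would replace `json.loads` on the inline string.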

fedenanni commented 4 years ago

I could use the langMaterial for the language. Which field should I use for the textual content?

kerstarno commented 4 years ago

langMaterial is the best indication of the language, even though it records the language of the material rather than the language of the description, which is what we would really be after. Unfortunately, the version of EAD that we are using does not include an element (or attribute) to capture the language of description.

As for the textual content, we might want to consider a combination of the following:

I think there should also be a field for the scope and content note of the "item". It is not present in this example, so I am not quite sure about the name. I will have a look and add it later on.

kerstarno commented 4 years ago

P.S.: Should be scopeContent, but maybe @stefanpapp-ape could confirm.

fedenanni commented 4 years ago

@kerstarno @stefanpapp-ape hi! Yes, we have scopeContent from time to time. Below is a quick overview of how many entries have scopeContent in each .json file, and what percentage of that file's entries this represents.

economics.json: 9,598 (6.66%)
germanDemocraticRepublic.json: 55,104 (47.03%)
slavery.json: 657 (85.88%)
frenchNapoleonI.json: 4 (66.67%)
catholicism.json: 348 (23.22%)

fedenanni commented 4 years ago

Here are some stats on the langMaterial field:

economics.json: 5,060 (3.51%)
germanDemocraticRepublic.json: 95,455 (81.46%)
slavery.json: 0 (0.0%)
frenchNapoleonI.json: 0 (0.0%)
catholicism.json: 0 (0.0%)
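
Per-file figures like these (a count and a share of the file's entries) could be computed roughly as follows. This is only a sketch under assumptions: `field_coverage` is a hypothetical helper, the records are toy data rather than the real exports, and a field is treated as missing when it is absent, `None`, an empty string or an empty list.

```python
def field_coverage(records, field):
    """Return (count, percentage) of records with a non-empty value for `field`."""
    total = len(records)
    present = sum(1 for r in records if r.get(field) not in (None, "", []))
    pct = 100.0 * present / total if total else 0.0
    return present, round(pct, 2)


# Toy data for illustration only; file names and ids are made up.
exports = {
    "a.json": [
        {"id": "1", "langMaterial": "fre"},
        {"id": "2"},
        {"id": "3", "langMaterial": ""},
        {"id": "4", "langMaterial": "ger"},
    ],
    "b.json": [{"id": "5"}],
}

for filename, records in exports.items():
    count, pct = field_coverage(records, "langMaterial")
    print(f"{filename} {count:,} {pct} %")
```

The same helper works for scopeContent by passing that field name instead.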

It might be easier to just detect the language from the text.

fedenanni commented 4 years ago

We currently have 263,482 elements in a dataframe structured like this:

["id","langMaterial","unitTitle","titleProper","scopeContent","topic","filename"]

Of them, 65,679 have a scopeContent (which is the main element for topic detection), with this distribution:

'germanDemocraticRepublic.json', 55,073
'economics.json', 9,596
'slavery.json', 657
'catholicism.json', 348
'frenchNapoleonI.json', 4
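
A distribution like the one above can be derived from the flat structure with the standard library alone. The sketch below uses plain dictionaries in place of a dataframe to stay dependency-free. `flatten_exports` and `scope_content_distribution` are hypothetical names, and the input is toy data, not the real exports.

```python
from collections import Counter

# Columns taken from the structure listed above; 'filename' is added per row.
COLUMNS = ["id", "langMaterial", "unitTitle", "titleProper", "scopeContent", "topic"]


def flatten_exports(exports):
    """Turn {filename: [record, ...]} into flat rows with a 'filename' column."""
    rows = []
    for filename, records in exports.items():
        for record in records:
            row = {col: record.get(col) for col in COLUMNS}
            row["filename"] = filename
            rows.append(row)
    return rows


def scope_content_distribution(rows):
    """Count rows with a non-empty scopeContent per source file, largest first."""
    counts = Counter(row["filename"] for row in rows if row["scopeContent"])
    return counts.most_common()


# Toy data for illustration only.
toy_exports = {
    "economics.json": [
        {"id": "E1", "scopeContent": "Contient des comptes."},
        {"id": "E2"},
    ],
    "germanDemocraticRepublic.json": [
        {"id": "G1", "scopeContent": "Enthält Berichte."},
        {"id": "G2", "scopeContent": "Enthält Protokolle."},
    ],
}

rows = flatten_exports(toy_exports)
print(scope_content_distribution(rows))
```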

I'll start with these using something unsupervised, given the highly unbalanced setting (see #2).

kerstarno commented 4 years ago

Hi @fedenanni

Thanks for the statistics. Very interesting. I agree with your evaluation regarding langMaterial. It's a tricky aspect of archival description, as - more often than not - archival institutions will mainly hold material written in the official language(s) of their countries and would also describe their material in the official language(s) of their countries. Hence it's seldom that you would find the language of the material specified as part of the description.

I'm assuming that the relatively high percentage for the GDR topic stems from the circumstance that there might be quite some material in Russian included. Or it could be related to a slightly different description tradition in the institutions where these materials were originally held.

It's a pity that there's a relatively low percentage of elements with scopeContent in this first dataset - roughly 25% if I'm calculating correctly.

Might it be useful to identify a topic that has a higher percentage of scopeContent notes? Not sure how easy this is to determine, @stefanpapp-ape?

fedenanni commented 4 years ago

Hi! I am using a simple python library for checking the language of elements that do not have langMaterial, by concatenating the titles and the scopeContent. Here are the results:

[('de', 55,079), ('fr', 10,581), ('en', 12), ('it', 5)]

I could then start only with word embeddings in French and German for the moment.
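
The procedure described above (skip rows that already have langMaterial, concatenate the titles and scopeContent, detect the language, count) can be sketched as below. The library actually used is not named in this thread, so the sketch swaps in a toy stopword-based detector purely as a stand-in; any real detector could be passed through the `detector` parameter. Treating "the titles" as unitTitle plus titleProper is also an assumption, and all function names and rows are hypothetical.

```python
from collections import Counter

# Toy stand-in detector with tiny stopword lists. This is NOT the library
# used in the thread; it only keeps the sketch runnable without dependencies.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "von", "mit", "für"},
    "fr": {"le", "la", "les", "des", "du", "et", "à"},
    "en": {"the", "of", "and", "with", "for", "from"},
    "it": {"il", "di", "della", "e", "con", "per"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often, or None if none match."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_text(row):
    """Concatenate the titles and scopeContent, skipping missing fields."""
    parts = [row.get("unitTitle"), row.get("titleProper"), row.get("scopeContent")]
    return " ".join(p for p in parts if p)


def language_counts(rows, detector=detect_language):
    """Count detected languages for rows that have no langMaterial."""
    counts = Counter()
    for row in rows:
        if row.get("langMaterial"):
            continue  # language already recorded, nothing to detect
        lang = detector(build_text(row))
        if lang:
            counts[lang] += 1
    return counts.most_common()


# Toy rows for illustration only.
toy_rows = [
    {"unitTitle": "Berichte und Protokolle der Abteilung",
     "scopeContent": "Enthält Schreiben von der Leitung"},
    {"unitTitle": "Dossiers de la famille et des comptes du domaine"},
    {"langMaterial": "ger", "unitTitle": "Akten der Verwaltung"},
    {"unitTitle": "Sitzungen mit dem Rat für Wirtschaft"},
]

print(language_counts(toy_rows))
```

A stopword count like this is far too crude for real data; it is here only to show the shape of the pipeline.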