Closed fedenanni closed 3 years ago
For each document we have the following fields:
id : C122306260
parentId : C122306134
unitTitle : Affaire Gigante : propriété, quartier du Défends à Aix.
spell : ['Affaire Gigante : propriété, quartier du Défends à Aix.', 'Gigante']
levelName : clevel
startDate : 1960-01-01T00:00:01Z
endDate : 1980-12-31T23:59:59Z
alternateUnitdate : s.d.
dateType : normal
country : FRANCE:G:2
repositoryCode : FR-FRAD013
langMaterial : fre
ai : Archives départementales des Bouches-du-Rhône:1879
recordId : FRAD013_151J_01
dao : False
numberOfDao : 0
numberOfDaoBelow : 0
numberOfDescendents : 0
numberOfAncestors : 3
other : Gigante
titleProper : 151 J - Pourtal, expertises comptables (1815-1983):F684632
recordType : fa
topic : ['economics']
F0_s : 151 J - Pourtal, expertises comptables (1815-1983):G:F684632
FID0_s : F684632
F1_s : 00000001:Affaires traitées par M. Pourtal:G:C122305060
FID1_s : C122305060
F2_s : 00000010:Dossiers n° 1300 à 1469.:G:C122306134
FID2_s : C122306134
AID1_s : A1879
AID0_s : A407
A1_s : Archives départementales des Bouches-du-Rhône:L:A1879
A0_s : Archives départementales:G:A407
orderId : 125
openData : False
timestamp : 2019-06-23T21:51:22.918Z
I could use the `langMaterial` field for the language. Which field should I use for the textual content?
`langMaterial` is the best indication of the language, even though it's the language of the material rather than the language of the description, which is what we'd really be after. Unfortunately, the version of EAD that we are using does not include an element (or attribute) to capture the language of description.
As for the textual content, we might want to consider a combination of the following:
- `unitTitle` - the title of the "item" itself;
- `titleProper` - the title of the finding aid within which the "item" is described;
- `F0_s`, `F1_s`, `F2_s` etc. - the title(s) of the upper levels within the collection/fonds, i.e. the collection context of the "item".

I think there'd also be a field for the scope and content note of the "item", although this is not present in this example, hence I am not quite sure about the name. Will have a look and add later on.
P.S.: It should be `scopeContent`, but maybe @stefanpapp-ape could confirm.
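A minimal sketch of how the fields listed above might be combined into one text per record. The function names `hierarchy_title` and `build_text` are hypothetical, and the layout assumed for the `F*_s` and `titleProper` values (an optional numeric order prefix, then the title, then a trailing flag and/or identifier) is inferred only from the single sample record in this thread, so it may not hold for every entry.

```python
import re


def hierarchy_title(value):
    """Strip the assumed 'order:' prefix and ':flag:id' suffix from an F*_s value."""
    match = re.match(r"^(?:\d+:)?(.*):[A-Z]:[A-Z]\d+$", value)
    # Fall back to the raw value if the assumed layout does not apply.
    return match.group(1) if match else value


def build_text(record):
    """Concatenate item title, finding-aid title, upper-level titles and scope note."""
    parts = [record.get("unitTitle", "")]
    # titleProper appears to end with ':<id>' in the sample record (assumption).
    parts.append(re.sub(r":[A-Z]\d+$", "", record.get("titleProper", "")))
    # Collect F0_s, F1_s, ... in numeric order (top level first).
    level_keys = sorted(
        (key for key in record if re.fullmatch(r"F\d+_s", key)),
        key=lambda key: int(key[1:-2]),
    )
    parts.extend(hierarchy_title(record[key]) for key in level_keys)
    parts.append(record.get("scopeContent", ""))
    return " ".join(part for part in parts if part)


sample = {
    "unitTitle": "Affaire Gigante : propriété, quartier du Défends à Aix.",
    "titleProper": "151 J - Pourtal, expertises comptables (1815-1983):F684632",
    "F0_s": "151 J - Pourtal, expertises comptables (1815-1983):G:F684632",
    "F1_s": "00000001:Affaires traitées par M. Pourtal:G:C122305060",
    "F2_s": "00000010:Dossiers n° 1300 à 1469.:G:C122306134",
}
print(build_text(sample))
```

Note that in this sample `titleProper` and `F0_s` carry the same title, so the combined text repeats it; whether to deduplicate is a design choice left open here.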
@kerstarno @stefanpapp-ape hi! Yes, we have `scopeContent` from time to time. Below is a quick overview of how many entries have `scopeContent` in each .json file, and what percentage of that file's entries this represents:

| File | Entries with scopeContent | Share of file's entries |
| --- | --- | --- |
| economics.json | 9,598 | 6.66% |
| germanDemocraticRepublic.json | 55,104 | 47.03% |
| slavery.json | 657 | 85.88% |
| frenchNapoleonI.json | 4 | 66.67% |
| catholicism.json | 348 | 23.22% |
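Here are some stats on the `langMaterial` field:

| File | Entries with langMaterial | Share of file's entries |
| --- | --- | --- |
| economics.json | 5,060 | 3.51% |
| germanDemocraticRepublic.json | 95,455 | 81.46% |
| slavery.json | 0 | 0.0% |
| frenchNapoleonI.json | 0 | 0.0% |
| catholicism.json | 0 | 0.0% |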
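Per-file coverage figures of this kind could be computed along the following lines. This is only a sketch: the helper name `field_coverage` is hypothetical, and it assumes each record is a dict carrying a `filename` key, which mirrors the dataframe columns mentioned in this thread rather than the actual loading code.

```python
from collections import defaultdict


def field_coverage(records, field):
    """Return {filename: (count_with_field, percentage_of_file)} for a given field."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for record in records:
        name = record["filename"]
        totals[name] += 1
        # Missing keys, None and empty strings all count as "not present".
        if record.get(field):
            filled[name] += 1
    return {
        name: (filled[name], round(100 * filled[name] / totals[name], 2))
        for name in totals
    }


# Toy records for illustration only, not real data from the files.
toy_records = [
    {"filename": "slavery.json", "scopeContent": "some note"},
    {"filename": "slavery.json", "scopeContent": "another note"},
    {"filename": "slavery.json"},
    {"filename": "economics.json", "scopeContent": ""},
]
print(field_coverage(toy_records, "scopeContent"))
```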
It might be easier to just detect the language from the text.
We currently have 263,482 elements in a dataframe structured like this:
["id","langMaterial","unitTitle","titleProper","scopeContent","topic","filename"]
Of them, 65,679 have a `scopeContent` (which is the main element for topic detection), with this distribution:
| File | Entries with scopeContent |
| --- | --- |
| germanDemocraticRepublic.json | 55,073 |
| economics.json | 9,596 |
| slavery.json | 657 |
| catholicism.json | 348 |
| frenchNapoleonI.json | 4 |
I'll start with these, using an unsupervised approach because the topics are highly imbalanced (see #2).
Hi @fedenanni
Thanks for the statistics. Very interesting. I agree with your evaluation regarding `langMaterial`. It's a tricky aspect of archival description: more often than not, archival institutions mainly hold material written in the official language(s) of their countries and also describe that material in the official language(s) of their countries. Hence it's rare to find the language of the material specified as part of the description.
I'm assuming that the relatively high percentage for the GDR topic stems from the fact that quite a lot of material in Russian might be included. Or it could be related to a slightly different description tradition in the institutions where these materials were originally held.
It's a pity that there's a relatively low percentage of elements with `scopeContent` in this first dataset - roughly 25% if I'm calculating correctly.
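As a quick check of that figure, using the two totals reported earlier in the thread (65,679 elements with `scopeContent` out of 263,482 elements overall):

```python
# Totals as reported earlier in this thread.
total_elements = 263_482
with_scope_content = 65_679

share = 100 * with_scope_content / total_elements
print(f"{share:.2f}%")  # 24.93%
```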
Might it be useful to identify a topic that has a higher percentage of `scopeContent` notes? Not sure how easy this is to determine, @stefanpapp-ape?
Hi! I am using a simple Python library to check the language of elements that do not have `langMaterial`, by concatenating the titles and the `scopeContent`. Here are the results:
[('de', 55,079), ('fr', 10,581), ('en', 12), ('it', 5)]
I could then start only with word embeddings in French and German for the moment.
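The thread does not name the library used, so the following is only a stand-in sketch of the same workflow: concatenate the title fields and `scopeContent`, guess a language, and count the results for records without `langMaterial`. The stopword-overlap heuristic, the tiny word lists, and the function names are all assumptions made for illustration; a dedicated language-identification library would be far more reliable in practice.

```python
from collections import Counter

# Deliberately tiny, illustrative stopword lists (assumption, not a real resource).
STOPWORDS = {
    "de": {"der", "die", "das", "und", "von", "zur", "für", "mit", "im"},
    "fr": {"le", "la", "les", "du", "et", "à", "pour", "par", "dans"},
    "en": {"the", "of", "and", "for", "with", "in", "to", "by"},
    "it": {"il", "di", "della", "per", "con", "gli", "del"},
}


def record_text(row):
    """Concatenate the titles and the scope note of one record."""
    keys = ("unitTitle", "titleProper", "scopeContent")
    return " ".join(str(row.get(key) or "") for key in keys)


def guess_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def language_counts(rows):
    """Count guessed languages for rows that lack langMaterial."""
    counts = Counter()
    for row in rows:
        if row.get("langMaterial"):
            continue  # language already recorded in the metadata
        lang = guess_language(record_text(row))
        if lang is not None:
            counts[lang] += 1
    return counts.most_common()


# Toy rows for illustration only.
toy_rows = [
    {"unitTitle": "Akten der Abteilung für Kultur und Bildung"},
    {"unitTitle": "Berichte und Protokolle der Kommission"},
    {"unitTitle": "Affaire Gigante : propriété, quartier du Défends à Aix."},
    {"unitTitle": "Dossier du personnel", "langMaterial": "fre"},
]
print(language_counts(toy_rows))
```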
Quick overview of each json file to understand basic structure