Open ale0xb opened 6 years ago
I toyed with the data and ES ingestion over the last few days, so here are some comments:
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "location_as_geo_shape": {
            "match_mapping_type": "object",
            "match": "location",
            "mapping": {
              "type": "geo_shape"
            }
          }
        },
        {
          "locationCenter_as_geo_point": {
            "match_mapping_type": "string",
            "match": "locationCenter",
            "mapping": {
              "type": "geo_point"
            }
          }
        },
        {
          "listPlace_as_nested": {
            "match_mapping_type": "object",
            "match": "listPlace",
            "mapping": {
              "type": "nested"
            }
          }
        }
      ]
    }
  }
}
Elasticsearch mappings dictate how the data is ingested and subsequently indexed by the search engine. They need to be designed carefully so that the index supports the kind of analysis we want to do here.
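To make the effect of the dynamic templates in the mapping above more concrete, here is a rough Python sketch of how a field name and value get resolved to a mapping type. It is a simplified model, not Elasticsearch's actual implementation: it assumes only that `match` is tested against the field name, that `match_mapping_type` is tested against the detected JSON type, and that the first matching template wins. The helper names `detected_type` and `resolve_mapping` are made up for this illustration.

```python
from fnmatch import fnmatchcase

# Simplified copy of the dynamic templates from the mapping above:
# (template name, match_mapping_type, match pattern, resulting type).
TEMPLATES = [
    ("location_as_geo_shape", "object", "location", "geo_shape"),
    ("locationCenter_as_geo_point", "string", "locationCenter", "geo_point"),
    ("listPlace_as_nested", "object", "listPlace", "nested"),
]


def detected_type(value):
    """Rough stand-in for the JSON type detected for a value."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list) and value:
        # Arrays take the type of their first element in this sketch.
        return detected_type(value[0])
    return None


def resolve_mapping(field_name, value):
    """Return the type of the first matching template, or None."""
    kind = detected_type(value)
    for _name, mapping_type, pattern, es_type in TEMPLATES:
        if kind == mapping_type and fnmatchcase(field_name, pattern):
            return es_type
    return None  # would fall back to the default dynamic mapping


print(resolve_mapping("location", {"type": "Point", "coordinates": [0, 0]}))
print(resolve_mapping("locationCenter", "0.0,0.0"))
print(resolve_mapping("listPlace", [{"name": "Oberkappel"}]))
```

One consequence worth noting: with these templates, `locationCenter` only becomes a `geo_point` if it arrives as a string (for example `"lat,lon"`); sending it as an object would not match the second template.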
In the following table I list the different attributes from the TEI and MySQL data sets that should be part of the index.
From TEI-XML-2018
General attributes
<form type="hauptlemma"><orth>Tina</orth></form> <orth type="normalized">Stadeltor</orth>
<gramGrp><pos>Subst</pos></gramGrp>
<pron notation="tustep">tinás</pron>
<sense corresp="this:LT1"><def xml:lang="de">großes, stehendes Faß</def></sense>
<ref type="quelle">Oberkappel/ Lemb.OÖ Gabriel<certainty assertedValue="uncertain" locus="value"/></ref>
<ref type="quelleBearbeitet">{X} Etym. SCHNEIDER· (1963)</ref>
quelle (*): Two or more mapping types mean that the original data is split into several fields.
Geo attributes
Effective spatial localization of the TEI-XML-2018 sources can be achieved with the geo_point and geo_shape mapping types in ES. Both are created from GeoJSON data, which is not directly available in the XML files, so an intermediate step is required to produce it from the usg-geo TEI tags (the process is roughly described in #23).
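As a minimal sketch of what the last part of that intermediate step could look like, the Python below takes a GeoJSON geometry that has already been looked up for a place and derives the two geo fields used in the mapping. It does not reproduce the process from #23; the function names `bbox_center` and `geo_fields` are hypothetical, using the bounding-box centre as `locationCenter` is an assumption, and the coordinates are illustrative rather than real data. The main point is the axis order: GeoJSON stores `[lon, lat]`, while a geo_point string is `"lat,lon"`.

```python
def bbox_center(geometry):
    """Centre of the bounding box of a GeoJSON Point or Polygon, as (lat, lon)."""
    if geometry["type"] == "Point":
        lon, lat = geometry["coordinates"][:2]  # GeoJSON order is lon, lat
        return lat, lon
    if geometry["type"] == "Polygon":
        ring = geometry["coordinates"][0]  # exterior ring
        lons = [point[0] for point in ring]
        lats = [point[1] for point in ring]
        return (min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2
    raise ValueError(f"unsupported geometry type: {geometry['type']}")


def geo_fields(geometry):
    """Build the `location` / `locationCenter` fields for one document."""
    lat, lon = bbox_center(geometry)
    return {
        "location": geometry,              # matched by the geo_shape template
        "locationCenter": f"{lat},{lon}",  # "lat,lon" string for geo_point
    }


# Illustrative square, not real data.
square = {
    "type": "Polygon",
    "coordinates": [[[13.0, 48.0], [14.0, 48.0], [14.0, 49.0],
                     [13.0, 49.0], [13.0, 48.0]]],
}
print(geo_fields(square)["locationCenter"])
```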
Data describing the oeaw-specific geo-aware hierarchy can be found in some XML entries and, when available, looks like this:
The challenge here is to capture this hierarchy in the final ES documents without compromising accuracy and/or performance of the final visualizations. There are 5 hierarchy levels as seen in the previous excerpt:
I think the correct approach here is to denormalize this geographical hierarchy in the different documents, producing something like this:
This would allow fast spatial querying in many ways, although I expect the index to take up quite a lot of space (several GB).
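For illustration only, here is a small Python sketch of the denormalization idea, assuming the field names from the mapping (`location`, `locationCenter`, `listPlace`). It is not the example from this issue: the keys inside each `listPlace` item (`level`, `name`), the level labels, the `id` and `lemma` fields, and the coordinates are all hypothetical. Each document carries its whole place hierarchy, so a query never needs to join against a separate places index.

```python
def denormalize(entry_id, lemma, places):
    """Build one flat document that carries the full place hierarchy.

    `places` is ordered from the broadest to the most specific place; each
    item is a dict with the hypothetical keys level, name, geometry, center.
    """
    if not places:
        raise ValueError("at least one place is required")
    most_specific = places[-1]
    return {
        "id": entry_id,
        "lemma": lemma,
        "location": most_specific["geometry"],      # geo_shape
        "locationCenter": most_specific["center"],  # geo_point ("lat,lon")
        "listPlace": [                              # nested
            {"level": place["level"], "name": place["name"]}
            for place in places
        ],
    }


def place_query(name):
    """Nested query body matching documents located in the named place."""
    return {
        "query": {
            "nested": {
                "path": "listPlace",
                "query": {"match": {"listPlace.name": name}},
            }
        }
    }


# Illustrative values; level labels and coordinates are made up.
doc = denormalize(
    "entry-1",
    "Tina",
    [
        {"level": "region", "name": "Example region",
         "geometry": {"type": "Point", "coordinates": [13.5, 48.5]},
         "center": "48.5,13.5"},
        {"level": "place", "name": "Oberkappel",
         "geometry": {"type": "Point", "coordinates": [13.8, 48.6]},
         "center": "48.6,13.8"},
    ],
)
print([place["name"] for place in doc["listPlace"]])
```

The trade-off is the one mentioned above: repeating the hierarchy in every document makes queries simple and fast, at the cost of a larger index.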
Some questions are not covered in this issue, in particular which other attributes (authors, etc.) could be included in the index. We need to address them before moving forward.
Edit: Added quelle as per @amelieacdh's request