acdh-oeaw / wugsy

Crowdsourcing language data
MIT License

ES index and mapping #25

Open ale0xb opened 6 years ago

ale0xb commented 6 years ago

Elasticsearch mappings dictate how the data is ingested and subsequently indexed by the search engine. They need to be designed carefully to support the kinds of analysis we want to perform here.

In the following table I list the different attributes from the TEI and MySQL data sets that should be part of the index.

From TEI-XML-2018

General attributes

| name | original type | mapping datatype* | Example |
| --- | --- | --- | --- |
| main lemma (hauptlemma) | single (single-word) | text (analyzed) | `<form type="hauptlemma"><orth>Tina</orth></form> <orth type="normalized">Stadeltor</orth>` |
| part of speech | string (single-word, categorical) | keyword | `<gramGrp><pos>Subst</pos></gramGrp>` |
| pronunciation (lautung) | string (single) | keyword | `<pron notation="tustep">tinás</pron>` |
| sense (meaning) | string (multi-word) | keyword | `<sense corresp="this:LT1"><def xml:lang="de">großes, stehendes Faß</def></sense>` |
| source (quelle) | string (multi-word) | keyword | `<ref type="quelle">Oberkappel/ Lemb.OÖ Gabriel<certainty assertedValue="uncertain" locus="value"/></ref>` |
| revised source (quelleBearbeitet) | string (multi-word) | keyword, date (when available) | `<ref type="quelleBearbeitet">{X} Etym. SCHNEIDER· (1963)</ref>` |

(*): Two or more mapping datatypes mean the original data is split into several fields.
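For illustration, the table above could translate into an explicit mapping fragment roughly like the following (ES 6.x syntax). The field names are my guesses, not decided yet; the multi-field on quelleBearbeitet shows one way to realize the "split into several fields" footnote:

```json
{
  "mappings": {
    "_doc": {
      "properties": {
        "hauptlemma": { "type": "text" },
        "pos": { "type": "keyword" },
        "lautung": { "type": "keyword" },
        "sense": { "type": "keyword" },
        "quelle": { "type": "keyword" },
        "quelleBearbeitet": {
          "type": "keyword",
          "fields": {
            "date": { "type": "date", "ignore_malformed": true }
          }
        }
      }
    }
  }
}
```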

Geo attributes

Effective spatial localization of the TEI-XML-2018 sources can be achieved with the geo_point and geo_shape ES mapping datatypes. Both are created from GeoJSON data that is not directly available in the XML files, so an extra intermediate step is required to produce it from the usg-geo TEI tags (the process is roughly described in #23).

Data describing the OeAW-specific geo-aware hierarchy can be found under some XML entries and, when available, looks like this:

<usg type="geo" corresp="this:LT1">
  <placeName type="orig">Marling</placeName>
  <listPlace ref="sigle:1A.1h03">
    <place type="Bundesland">
      <placeName>STir.</placeName>
      <idno>1A</idno>
      <listPlace>
        <place type="Großregion">
          <placeName>wSTir.</placeName>
          <idno>1A.1</idno>
          <listPlace>
            <place type="Kleinregion">
              <placeName>Umg.Meran</placeName>
              <idno>1A.1h</idno>
              <listPlace>
                <place type="Gemeinde">
                  <placeName>Marling</placeName>
                  <idno/>
                  <listPlace>
                    <place type="Ort">
                      <placeName>Marling, Marlengo</placeName>
                      <idno>1A.1h03</idno>
                    </place>
                  </listPlace>
                </place>
              </listPlace>
            </place>
          </listPlace>
        </place>
      </listPlace>
    </place>
  </listPlace>
</usg>
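A minimal sketch of how the nested listPlace hierarchy could be flattened into one field per level before indexing, using only the standard library. The tag names come from the excerpt above (abbreviated here to two levels); the function name and output layout are hypothetical, and TEI namespaces are ignored for simplicity:

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the <usg type="geo"> excerpt above
USG = """<usg type="geo" corresp="this:LT1">
  <placeName type="orig">Marling</placeName>
  <listPlace ref="sigle:1A.1h03">
    <place type="Bundesland">
      <placeName>STir.</placeName>
      <idno>1A</idno>
      <listPlace>
        <place type="Ort">
          <placeName>Marling, Marlengo</placeName>
          <idno>1A.1h03</idno>
        </place>
      </listPlace>
    </place>
  </listPlace>
</usg>"""

def flatten_places(usg_xml):
    """Walk all nested <place> elements and emit {level: {placeName, idno}}."""
    root = ET.fromstring(usg_xml)
    result = {}
    for place in root.iter("place"):
        level = place.get("type", "").lower()  # e.g. "bundesland", "ort"
        result[level] = {
            "placeName": place.findtext("placeName"),
            "idno": place.findtext("idno"),
        }
    return result

print(flatten_places(USG))
```

Each level would then be joined against the MySQL look-up tables to attach the actual GeoJSON shape.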

The challenge here is to capture this hierarchy in the final ES documents without compromising accuracy and/or the performance of the final visualizations. There are five hierarchy levels, as seen in the excerpt above:

| level name | nature | mapping datatype | expected mapping attribute name | MySQL look-up table(s) |
| --- | --- | --- | --- | --- |
| Bundesland (state) | polygon | geo_shape | bundesland | region, GISregion |
| Grossregion (big region) | polygon | geo_shape | grossregion | region, GISregion |
| Kleinregion (small region) | polygon | geo_shape | kleinregion | region, GISregion |
| Gemeinde (municipality) | polygon | geo_shape | gemeinde | gemeinde, GISgemeinde |
| Ort (place) | point | geo_shape | ort | ort, GISort |

I think the correct approach here is to denormalize this geographical hierarchy in the different documents, producing something like this:

    "_doc": {
      "properties": {
        [...]
        "bundesland": {
          "type": "geo_shape",
          "tree": "quadtree",
          "precision": "1km"
        },
        "grossregion": {
          "type": "geo_shape",
          "tree": "quadtree",
          "precision": "500m"
        },
        "kleinregion": {
          "type": "geo_shape",
          "tree": "quadtree",
          "precision": "50m"
        },
        "gemeinde": {
          "type": "geo_shape",
          "tree": "quadtree",
          "precision": "10m"
        },
        "ort": {
          "type": "geo_shape",
          "tree": "quadtree",
          "points_only": true,
          "precision": "1m"
        }
      }
    }

This would allow fast spatial querying in many ways, although I expect the index to take up quite a lot of space (several GB).
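As an example of what such querying could look like, a geo_shape filter against the denormalized gemeinde field might be (the envelope coordinates are made up for illustration):

```json
{
  "query": {
    "bool": {
      "filter": {
        "geo_shape": {
          "gemeinde": {
            "shape": {
              "type": "envelope",
              "coordinates": [[10.9, 46.7], [11.2, 46.5]]
            },
            "relation": "intersects"
          }
        }
      }
    }
  }
}
```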

There are further questions, not covered in this issue, about other attributes that could be included in the index (authors, etc.) that we need to address before moving forward.

Edit: Added quelle as per @amelieacdh's request

simar0at commented 6 years ago

I toyed with the data and ES ingestion in the last few days, so here are some comments:

  1. The mapping for the listPlace tree is a bit too much, I think: when we have shape data (and we do have some open-data shapes for Austria and South Tyrol), I don't see why we would want to store the bigger regions for each entry. I would settle for the most detailed one. If someone selects everything within some shape (e.g. a Großregion), that selects all the data annotated with a shape within that region: every Gemeinde, Ort, Kleinregion and Großregion. That is of course not true for points.
  2. Having geo_points in ES is a good idea because Kibana can currently do interesting things with points but has no visualizations for shapes. I just used a centroid calculation on the shapes I got from my open data. With geo_points we need different layers, so that a point corresponding to a Großregion is not selected by accident merely because it happens to lie within some shape that is part of a query.
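The centroid step can be sketched with the standard shoelace-based formula; this is a minimal planar version with no spherical correction, which is acceptable for small regions (for real GIS data a library such as Shapely would be the safer choice):

```python
def polygon_centroid(ring):
    """Centroid of a simple (non-self-intersecting) polygon ring.

    `ring` is a list of (lon, lat) pairs; the first point need not be
    repeated at the end. Coordinates are treated as planar.
    """
    a = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return (cx / (6 * a), cy / (6 * a))

# Unit square: centroid is (0.5, 0.5)
print(polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))
```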
  3. I did not want to reassess the data model we use for the entries; I assumed the TEI dictionary is good enough for now. I just used a library to convert XML to JSON according to the BadgerFish convention. BadgerFish is problematic when it comes to mixed content, so such content is duplicated as a flattened string.
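For reference, a very simplified BadgerFish-style conversion looks like this (attributes become `@` keys, text becomes `$`, repeated children become lists); this sketch ignores namespaces and mixed content, which is exactly where BadgerFish gets awkward:

```python
import xml.etree.ElementTree as ET

def badgerfish(elem):
    """Simplified BadgerFish-style XML-to-dict conversion:
    attributes become "@attr" keys, element text becomes "$",
    repeated child elements are collected into lists."""
    out = {f"@{k}": v for k, v in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        out["$"] = text
    for child in elem:
        value = badgerfish(child)
        if child.tag in out:
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out

doc = ET.fromstring('<pron notation="tustep">tinás</pron>')
print(badgerfish(doc))  # {'@notation': 'tustep', '$': 'tinás'}
```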
  4. I have no hands-on experience with ES, so I just let it decide what mappings to use based on the ingested data. Is it really necessary to specify this in such detail in advance? My solution is just three dynamic templates:

    {
      "mappings": {
        "_doc": {
          "dynamic_templates": [
            {
              "location_as_geo_shape": {
                "match_mapping_type": "object",
                "match": "location",
                "mapping": {
                  "type": "geo_shape"
                }
              }
            },
            {
              "locationCenter_as_geo_point": {
                "match_mapping_type": "string",
                "match": "locationCenter",
                "mapping": {
                  "type": "geo_point"
                }
              }
            },
            {
              "listPlace_as_nested": {
                "match_mapping_type": "object",
                "match": "listPlace",
                "mapping": {
                  "type": "nested"
                }
              }
            }
          ]
        }
      }
    }
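To illustrate what these templates do, a made-up document like the following would get its `location` object mapped as geo_shape, its `locationCenter` string (in the "lat,lon" geo_point string format) mapped as geo_point, and its `listPlace` objects mapped as nested; all field contents here are invented:

```json
{
  "form": { "orth": { "$": "Tina" } },
  "location": {
    "type": "polygon",
    "coordinates": [[[11.33, 46.65], [11.35, 46.65], [11.35, 46.66], [11.33, 46.65]]]
  },
  "locationCenter": "46.65,11.34",
  "listPlace": [
    { "place": { "@type": "Ort", "placeName": { "$": "Marling, Marlengo" } } }
  ]
}
```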