acdh-oeaw / wugsy

Crowdsourcing language data
MIT License

Import geo-aware TEI dataset in ES, link to questionnaire data and augment with spatial shapes #23

Open ale0xb opened 6 years ago

ale0xb commented 6 years ago

This is step 2 of #19.

I'm opening this issue to track the progress of combining these two datasets (plus possibly a third one providing the geo-shapes) into a new one, which in turn will serve as an entry point for more sophisticated data visualisations.

Geo-aware hierarchy

In the past there was no way to link the dictionary sources to places on the map. However, recent progress in the curation of the dboe dataset has produced a geo-aware hierarchy of sources based on alphanumeric strings ("sigles") present in the original data. An example hierarchy is as follows:

<listPlace xml:id="sigle:1A.3c01">
    <place type="Bundesland">
      <placeName>STir.</placeName>
      <idno>1A</idno>
      <listPlace>
        <place type="Großregion">
          <placeName>mSTir.</placeName>
          <idno>1A.21A.3</idno>
          <listPlace>
            <place type="Kleinregion">
              <placeName>Jaufengeb.</placeName>
              <idno>1A.3c</idno>
              <listPlace>
                <place type="Gemeinde">
                  <placeName>Ratschings</placeName>
                  <idno/>
                  <listPlace>
                    <place type="Ort">
                      <placeName>Ratschings, Racines</placeName>
                      <idno>1A.3c01</idno>
                    </place>
                  </listPlace>
                </place>
              </listPlace>
            </place>
          </listPlace>
        </place>
      </listPlace>
    </place>
</listPlace>

In the excerpt a top-down hierarchy of places related to the sigle "1A.3c01" is presented. For example the place this sigle refers to can be found under the last placeName tag, Ratschings/Racines, a municipality in South Tyrol, Italy.

The more general administrative divisions are found in the enclosing tags, each one carrying a shorter idno (sigle) prefix as you go up the tree.
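
To make the prefix relationship concrete, here is a minimal sketch that walks such a listPlace from the top down and collects one entry per level. It assumes the elements carry no TEI namespace (as in the excerpt), uses a trimmed copy of the excerpt with some intermediate levels left out, and `place_chain` is a hypothetical helper name, not something that exists in the repo.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt (some intermediate levels omitted for brevity).
TEI = """
<listPlace xml:id="sigle:1A.3c01">
  <place type="Bundesland">
    <placeName>STir.</placeName>
    <idno>1A</idno>
    <listPlace>
      <place type="Kleinregion">
        <placeName>Jaufengeb.</placeName>
        <idno>1A.3c</idno>
        <listPlace>
          <place type="Ort">
            <placeName>Ratschings, Racines</placeName>
            <idno>1A.3c01</idno>
          </place>
        </listPlace>
      </place>
    </listPlace>
  </place>
</listPlace>
"""


def place_chain(list_place):
    """Return (type, name, sigle) tuples, most general place first."""
    chain = []
    node = list_place.find("place")
    while node is not None:
        # An empty <idno/> yields "", which is normalised to None here.
        sigle = node.findtext("idno") or None
        chain.append((node.get("type"), node.findtext("placeName"), sigle))
        nested = node.find("listPlace")
        node = nested.find("place") if nested is not None else None
    return chain


chain = place_chain(ET.fromstring(TEI))
most_specific = chain[-1][2]
# Every sigle higher up the tree should be a prefix of the most specific one.
all_prefixes = all(most_specific.startswith(s) for _, _, s in chain if s)
```

Levels with an empty idno (like the Gemeinde in the full excerpt) come back with `None` and are skipped by the prefix check.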

This information can be found under the "usg" tag on each record. I have been told that, in theory, it could be augmented with actual geographic information extracted from MySQL and indexed using ES geo-shapes.
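
As a rough idea of what the ES side could look like, here is a sketch of an index mapping with a `geo_shape` field plus a helper that builds one document per place. The field names (`sigle`, `place_name`, `shape`) and the helper `build_place_doc` are assumptions for illustration only, the typeless mapping layout may need adapting to the ES version in use, and the polygon is a placeholder, not real geometry for any place.

```python
# Hypothetical index mapping: field names are illustrative, not taken from the project.
PLACE_MAPPING = {
    "mappings": {
        "properties": {
            "sigle": {"type": "keyword"},      # exact-match lookups by sigle
            "place_name": {"type": "text"},
            "shape": {"type": "geo_shape"},    # accepts GeoJSON geometries
        }
    }
}


def build_place_doc(sigle, place_name, geometry):
    """Build one ES document; `geometry` is a GeoJSON geometry dict."""
    if geometry.get("type") not in {"Point", "Polygon", "MultiPolygon"}:
        raise ValueError("unsupported geometry type: %r" % geometry.get("type"))
    return {"sigle": sigle, "place_name": place_name, "shape": geometry}


# Placeholder square, NOT the real outline of Ratschings.
placeholder_polygon = {
    "type": "Polygon",
    "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
}
doc = build_place_doc("1A.3c01", "Ratschings, Racines", placeholder_polygon)
```

The real geometries would have to come from the MySQL data mentioned above; nothing here says how they are stored there.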

Link with questionnaire data

Linking with the questionnaire data is a somewhat more cumbersome process, and not all TEI records will correspond directly to a question. There are ongoing efforts to migrate the rest of these questionnaires, but that hasn't happened yet. Maybe someone who knows this stuff can look into this @amelieacdh?

For example, we can find tags like the following (from db182_qdb-TEI-02.xml):

<ref type="fragebogenNummer">60F55: Holzstoß (Zain); Pl./Dem.; Füg./Ra.</ref>

(from I537_qdb-TEI-02.xml)

<ref type="fragebogenNummer">86B26 (I,II): Lavendel = Lavandula angustifolia</ref>

which can be mapped respectively to questions (inside frage-fragebogen-full-tgd01.xml):

<item n="F55" xml:id="Fb-60-F55">
   <label>Waldarbeit</label>
   <pc>:</pc>
   <seg ana="question">
      <seg xml:id="d1e223582">Holzstoß</seg>
      <pc>(</pc>
      <seg xml:id="d1e223586">Zain</seg>
      <pc>)</pc>
      <pc>;</pc>
      <seg xml:id="d1e223592">Pl.</seg>
      <pc>/</pc>
      <seg xml:id="d1e223596">Dem.</seg>
      <pc>;</pc>
      <seg xml:id="d1e223600">Ra.</seg>
   </seg>
</item>

and

<item n="B26" xml:id="Fb-86-B26">
   <label>Gartenblume</label>
   <pc>:</pc>
   <seg ana="question">
      <seg xml:id="d1e368579">Lavendel</seg>
   </seg>
</item>
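
Going by the two examples above, the leading code of a fragebogenNummer ref seems to map onto the item's xml:id as `Fb-<questionnaire>-<question>`. A minimal sketch of that mapping follows. The pattern is inferred from these two refs only, so other questionnaires may not follow it, and `question_id` is a hypothetical helper name.

```python
import re

# Leading digits = questionnaire number, then one letter plus digits = question number.
# Inferred from "60F55: ..." and "86B26 (I,II): ..." only.
REF_PATTERN = re.compile(r"^(?P<questionnaire>\d+)(?P<question>[A-Z]\d+)")


def question_id(ref_text):
    """Map the text of a fragebogenNummer ref to an item xml:id, or None if it does not fit."""
    match = REF_PATTERN.match(ref_text.strip())
    if match is None:
        return None
    return "Fb-{}-{}".format(match["questionnaire"], match["question"])


first = question_id("60F55: Holzstoß (Zain); Pl./Dem.; Füg./Ra.")
second = question_id("86B26 (I,II): Lavendel = Lavandula angustifolia")
```

Refs that do not start with such a code return `None`, which is one way to spot the TEI records that have no direct correspondence to a question.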

Sum up

As I see it, in order to have all of this working we need:

I will open sub-issues from here so we have everything nice and tidy @interrogator.

ameliedorn commented 6 years ago

Currently, frage-fragebogen-full-tgd01.xml contains only the systematic questionnaires. The questions from the other questionnaires (EFb and MüWi) are currently available in the MySQL database. If needed for datavis, they'd need to be included in frage-fragebogen-full-tgd01.xml. Can we have a look at this @interrogator and @ale0xb?

interrogator commented 6 years ago

Wednesday I'll have a lot of time to get into this. Will you guys be around that day (at least on Slack?)

ameliedorn commented 6 years ago

Wednesday morning I can't, but I'm available all afternoon after 2pm.

ale0xb commented 6 years ago

It should be doable then to extract the missing questions from MySQL and run a simple tokenizer / concept extractor to make them available in the same format as the others in frage-fragebogen-full-tgd01.xml. @amelieacdh @interrogator Let's discuss this on Wednesday afternoon (4pm).
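
A very rough sketch of what such a tokenizer could look like: it splits a question string into punctuation and word tokens and wraps them in `pc` / `seg` elements like the ones in the existing items. This is only an assumption about the approach, not the pipeline that produced frage-fragebogen-full-tgd01.xml. It does not generate the xml:id attributes on the inner segs, and it splits on whitespace, so multi-word terms would need extra handling. `question_to_seg` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

PUNCTUATION = set("();/")
# Either a single punctuation character, or a run of anything else up to whitespace.
TOKEN = re.compile(r"[();/]|[^();/\s]+")


def question_to_seg(text):
    """Wrap the tokens of a question string in <pc>/<seg> children of a <seg ana="question">."""
    outer = ET.Element("seg", {"ana": "question"})
    for token in TOKEN.findall(text):
        tag = "pc" if token in PUNCTUATION else "seg"
        ET.SubElement(outer, tag).text = token
    return outer


element = question_to_seg("Holzstoß (Zain); Pl./Dem.; Ra.")
tokens = [(child.tag, child.text) for child in element]
xml_string = ET.tostring(element, encoding="unicode")
```

For the string above this yields the same seg/pc sequence as the Fb-60-F55 item, minus the generated ids.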

interrogator commented 6 years ago

Sounds good to me!

ameliedorn commented 6 years ago

Sounds good, chat then!

ameliedorn commented 6 years ago

@interrogator today after 1pm we will go through this and related issues with @ale0xb and myself - could you please join us?

interrogator commented 6 years ago

hey, yep, i’m coming in this afternoon. i have an appointment at 12 though, so i might arrive at öaw a bit after 13:00 sorry


ameliedorn commented 6 years ago

great, thanks! talk to you then!

simar0at commented 6 years ago

In the excerpt a top-down hierarchy of places related to the sigle "1A.3c01" is presented. For example the place this sigle refers to can be found under the last placeName tag, Ratschings/Racines, a municipality in South Tyrol, Italy.

Please rather use the @ref/@xml:id attribute at the root listPlace of the structure.
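
Reading the sigle from the root attribute could look roughly like this. The sketch only handles the `xml:id` form with the `sigle:` prefix shown in the excerpt (the `ref` variant is not covered), and `root_sigle` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
PREFIX = "sigle:"


def root_sigle(list_place_xml):
    """Return the sigle stored in the root listPlace's xml:id, or None if absent."""
    root = ET.fromstring(list_place_xml)
    value = root.get(XML_ID, "")
    return value[len(PREFIX):] if value.startswith(PREFIX) else None


sigle = root_sigle('<listPlace xml:id="sigle:1A.3c01"><place type="Ort"/></listPlace>')
```

This avoids depending on which placeName happens to be nested last.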