acdh-oeaw / wugsy

Crowdsourcing language data
MIT License

Semantics for Topothek #16

Open TPalfinger opened 6 years ago

TPalfinger commented 6 years ago

As we aim for a cooperation with the Topothek, we need to tackle some questions. The cooperation will be defined by technical aspects (automation vs. adding value to the collection with human labour), scientific aspects (how does the cooperation help science?) and societal aspects (what does each individual gain from this cooperation? How and why should human labour be used?). We should have answers to these for the presentation and to define the common goals.

General:

  1. What are we planning to do?
    a. What can science get from it?
      • formulate concrete research questions to show potential?
    b. What can they get from it? (some sort of thesaurus?)
  2. How do we want to integrate that into their platform?
    a. technical aspects
    b. human labour aspects
  3. What are the expectations in the long run?

Risks:

  1. Rejection because of potential new workload for members of Topothek
  2. Opposition to potential technical changes (persistence of IT)

presentation:

interrogator commented 6 years ago

Thanks for this write-up @TPalfinger!

I started putting in some code at nlp.py that is going to do some basic processing of the Kommentar field data. What I want it to do is to encourage people to submit free text with images (this adds semantic/discursive/emotional/cultural knowledge that can't be found in other ways). I want to then use this data to auto-generate tags and tag suggestions, minimising manual time spent on boring/automatable tasks.

The workflow:

So, for a given example:

By the end of April 1999, about 600,000 residents of Kosovo had become refugees

We would get some output like:

{
    "named_entities":
        {
            "time": ["April", "1999", "April 1999"],
            "location": ["Kosovo"]
        },
    "similar_words":
        {
            "migration": 8.532,
            "inhabitants": 7.241,
            "Balkans": 5.123
        },
    "hypernyms":
        {
            "refugees": ["migrants", "people", "entities"]
        }
}

This (tiny, hand-written) example could be used to auto-tag the locations and times, to suggest tags (with higher numbers meaning bolder/larger text), and to allow hovering over Kommentar words to get related terms.
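To make the intended use of that output a bit more concrete, here is a minimal sketch of how such a dict could be turned into auto-applied tags and sized tag suggestions. The function name `build_tags`, the `min_size`/`max_size` parameters and the returned keys are hypothetical, not something that exists in nlp.py; the scaling from score to text size is just one plausible choice.

```python
def build_tags(analysis, min_size=1.0, max_size=2.0):
    """Turn an analysis dict (shaped like the example) into tags.

    Hypothetical helper: the name, parameters and return shape are
    assumptions made for illustration only.
    """
    entities = analysis.get("named_entities", {})

    # Auto-tags: every recognised location and time expression, no duplicates.
    auto_tags = []
    for kind in ("location", "time"):
        for value in entities.get(kind, []):
            if value not in auto_tags:
                auto_tags.append(value)

    # Suggestions: scale similarity scores linearly into a text-size range,
    # so higher numbers give larger text.
    scores = analysis.get("similar_words", {})
    suggestions = []
    if scores:
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid dividing by zero if all scores match
        for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            size = min_size + (score - low) / span * (max_size - min_size)
            suggestions.append({"tag": word, "size": round(size, 2)})

    return {"auto_tags": auto_tags, "suggestions": suggestions}
```

With the example above, "Kosovo" and the three time expressions would be applied automatically, while "migration" would be offered as the largest suggestion and "Balkans" as the smallest.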

The longer the Kommentar, the better the result. Also, we could use all Kommentar fields combined as the reference corpus, which will iteratively improve results.
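One way to read "all Kommentar fields combined as the reference corpus" is a TF-IDF style weighting: words that are frequent in one Kommentar but rare across all the others score highest. The sketch below uses only the standard library; TF-IDF is swapped in here as an illustration, and `tokenize` and `keyword_scores` are hypothetical names, not part of nlp.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_scores(kommentar, reference_corpus):
    """Score each word of one Kommentar against all Kommentar texts.

    Illustrative TF-IDF with add-one smoothing; not the project's
    actual method.
    """
    docs = [set(tokenize(doc)) for doc in reference_corpus]
    n_docs = len(docs)
    counts = Counter(tokenize(kommentar))
    total = sum(counts.values())

    scores = {}
    for word, count in counts.items():
        # Number of reference texts that contain the word at least once.
        doc_freq = sum(1 for doc in docs if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = (count / total) * idf
    return scores
```

Because the reference corpus grows with every new Kommentar, the document frequencies get more reliable over time, which is one sense in which results could improve iteratively.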

TPalfinger commented 6 years ago

The Topothek has three rationales and one goal: The goal is to collect knowledge and material and make it accessible across disciplines. They want to achieve this through geographic(?) contextualization, tagging and dating.

The tagging rationale in particular may cause problems ("contextualization" seems to mean geographical rather than social contextualization): right now they explicitly train their members not to use sentences for the tagging task. This is part of their handbook, so our approach might go against their internal logic. Although we are targeting the commentary section, they might object that a lot of effort went into teaching people to use short terms instead of sentences, and that allowing the reverse (in theory, sentences used for tagging) would "ruin" their whole structure.

Commentary field: Right now the commentary field is used by the Topothek to give "a general description of the content of the picture and should reflect the peculiarity of it". This means instead of: "The city gate was first mentioned in a document in 1483. Max Mustermann is named as the builder in the cadastre of 1495..."

They want:

"The city gate before renovation in 1993."

in the commentary field. This might cause trouble for the "as much text as possible" maxim of our approach.

TPalfinger commented 6 years ago

Does this corpus make our life easier? NEGRA Corpus for parser models in German

interrogator commented 6 years ago

I think their current training approach is a reasonable one, and yeah, our approach obviously violates it. Their current approach, however, doesn't consider the idea of computer-assisted/augmented data entry, which is what we're proposing. Tagging is a lower-level task than describing and explaining, and thus is inherently the easier task for humans. If Topothek sees the value in natural language commentary in general (i.e. for qualitative humanities and social science research), and does not see it as a massively expensive/difficult thing to implement, then they should be able to get the best of both worlds.

Regarding NEGRA, I'm not sure if/how it will help, but I spent approx 1 year at Saarland Uni and can reach out personally if need be :)

interrogator commented 6 years ago

So now that we've had a discussion with them, it seems like there's some good and simple progress made here. They are open to the idea of user stories, so we can simply assume we take a paragraph of text and need to produce two analyses: one is word by word, and one is a model of the entire text.
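As a rough sketch of that interface, the function below takes a paragraph and returns both a word-by-word analysis and a small model of the whole text. Everything here is an assumption for illustration: the name `analyse`, the fields in each record, and the choice of simple counts as the "text model" stand in for whatever nlp.py ends up doing.

```python
import re
from collections import Counter


def analyse(paragraph):
    """Return a per-word analysis and a whole-text summary.

    Hypothetical interface; the real analyses would be richer
    (named entities, similar words, hypernyms).
    """
    # Word-by-word analysis: one record per token, with character offsets
    # so a front end could attach hover information to the right span.
    words = []
    for match in re.finditer(r"\w+", paragraph):
        token = match.group()
        words.append({
            "token": token,
            "start": match.start(),
            "end": match.end(),
            "is_numeric": token.isdigit(),
            "is_capitalised": token[0].isupper(),
        })

    # Whole-text model: here just simple frequency statistics.
    counts = Counter(w["token"].lower() for w in words)
    text_model = {
        "n_tokens": len(words),
        "n_types": len(counts),
        "most_common": counts.most_common(3),
    }
    return {"words": words, "text": text_model}
```

Keeping the two analyses under separate keys means the word-level records can drive per-word hovering while the text-level model feeds tag suggestions.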