Semantics for Topothek - Githubissues

TPalfinger commented 6 years ago

As we aim for a cooperation with the Topothek, we need to tackle some questions. The cooperation will be defined by technical aspects (automation vs. adding value to the collection with human labour), scientific aspects (how does the cooperation help science?) and societal aspects (what is the benefit of this cooperation for each individual? How and why should human labour be used?). We should have answers to these for the presentation and for defining the common goals.

General:

1. What are we planning to do?
   a. What can science get from it? (Formulate concrete research questions to show the potential?)
   b. What can they get from it? (Some sort of thesaurus?)
2. How do we want to integrate that into their platform?
   a. Technical aspects
   b. Human labour aspects
3. What are the expectations in the long run?

Risks:

Rejection because of potential new workload for members of Topothek
Opposition to potential technical changes (persistence of IT)

Presentation:

Explanation of the benefits (e.g. whole phrases will enhance scientific access to the database, but also the workflow at Topothek)
Focus on the value of human work and on removing unnecessary tasks through automation

interrogator commented 6 years ago

Thanks for this write-up @TPalfinger!

I started putting in some code at nlp.py that is going to do some basic processing of the Kommentar field data. The aim is to encourage people to submit free text with images (this adds semantic/discursive/emotional/cultural knowledge that can't be found in other ways). I then want to use this data to auto-generate tags and tag suggestions, minimising manual time spent on boring/automatable tasks.

The workflow:

Initialise a parser and word vector model for German (or English)
Run this parser over a given Kommentar field
For each content word (nouns, verbs, etc.), analyse for synonyms, hyponyms, collocates
Weight each analysis by its centrality in a dependency analysis of the sentence
Recognise named entity spans and named entity types
Combine the analysis, so that one Kommentar gets:
- A list of named entities and their types, which will be auto-added as tags
- A list of similar words and their similarity scores
- A list of hypernyms and hyponyms
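
The centrality weighting step in the workflow above could be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in nlp.py: the parse is hand-written as a list of head indices (in practice it would come from a dependency parser), and both the function name `depth_weights` and the formula 1 / (1 + depth) are assumptions made for this example.

```python
from typing import Dict, List


def depth_weights(tokens: List[str], heads: List[int]) -> Dict[str, float]:
    """Weight each token by how close it is to the root of the dependency tree.

    heads[i] is the index of token i's head; the root points to itself.
    Assumes a well-formed tree (no cycles) and unique token strings.
    The weight 1 / (1 + depth) is an illustrative choice, not a standard.
    """
    weights = {}
    for i, token in enumerate(tokens):
        depth = 0
        j = i
        while heads[j] != j:  # walk up until we reach the root
            j = heads[j]
            depth += 1
        weights[token] = 1.0 / (1 + depth)
    return weights


# Hypothetical, hand-written parse of "residents of Kosovo had become refugees":
# "become" is the root; "Kosovo" hangs off "of", which hangs off "residents".
tokens = ["residents", "of", "Kosovo", "had", "become", "refugees"]
heads = [4, 0, 1, 4, 4, 4]
weights = depth_weights(tokens, heads)
```

With this toy parse, the root verb gets the highest weight and deeply nested words get the lowest. Those weights could then be used to scale the similarity scores of each word's related terms.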

So, for a given example:

By the end of April 1999, about 600,000 residents of Kosovo had become refugees

We would get some output like:

{
    "named_entities": {
        "time": ["April", "1999", "April 1999"],
        "location": ["Kosovo"]
    },
    "similar_words": {
        "migration": 8.532,
        "inhabitants": 7.241,
        "Balkans": 5.123
    },
    "hypernyms": {
        "refugees": ["migrants", "people", "entities"]
    }
}

This (tiny, hand-written) example could be used to auto-tag the locations and times, to suggest tags (with higher numbers meaning bolder/larger text), and to allow hovering over Kommentar words to get related terms.
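
As a rough sketch of how that output might be consumed: the snippet below takes the hand-written example and derives the auto-added tags plus a font size per suggested tag. The function names and the 10 to 20 pt range are hypothetical choices for illustration, and the linear scaling is just one simple option.

```python
from typing import Dict, List, Tuple

# The hand-written example output from above.
analysis = {
    "named_entities": {
        "time": ["April", "1999", "April 1999"],
        "location": ["Kosovo"],
    },
    "similar_words": {"migration": 8.532, "inhabitants": 7.241, "Balkans": 5.123},
    "hypernyms": {"refugees": ["migrants", "people", "entities"]},
}


def auto_tags(analysis: dict) -> List[Tuple[str, str]]:
    """Return (entity_type, text) pairs that would be added as tags automatically."""
    return [
        (entity_type, text)
        for entity_type, texts in analysis["named_entities"].items()
        for text in texts
    ]


def suggested_tags(analysis: dict, min_pt: int = 10, max_pt: int = 20) -> Dict[str, int]:
    """Map each similar word to a font size; higher scores get larger text."""
    scores = analysis["similar_words"]
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {
        word: round(min_pt + (score - lo) / span * (max_pt - min_pt))
        for word, score in scores.items()
    }


tags = auto_tags(analysis)
sizes = suggested_tags(analysis)
```

Here "migration" would be rendered largest and "Balkans" smallest, with "inhabitants" in between.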

The longer the Kommentar, the better the result. Also, we could use all Kommentar fields combined as the reference corpus, which will iteratively improve results.

TPalfinger commented 6 years ago

The topothek has three rationales and one goal: The goal is to collect knowledge and material and make it accessible across disciplines. They want to achieve this through geographic(?) contextualization, tagging and dating.

The tagging rationale in particular may cause problems ("contextualization" seems to mean geographical, not social, contextualization): right now they explicitly train their members not to use sentences for the tagging task. This is part of their handbook, so our approach might go against their internal logic. Although we are targeting the commentary section, they might argue that a lot of effort went into teaching people to use short terms instead of sentences, and that allowing the reverse (theoretically, using sentences for tagging) would "ruin" their whole structure.

Commentary field: Right now the commentary field is used by the Topothek to give "a general description of the content of the picture and should reflect the peculiarity of it". This means instead of: "The city gate was first mentioned in a document in 1483. Max Mustermann is named as the builder in the cadastre of 1495..."

They want:

"The city gate before renovation in 1993."

in the commentary field. This might conflict with the "as much text as possible" maxim of our approach.

TPalfinger commented 6 years ago

Does this corpus make our life easier? NEGRA Corpus for parser models in German

interrogator commented 6 years ago

I think their current training approach is a reasonable one, and yeah, our approach obviously violates it. Their current approach, however, doesn't consider the idea of computer-assisted/augmented data entry, which is what we're proposing. Tagging is a lower-level task than describing and explaining, and thus is inherently the easier task for humans. If Topothek sees the value in natural language commentary in general (i.e. for qualitative humanities and social science research), and does not see it as a massively expensive/difficult thing to implement, then they should be able to get the best of both worlds.

Regarding NEGRA, I'm not sure if/how it will help, but I spent approx 1 year at Saarland Uni and can reach out personally if need be :)

interrogator commented 6 years ago

So now that we've had a discussion with them, it seems like some good and simple progress has been made here. They are open to the idea of user stories, so we can simply assume we take a paragraph of text and need to produce two analyses: one word by word, and one that is a model of the entire text.
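
One possible shape for those two analyses, purely as a sketch: the class names, fields and the score-summing aggregation below are assumptions for illustration, not something agreed with Topothek or present in nlp.py.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WordAnalysis:
    """Per-word result (hypothetical shape)."""
    word: str
    related: Dict[str, float] = field(default_factory=dict)  # related term -> score


@dataclass
class TextAnalysis:
    """Whole-text result (hypothetical shape), built from the per-word results."""
    words: List[WordAnalysis]

    def top_related(self, n: int = 3) -> List[str]:
        """Sum the scores of related terms across all words; return the top n terms."""
        totals: Counter = Counter()
        for w in self.words:
            totals.update(w.related)  # adds scores for terms seen more than once
        return [term for term, _ in totals.most_common(n)]


# Made-up scores, only to show the aggregation.
text = TextAnalysis(
    words=[
        WordAnalysis("refugees", {"migration": 8.5, "people": 2.0}),
        WordAnalysis("residents", {"inhabitants": 7.2, "people": 3.0}),
    ]
)
top = text.top_related(2)
```

The word-by-word objects could back the hover-over feature, while the aggregated view could feed the tag suggestions for the whole Kommentar.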

acdh-oeaw / wugsy

Semantics for Topothek #16