ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment

Bug with open / closed quotation marks #89

Open fedenanni opened 2 years ago

fedenanni commented 2 years ago

The Berliner Ensemble 'Coriolanus'

Coriolanus, von William Shakespeare in der Bearbeitung von Bertolt Brecht, 1964. Buhnenfassung des Berliner Ensemble. Musik: Paul Dellau. Regie: Manfred Wekwerth / Joachim Tenscher. Ausstattung: Karl v. Appen. Premiere: 24 September 1964.

Two volumes on Bertolt Brecht's adaptation of Coriolanus, including the text of the Brecht-Shakespeare play and a volume containing 500-600 captioned photographs of the play in performance, staged by the Berliner Ensemble in 1964.

fedenanni commented 2 years ago

It links to this one: https://en.wikipedia.org/wiki/Coriolanus%27_Coriolanus

fedenanni commented 2 years ago

@kerstarno The problem is that the interface treats the input as a single piece of text and ignores line breaks. If a "." is added after the title, the entities are extracted correctly:

The Berliner Ensemble 'Coriolanus'. Coriolanus, von William Shakespeare in der Bearbeitung von Bertolt Brecht, 1964. Buhnenfassung des Berliner Ensemble. Musik: Paul Dellau. Regie: Manfred Wekwerth / Joachim Tenscher. Ausstattung: Karl v. Appen. Premiere: 24 September 1964.
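The workaround above (adding a "." after the title) could in principle be applied automatically before the text reaches the entity extractor. The following is only a minimal sketch under that assumption: `join_lines_as_sentences` is a hypothetical helper, not a function from this repository, and it assumes that a line ending without sentence punctuation (optionally followed by a closing quote or bracket) should be treated as its own sentence.

```python
# Hypothetical pre-processing helper; not part of the Topic-Detection code base.
# Assumption: each non-empty input line (e.g. a title) is a separate sentence.

CLOSERS = "'\"\u2019\u201d\u00bb)"   # closing quotes and brackets that may follow the last word
END_PUNCTUATION = ".!?:;"


def join_lines_as_sentences(text: str) -> str:
    """Join non-empty lines into one string, adding "." where a line lacks end punctuation."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between title and description
        # Look at the last character before any closing quote or bracket.
        core = line.rstrip(CLOSERS)
        if core and core[-1] not in END_PUNCTUATION:
            line += "."
        sentences.append(line)
    return " ".join(sentences)


if __name__ == "__main__":
    raw = "The Berliner Ensemble 'Coriolanus'\n\nCoriolanus, von William Shakespeare."
    print(join_lines_as_sentences(raw))
```

With the title and first sentence of the example, this produces the same "title. description" form as the manually corrected input above. Whether the period should go inside or outside a closing quote is a design choice; the sketch always appends it after the quote.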

kerstarno commented 2 years ago

Hi @fedenanni,

thanks for looking into this. I get the point about the line break and the "." making it clearer, so that's fine for the original use case.

However, checking the text you posted above, I've found a few other things, which I am going to list here for the time being, even though they point to different issues. I'll leave it to you to create new entries here on GitHub as needed.

There is one case, for example, where the use of "." as a delimiter is a little tricky: at the end, the text names the designer "Karl v. Appen", where "v." is the abbreviated "von". The tool currently interprets this as "Karl v." and leaves his last name out completely. This only happens when I specify the language as German, by the way. When I specify the language as English, "Karl v. Appen" is recognised correctly.
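To illustrate the "Karl v. Appen" case: if the German pipeline splits sentences on "." (an assumption, since the actual splitting logic is not shown in this thread), an abbreviation list is one common way to avoid cutting the name in two. The sketch below is a simplified stand-in, not the tool's implementation; `ABBREVIATIONS` and `split_sentences` are hypothetical names, and the abbreviation list is purely illustrative.

```python
from typing import List

# Illustrative list only; a real list would be language-specific and much longer.
ABBREVIATIONS = {"v.", "Dr.", "Prof."}


def split_sentences(text: str) -> List[str]:
    """Split after tokens ending in . ! or ?, unless the token is a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?")) and token not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without end punctuation
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Ausstattung: Karl v. Appen. Premiere: 24 September 1964."):
        print(sentence)
```

Here "v." no longer ends a sentence, so "Karl v. Appen" stays in one piece for the entity recogniser. The trade-off is that a sentence that genuinely ends in a listed abbreviation would not be split.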

There are a few other observations when checking the text in English and in German and comparing them. In German, the tool detects:

In English, the tool detects:

Let's have a chat about these during our meeting later on today.