PolMine / dbpedia

R Wrapper for Corpus Annotation with DBpedia Spotlight

Handling overlapping annotations by DBpedia Spotlight #42

Open ChristophLeonhardt opened 8 months ago

ChristophLeonhardt commented 8 months ago

Issue

Occasionally, DBpedia Spotlight returns overlapping annotations.

Take the following example:

library(dbpedia)

doc <- "Der Deutsche Bundestag tagt in Berlin."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

In phrases such as

"Der Deutsche Bundestag"

(found in GermaParl) both the entities "Der Deutsche Bundestag" and "Bundestag" are annotated. They may share the same URI, but they do not have to. Depending on the input format, this causes different problems. For character vectors, it at least inflates the number of entities: in the example above, "Bundestag" occurs only once, so counting the two annotations as two instances would be wrong in most cases. For CWB corpora, we currently have no way to encode overlapping annotations.
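To make the overcounting concrete, here is a minimal base R sketch that groups overlapping character spans. The table `ann`, its column names (`start`, `end`, `text`), the offsets and the helper `overlap_groups()` are assumptions for illustration only; they are not the documented output of get_dbpedia_uris() nor part of the package.

```r
# Hypothetical annotation table for "Der Deutsche Bundestag tagt in Berlin."
# Column names and offsets are assumptions, not the real output format.
ann <- data.frame(
  start = c(1L, 14L, 32L),   # 1-based offset of the first character
  end   = c(22L, 22L, 37L),  # offset of the last character (inclusive)
  text  = c("Der Deutsche Bundestag", "Bundestag", "Berlin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign the same group id to spans that overlap
# (directly or through a chain of overlapping spans).
overlap_groups <- function(start, end) {
  ord <- order(start, end)
  # largest end position seen before each span, in sorted order
  prev_max_end <- c(-Inf, head(cummax(end[ord]), -1))
  # a new group begins when a span starts after all earlier spans have ended
  grp_sorted <- cumsum(start[ord] > prev_max_end)
  grp <- integer(length(start))
  grp[ord] <- grp_sorted
  grp
}

ann$group <- overlap_groups(ann$start, ann$end)

n_naive    <- nrow(ann)                 # every annotation counted separately
n_resolved <- length(unique(ann$group)) # one count per overlap group
```

Counting rows treats "Der Deutsche Bundestag" and "Bundestag" as two mentions, while counting groups treats them as one. Which of the two annotations should represent the group is a separate question.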

In this issue, I'll demonstrate three variations of overlapping entity annotations. I think the technical solution might be similar for all three scenarios, but there are conceptual aspects to discuss. What follows assumes that we do not want to keep overlapping annotations but resolve them to a single annotation. Other solutions could be considered here.

Embedded Annotations

In the example above, "Der Deutsche Bundestag", one entity is completely embedded in the other. This could be resolved by checking for overlapping entities and limiting the output to one of three candidates: the entity contained in all annotations ("Bundestag"), the longest entity ("Der Deutsche Bundestag") or, using the scores provided by DBpedia Spotlight, the most "similar" entity (in terms of confidence). Are there better options? The choice could be controlled either by an additional argument in get_dbpedia_uris() or by an option. I am not sure what constitutes good practice here.

Overlapping Entities

While in the example above, one entity is part of another, there are other examples in which the annotations merely overlap. I found an example for this in a speech by Angela Merkel (PlPr 16/46, page 4479; https://dserver.bundestag.de/btp/16/16046.pdf; abbreviated for this example):

"Die Mauer fiel"

In this example, DBpedia Spotlight identifies two entities: "Die Mauer" and "Mauer fiel". Both refer to the same URI. See the following chunk:

doc <- "Die Mauer fiel"

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

As in the case above, if we only counted URIs, the number of references to "Berliner Mauer" would be overestimated: it would be counted twice although the term occurs only once.

Here, resolving the overlapping entities to one annotation seems more complicated than above: which of the two is more correct? Combining both would yield the entity "Die Mauer fiel", which might be artificial. It would also be possible to reduce the entity to the tokens occurring in both overlapping spans (i.e. "Mauer"). Might this be more appropriate? This would also be applicable to the embedded entities above, but does it always work as expected?
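The two candidate resolutions for this case, the combined span and the shared span, can be sketched in base R. The character offsets and the data frame `spans` are hypothetical values written by hand for this example; they are not taken from actual DBpedia Spotlight output.

```r
doc <- "Die Mauer fiel"

# Hypothetical character offsets (1-based, inclusive) of the two annotations
spans <- data.frame(
  start = c(1L, 5L),   # "Die Mauer", "Mauer fiel"
  end   = c(9L, 14L)
)

# Shared part: from the latest start to the earliest end
shared_start <- max(spans$start)
shared_end   <- min(spans$end)
shared <- if (shared_start <= shared_end) {
  substr(doc, shared_start, shared_end)
} else {
  NA_character_  # the spans have no characters in common
}

# Combined span: from the earliest start to the latest end
merged <- substr(doc, min(spans$start), max(spans$end))
```

Here `shared` is "Mauer" and `merged` is "Die Mauer fiel". One caveat regarding "does it always work": with three or more annotations that overlap only pairwise in a chain (A with B, B with C, but not A with C), there are no characters common to all spans, so the shared-span approach would return nothing for that group. This sketch also works on characters; a token-based version would need the token boundaries.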

Interestingly, as_subcorpus() in combination with read() seems to work just fine (at least as long as the URI is the same for both parts of the overlap):

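library(polmineR) # provides corpus(), as.speeches() and read()

# Note: extracting with `_[[1]]` after the native pipe requires R >= 4.3
sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(protocol_date == "2006-09-06") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]] # first speech of that day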

speech_annotation <- get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

read(sc,
     annotation = as_subcorpus(speech_annotation))

Overlapping Entities with the same starting position

This is a special case of the first variation: an entity can be embedded in another entity while both share the same starting position. The following example (taken from a speech by Heinrich von Brentano in the Bundestag; PlPr 3/118, page 6801; https://dserver.bundestag.de/btp/03/03118.pdf) illustrates this:

doc <- "Ölbild Kaiser Wilhelms I."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint")
)

In this case, "Kaiser Wilhelms I." and "Kaiser" are both annotated as entities. They also have different URIs assigned to them.

Since this also results in warnings when applied to CWB corpora, I will create a separate issue for this scenario.

Possible Solution

Assuming that overlapping entities are not to be encoded, we need to determine how to handle these overlaps. What I can imagine is an option or an argument that states whether the shortest entity (or the actual overlapping token?), the longest entity or the most similar entity should be kept. The terminology of "longest" and "shortest" is loosely inspired by the CWB manual for CQP queries. It should probably be checked how other tools handle this as well.

In the examples above, this would mean something like

| Entity | Shortest / Overlapping Entity | Longest Entity | Most Similar Entity |
| --- | --- | --- | --- |
| [Der Deutsche [Bundestag]] | Bundestag | Der Deutsche Bundestag | Der Deutsche Bundestag |
| [Die [Mauer] fiel] | Mauer | Die Mauer fiel | Mauer fiel |
| [[Kaiser] Wilhelms I.] | Kaiser | Kaiser Wilhelms I. | Kaiser Wilhelms I. |
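As a rough sketch of how such an argument could behave, the following base R function resolves one group of overlapping annotations according to a `keep` strategy. Everything here is an assumption for illustration: the function name `resolve_span()`, the column names, the character offsets, and especially the `score` values, which are invented so that the "most similar" column of the table above is reproduced. None of it is the package's actual API or real Spotlight output.

```r
# Hypothetical resolver for ONE group of overlapping annotations.
# `ann` needs columns start, end (1-based, inclusive) and score (assumed names).
resolve_span <- function(doc, ann,
                         keep = c("shortest", "longest", "most_similar")) {
  keep <- match.arg(keep)
  span <- switch(keep,
    shortest = c(max(ann$start), min(ann$end)),  # part shared by all spans
    longest  = c(min(ann$start), max(ann$end)),  # combined span
    most_similar = {
      best <- which.max(ann$score)               # highest score wins
      c(ann$start[best], ann$end[best])
    }
  )
  # no characters shared by all spans: nothing to return
  if (span[1] > span[2]) return(NA_character_)
  substr(doc, span[1], span[2])
}

# The three examples; offsets are hand-computed and scores are made up.
examples <- list(
  list(doc = "Der Deutsche Bundestag",
       ann = data.frame(start = c(1L, 14L), end = c(22L, 22L),
                        score = c(0.9, 0.8))),
  list(doc = "Die Mauer fiel",
       ann = data.frame(start = c(1L, 5L), end = c(9L, 14L),
                        score = c(0.7, 0.9))),
  list(doc = "Kaiser Wilhelms I.",
       ann = data.frame(start = c(1L, 1L), end = c(18L, 6L),
                        score = c(0.9, 0.6)))
)

strategies <- c("shortest", "longest", "most_similar")
# One row per example, one column per strategy
result <- t(sapply(examples, function(ex) {
  sapply(strategies, function(k) resolve_span(ex$doc, ex$ann, keep = k))
}))
```

The sketch returns only the resolved text. Which URI to attach when the shared or combined span does not correspond to any single annotation (as with "Die Mauer fiel") is deliberately left open, since that is one of the conceptual questions raised here.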

Notes:

Discussion

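The open question is how this behavior should be handled: which of these strategies should apply, and whether the choice belongs in an argument of get_dbpedia_uris() or in an option.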

ChristophLeonhardt commented 8 months ago

To provide an update: Our current line of reasoning is that get_dbpedia_uris() should return all found entity annotations. Overlaps should be filtered afterwards. This is drafted in detect_overlap() and categorize_overlap().