Open ChristophLeonhardt opened 8 months ago
To provide an update: Our current line of reasoning is that get_dbpedia_uris() should return all found entity annotations. Overlaps should be filtered afterwards. This is drafted in detect_overlap() and categorize_overlap().
Issue
Occasionally, DBpedia Spotlight returns overlapping annotations.
Take the following example (found in GermaParl), in which both the entities "Der Deutsche Bundestag" and "Bundestag" are annotated. They may share the same URI, but this is not necessarily the case. Depending on the input format, this causes different issues. For character vectors, it at least overestimates the number of unique entities: in the example above there is only one instance of "Bundestag", so counting the two URIs as two instances would be incorrect in most cases. For CWB corpora, we currently have no way to encode these overlapping annotations.
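To make the overcounting concrete, here is a minimal base R sketch. The character offsets, the placeholder URIs and the helper count_mentions() are assumptions made for illustration; they are neither actual DBpedia Spotlight output nor part of the package.

```r
# Hypothetical annotations for the string "Der Deutsche Bundestag".
# Offsets are 1-based character positions; URIs are placeholders.
annotations <- data.frame(
  text  = c("Der Deutsche Bundestag", "Bundestag"),
  start = c(1L, 14L),
  end   = c(22L, 22L),
  uri   = c("<uri_a>", "<uri_b>"),
  stringsAsFactors = FALSE
)

# Count mentions after collapsing overlapping spans into one.
count_mentions <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # Largest end position seen before each annotation
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  # A new mention begins only if the span starts after all earlier spans ended
  sum(start > prev_max_end)
}

naive_count <- nrow(annotations)
collapsed_count <- count_mentions(annotations$start, annotations$end)
```

Here naive_count treats every returned annotation as its own mention, while collapsed_count treats overlapping spans as one mention.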
In this issue, I'll demonstrate three variations of overlapping entity annotations. I think that the technical solution might be similar for all three scenarios. There are conceptual aspects to be discussed. The following considerations follow the assumption that we do not want to keep overlapping annotations but resolve these to a single annotation. Other solutions could be considered here.
Embedded Annotations
In the example above, "Der Deutsche Bundestag", one entity is completely embedded in the other. This could be resolved by checking for overlapping entities and limiting the output to either the entity contained in all annotations ("Bundestag"), the longest entity ("Der Deutsche Bundestag") or, using the scores provided by DBpedia Spotlight, the most "similar" (in terms of confidence) entity. Are there better options? This could be controlled either by an additional argument in get_dbpedia_uris() or maybe by an option. I am not sure what constitutes good practice here.
Overlapping Entities
While in the example above, one entity is part of another, there are other examples in which the annotations merely overlap. I found an example for this in a speech by Angela Merkel (PlPr 16/46, page 4479; https://dserver.bundestag.de/btp/16/16046.pdf; abbreviated for this example):
In this example, DBpedia Spotlight identifies two entities: "Die Mauer" and "Mauer fiel". They are both referring to the same URI. See the following chunk:
Similar to the issue above, if we only counted the number of URIs, the number of references to "Berliner Mauer" would be overestimated: the URI is counted twice although the term occurs only once.
Here, resolving these overlapping entities to one annotation seems to be more complicated than above: Which one is the more correct one? Combining both entities, the entity would be "Die Mauer fiel" which might be artificial. It would also be possible to reduce the entity to the tokens occurring in both overlapping spans (i.e. "Mauer"). Might this be more appropriate? This would be applicable to the embedded entities above, but does this always work as expected?
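The two candidate resolutions (combining both spans or reducing them to the shared tokens) can be sketched in a few lines of base R. The token positions are assigned by hand to the three tokens in question only, and merge_spans() and span_text() are hypothetical helpers, not functions of the package.

```r
# Only the three tokens relevant to the overlap; positions assigned by hand.
tokens <- c("Die", "Mauer", "fiel")
span_a <- c(start = 1L, end = 2L)  # "Die Mauer"
span_b <- c(start = 2L, end = 3L)  # "Mauer fiel"

# Resolve two overlapping spans to their union or their intersection.
merge_spans <- function(a, b, how = c("union", "intersection")) {
  how <- match.arg(how)
  if (a[["start"]] > b[["end"]] || b[["start"]] > a[["end"]]) {
    stop("spans do not overlap")
  }
  if (how == "union") {
    c(start = min(a[["start"]], b[["start"]]), end = max(a[["end"]], b[["end"]]))
  } else {
    c(start = max(a[["start"]], b[["start"]]), end = min(a[["end"]], b[["end"]]))
  }
}

# Paste the tokens covered by a span back together.
span_text <- function(span, tokens) {
  paste(tokens[span[["start"]]:span[["end"]]], collapse = " ")
}

union_text <- span_text(merge_spans(span_a, span_b, "union"), tokens)
intersection_text <- span_text(merge_spans(span_a, span_b, "intersection"), tokens)
```

For embedded entities, the intersection is always the embedded entity and the union is always the longer one, so the same helper would cover both variations. Whether the intersection is a meaningful entity in every case remains the open conceptual question.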
Interestingly, as_subcorpus() in combination with read() seems to work just fine (at least as long as the URI is the same for both parts of the overlap):
Overlapping Entities with the same starting position
This is a specific case of the first variation of the issue: It is possible that an entity is embedded in another entity but they both share the same starting position. In the following example (taken from a speech by Heinrich von Brentano in the Bundestag; PlPr. 3/118 page 6801; https://dserver.bundestag.de/btp/03/03118.pdf), this becomes apparent:
In this case, "Kaiser Wilhelms I." and "Kaiser" are both annotated as entities. They also have different URIs assigned to them.
Since this also results in warnings when applied to CWB corpora, I will create a separate issue for this scenario.
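The three variations can be told apart from the start and end positions alone. A rough sketch follows; classify_span_pair() is a hypothetical helper (it is not the drafted categorize_overlap()), and the token positions, including the assumption that "Kaiser Wilhelms I." spans three tokens, are assigned by hand.

```r
# Classify the relation between two spans given as inclusive token positions.
classify_span_pair <- function(a_start, a_end, b_start, b_end) {
  # No shared token at all
  if (a_end < b_start || b_end < a_start) {
    return("none")
  }
  a_in_b <- a_start >= b_start && a_end <= b_end
  b_in_a <- b_start >= a_start && b_end <= a_end
  if (a_in_b || b_in_a) {
    # One span lies completely inside the other
    if (a_start == b_start) "embedded_same_start" else "embedded"
  } else {
    # The spans share tokens, but neither contains the other
    "partial"
  }
}

bundestag <- classify_span_pair(1L, 3L, 3L, 3L)  # "Der Deutsche Bundestag" vs "Bundestag"
mauer     <- classify_span_pair(1L, 2L, 2L, 3L)  # "Die Mauer" vs "Mauer fiel"
kaiser    <- classify_span_pair(1L, 3L, 1L, 1L)  # "Kaiser Wilhelms I." vs "Kaiser"
```

Note that two identical spans would also be reported as "embedded_same_start" in this sketch; whether exact duplicates deserve their own category is a separate decision.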
Possible Solution
Assuming that overlapping entities might not be encoded, it becomes necessary to determine how to handle these overlaps. What I can imagine is an option or an argument that states whether the shortest (or the actual overlapping token?), the longest or the most similar entity should be kept. This terminology of "longest" and "shortest" is somewhat inspired by the CWB manual for CQP queries - it probably should be checked how this is handled in other tools as well.
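As a basis for discussion, such an argument could behave roughly like the sketch below. resolve_overlaps(), its keep argument, the column names and the score values are all assumptions made for illustration. In particular, the scores are invented and only mimic the situation where two candidates are very close.

```r
# Hypothetical annotations in token positions; scores are invented.
ann <- data.frame(
  text  = c("Der Deutsche Bundestag", "Bundestag"),
  start = c(1L, 3L),
  end   = c(3L, 3L),
  score = c(0.98, 0.99),
  stringsAsFactors = FALSE
)

# Keep exactly one annotation per group of overlapping annotations.
resolve_overlaps <- function(x, keep = c("longest", "shortest", "most_similar")) {
  keep <- match.arg(keep)
  x <- x[order(x$start, x$end), , drop = FALSE]
  # An annotation opens a new group if it starts after all earlier ones ended
  prev_max_end <- c(-Inf, head(cummax(x$end), -1))
  group <- cumsum(x$start > prev_max_end)
  len <- x$end - x$start + 1
  # Smaller rank value wins; ties go to the first annotation in sort order
  rank_value <- switch(keep,
    longest      = -len,
    shortest     = len,
    most_similar = -x$score
  )
  winners <- vapply(
    split(seq_len(nrow(x)), group),
    function(idx) idx[which.min(rank_value[idx])],
    integer(1)
  )
  x[unname(winners), , drop = FALSE]
}

kept_longest  <- resolve_overlaps(ann, "longest")$text
kept_shortest <- resolve_overlaps(ann, "shortest")$text
kept_similar  <- resolve_overlaps(ann, "most_similar")$text
```

Two design choices in this sketch would need discussion: chains of overlaps (A overlaps B, B overlaps C) are treated as a single group, and "shortest" returns the shortest existing annotation rather than the tokens shared by all overlapping spans.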
In the examples above, this would mean something like
Notes:
resources data.table retrieved from DBpedia Spotlight. These values can be very close.
Discussion
The question is how this behavior should be handled.
get_dbpedia_uris()?