Alignment of entity and token spans (in CWB subcorpora)

When working with a CWB corpus, I noticed that some entities returned by DBpedia Spotlight are not properly mapped to the tokens of the corpus. The issue seems to be that the entity spans of DBpedia Spotlight do not always align with the token spans of the CWB corpus.

In particular, I noticed that get_dbpedia_uris() returns a data.table object with "NA" in the "cpos_right" column for some rows. In my observation, this concerns tokens with apostrophes. Their boundaries seem to be treated differently by DBpedia Spotlight than the tokenization within the CWB would suggest.

Example

To illustrate what happens here, consider the following example:

library(dbpedia)
txt <- "Berlin is Germany's capital city."
db_dt <- get_dbpedia_uris(txt)

It is assumed that "Germany's" is passed as one token to DBpedia Spotlight. DBpedia Spotlight identifies the character sequence "Germany" as an entity. It is returned as such with the correct "start" position.

Consequences for aligning entity spans and token spans in the CWB

This is an issue if the goal is not only to identify entities in the text but also to map them back onto pre-tokenized input data, for example a CWB token stream. The match is realized by comparing the start and end positions of the entity and token spans. Although the entity and the corresponding token have the same start position, their lengths differ and so do their end positions. Because no token ends at the entity's end position, "cpos_right" becomes "NA" in the return value of get_dbpedia_uris() when used with CWB input.
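The mismatch can be reproduced without calling DBpedia Spotlight. The following is only a sketch in base R: the tokenization regex, the 1-based character offsets and the layout of the token table are assumptions made for illustration and do not reflect the internals of the dbpedia package.

```r
txt <- "Berlin is Germany's capital city."

# Assumed tokenization: split on whitespace, keep the final period as its own token.
g <- gregexpr("[^[:space:].]+|[.]", txt)
tok <- regmatches(txt, g)[[1]]
tok_start <- as.integer(g[[1]])
tok_end <- tok_start + attr(g[[1]], "match.length") - 1L

# Hypothetical token table with 0-based corpus positions, as used in the CWB.
tokens <- data.frame(
  cpos = seq_along(tok) - 1L,
  token = tok,
  start = tok_start,
  end = tok_end
)

# Entity span as an entity linker might report it: "Germany" without the "'s".
ent_text <- "Germany"
ent_start <- as.integer(regexpr(ent_text, txt, fixed = TRUE))
ent_end <- ent_start + nchar(ent_text) - 1L

# Exact matching of span boundaries against token boundaries.
cpos_left <- tokens$cpos[match(ent_start, tokens$start)]
cpos_right <- tokens$cpos[match(ent_end, tokens$end)]

cpos_left   # 2
cpos_right  # NA
```

The entity starts where the token "Germany's" starts, so the left boundary is found, but no token ends at the entity's last character, so the right boundary stays NA.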

Aside from not being able to fully map the entities to tokens, this causes problems in as_subcorpus() which returns a subcorpus of unknown size. When this subcorpus is then used as annotation in polmineR's read() function, this results in an error.

Possible Solution

I think there are potentially two ways to address this:

1. explicitly drop entities which cannot be matched, i.e. which have "NA" as their cpos_right
2. expand entity spans to match token spans

Discussion

Concerning the first point: Currently, inexact matches are kept, but because their right boundary is unknown, the annotation cannot be mapped back to the pre-tokenized input. If that mapping is the goal, removing annotations with "NA" in the cpos_right column and making this explicit with a message would address the problem. However, there might be scenarios in which mapping entities back to individual tokens is not the goal in the first place. In that case, I might want to keep the entities even if they are annotated at the sub-token level, as I implicitly would if my input were not tokenized beforehand. This could be controlled with a logical argument.

Concerning the second point: For CWB (sub)corpora which are processed without using any pre-annotated named entity spans (i.e. without the argument s_attribute), one option would be to expand the identified entity span, where necessary, to the end of the token in which the entity ends. In the example above, the entity annotation of "Germany" could be expanded to the end of the token "Germany's". In CWB subcorpora, this would avoid "NA" in the cpos_right column because the character offsets would then match. Preliminary testing suggests that this is feasible. However, I think this should be an optional feature rather than the default behavior. It seems to work well for the observed issue with apostrophes, but there might be cases in which the expansion does not work.
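A minimal sketch of such an expansion, in base R, could look as follows. The helper expand_to_token_end() and the token table with 1-based character offsets are hypothetical and only serve to illustrate the idea; they are not part of the dbpedia package.

```r
# Hypothetical token table for "Berlin is Germany's capital city."
# (1-based character offsets, 0-based corpus positions).
tokens <- data.frame(
  cpos  = 0:5,
  token = c("Berlin", "is", "Germany's", "capital", "city", "."),
  start = c(1L, 8L, 11L, 21L, 29L, 33L),
  end   = c(6L, 9L, 19L, 27L, 32L, 33L)
)

# Hypothetical helper: move each entity end offset to the end of the token
# that contains it. Offsets that fall outside any token (e.g. on whitespace)
# cannot be expanded and yield NA.
expand_to_token_end <- function(ent_end, tokens) {
  vapply(ent_end, function(e) {
    hit <- which(tokens$start <= e & tokens$end >= e)
    if (length(hit) == 1L) tokens$end[hit] else NA_integer_
  }, integer(1))
}

entity_end <- 17L  # last character of "Germany"
expanded_end <- expand_to_token_end(entity_end, tokens)
cpos_right <- tokens$cpos[match(expanded_end, tokens$end)]

expanded_end  # 19
cpos_right    # 2
```

An entity end that falls between two tokens is one example of a case in which this expansion would not work, which supports keeping it optional.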

So, essentially it could be worth considering whether to introduce two additional arguments to get_dbpedia_uris() for subcorpora:

drop_inexact_annotations: A logical value - Whether to drop annotations if entity and token spans do not align exactly
expand_to_token: A logical value - Whether diverging entity spans are expanded to match the next complete token boundary
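To make the interplay of the two proposed arguments more concrete, here is a rough sketch in base R. The function align_entities() and both input tables are hypothetical stand-ins; get_dbpedia_uris() itself is not called, and its actual implementation may well look different.

```r
# Hypothetical alignment step illustrating the two proposed arguments.
align_entities <- function(entities, tokens,
                           drop_inexact_annotations = FALSE,
                           expand_to_token = FALSE) {
  ent_end <- entities$end
  if (expand_to_token) {
    # Move each end offset to the end of the token containing it, if any.
    for (i in seq_along(ent_end)) {
      hit <- which(tokens$start <= ent_end[i] & tokens$end >= ent_end[i])
      if (length(hit) == 1L) ent_end[i] <- tokens$end[hit]
    }
  }
  entities$cpos_left  <- tokens$cpos[match(entities$start, tokens$start)]
  entities$cpos_right <- tokens$cpos[match(ent_end, tokens$end)]
  if (drop_inexact_annotations) {
    inexact <- is.na(entities$cpos_left) | is.na(entities$cpos_right)
    if (any(inexact)) {
      message("Dropping ", sum(inexact),
              " annotation(s) that do not align with token spans.")
      entities <- entities[!inexact, , drop = FALSE]
    }
  }
  entities
}

# Assumed inputs for "Berlin is Germany's capital city." (1-based offsets).
tokens <- data.frame(
  cpos  = 0:5,
  start = c(1L, 8L, 11L, 21L, 29L, 33L),
  end   = c(6L, 9L, 19L, 27L, 32L, 33L)
)
entities <- data.frame(
  text  = c("Berlin", "Germany"),
  start = c(1L, 11L),
  end   = c(6L, 17L)
)

kept     <- align_entities(entities, tokens)
dropped  <- align_entities(entities, tokens, drop_inexact_annotations = TRUE)
expanded <- align_entities(entities, tokens, expand_to_token = TRUE)
```

With the defaults, "Germany" keeps an NA in cpos_right; with drop_inexact_annotations = TRUE, only "Berlin" remains and a message is issued; with expand_to_token = TRUE, both entities receive complete corpus positions.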

For the sake of usability, however, adding too many arguments should be avoided.
