PolMine / dbpedia

R Wrapper for Corpus Annotation with DBpedia Spotlight

Unexpected return values of `get_dbpedia_uris()` if `s_attribute` is NULL #18

Closed ChristophLeonhardt closed 11 months ago

ChristophLeonhardt commented 12 months ago

In the following example, I take the first paragraph of the GERMAPARL2 corpus and send it to DBpedia Spotlight (running in Docker). When I use the named entities encoded in the corpus, this works as expected. When I use the default for s_attribute, i.e. s_attribute = NULL, the return value looks odd: there are a lot of rows, and each token appears with a number of identical annotations and corpus positions.

Here is the example:

library(polmineR)
library(dbpedia) # v0.1.1.9002

# get first paragraph
paragraph <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07") |>
  split(s_attribute = "p", value = FALSE) |>
  _[[1]]

p_annotated_with_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = "ne_type",
  verbose = interactive()
)

p_annotated_without_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = NULL,
  verbose = interactive()
)

(See one of the answers in https://stackoverflow.com/questions/67799890/column-name-equivalent-for-r-base-pipe for the hint concerning the "_" placeholder. Using it as the head of an extraction such as _[[1]] requires R 4.3.0 or later.)

In the source code of get_dbpedia_uris(), everything looks as expected up to the branch if (is.null(s_attribute)){ ...: at that point, links and dt both contain a reasonable number of rows.

I think the comparison between links and dt is a bit off. I assume that dt is missing from the second element of the comparison. So maybe (!) it would suffice to change

https://github.com/PolMine/dbpedia/blob/01c5d91c9498a3f4b45f680bcc72beaad9d79d34/R/dbpedia.R#L270-L278

to something like

tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
    cpos_right = dt[.SD[["end"]] == dt[["end"]]][["id"]],
    dbpedia_uri = .SD[["dbpedia_uri"]],
    text = .SD[["text"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text")
]

(note the additional dt instead of the .SD in the comparison)

I think that the result looks reasonable, but I did not double check this yet.
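
To illustrate the intended mapping from character offsets to corpus positions, here is a minimal sketch with made-up data. It is not the package's code: the toy tables dt and links only mimic the columns used above (the offsets, ids and example.org URIs are invented), and match() is swapped in for the grouped lookup to keep the example short. The point is that each link's start and end are looked up in the token table dt, so the result has exactly one row per link.

```r
library(data.table)

# Hypothetical token table: character offsets plus corpus position (id)
dt <- data.table(
  id    = 100:103,
  start = c(0L, 4L, 11L, 15L),
  end   = c(3L, 10L, 14L, 21L)
)

# Hypothetical entity links, reported as character offsets
links <- data.table(
  start       = c(4L, 15L),
  end         = c(14L, 21L),
  dbpedia_uri = c("http://example.org/A", "http://example.org/B"),
  text        = c("entity one", "entity two")
)

# Look up each link's offsets in the token table dt (not in links itself)
links[, cpos_left  := dt[["id"]][match(start, dt[["start"]])]]
links[, cpos_right := dt[["id"]][match(end, dt[["end"]])]]

print(links)
```

In this toy case the first link spans two tokens (corpus positions 101 to 102) and the second covers a single token (103), with no duplicated rows.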

ablaette commented 11 months ago

Yep. Perfect solution for the bug.