In the following example, I take the first paragraph of the GERMAPARL2 corpus and send it to DBpedia Spotlight (running in Docker). When I use the named entities encoded in the corpus, this works as expected. When I use the default for s_attribute, i.e. s_attribute = NULL, the return value looks odd: there are a lot of rows, and each token has a number of identical annotations and corpus positions.
Here is the example:
library(polmineR)
library(dbpedia) # v0.1.1.9002

# get first paragraph
# (using `_` as the head of an extraction call needs R >= 4.3)
paragraph <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07") |>
  split(s_attribute = "p", value = FALSE) |>
  _[[1]]

# annotate using the named entities encoded in the corpus
p_annotated_with_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = "ne_type",
  verbose = interactive()
)

# annotate with the default, s_attribute = NULL
p_annotated_without_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = NULL,
  verbose = interactive()
)
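To quantify the repetition in the second result, a check along the following lines may help. This is only a sketch on toy data: the table below is a made-up stand-in for the return value, and the column names cpos_left, cpos_right and dbpedia_uri are assumptions about its layout, not taken from the output above.

```r
library(data.table)

# Toy stand-in for the returned table (values and column names are assumptions)
res <- data.table(
  cpos_left   = c(10L, 10L, 10L, 25L),
  cpos_right  = c(11L, 11L, 11L, 25L),
  dbpedia_uri = c(
    rep("http://example.org/resource/A", 3),
    "http://example.org/resource/B"
  )
)

# Compare the total number of rows with the number of distinct rows
n_total  <- nrow(res)
n_unique <- nrow(unique(res))

# List the annotations that occur more than once, with their counts
dupes <- res[, .N, by = .(cpos_left, cpos_right, dbpedia_uri)][N > 1L]
print(dupes)
```

If n_total is much larger than n_unique for p_annotated_without_ne but not for p_annotated_with_ne, that would support the impression that rows are being multiplied in the s_attribute = NULL branch.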
(See one of the answers in https://stackoverflow.com/questions/67799890/column-name-equivalent-for-r-base-pipe for the hint concerning the "_" placeholder used in the pipe.)
In the source code of get_dbpedia_uris(), I see that everything looks as expected until if (is.null(s_attribute)){ ..., i.e. links and dt contain a reasonable number of rows.
I think the comparison between links and dt is a bit off. I would assume that dt is missing in the second element of the comparison. So maybe (!) it would suffice to change
https://github.com/PolMine/dbpedia/blob/01c5d91c9498a3f4b45f680bcc72beaad9d79d34/R/dbpedia.R#L270-L278
to something like
(note the additional dt instead of the .SD in the comparison).
I think the result looks reasonable, but I have not double-checked it yet.
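The general pitfall suspected here can be illustrated with toy data. This is not the code from dbpedia.R: the two tables, their column names and the URI are invented, and the sketch only shows why a comparison whose two sides refer to the same table can return far more rows than intended.

```r
library(data.table)

# Hypothetical token table: one row per token, with its character offset
dt <- data.table(id = 0:3, start = c(1L, 5L, 9L, 14L))

# Hypothetical annotation table: a single entity starting at offset 9
links <- data.table(start = 9L, uri = "http://example.org/resource/X")

# Both sides of the comparison come from the same table, so the condition
# is TRUE regardless of the tokens and every id is returned
ids_same_table <- dt[links[["start"]] == links[["start"]]][["id"]]

# One side comes from each table, so only the matching token is returned
ids_across_tables <- dt[links[["start"]] == dt[["start"]]][["id"]]

print(ids_same_table)
print(ids_across_tables)
```

Whether this is what actually happens in the lines linked above would still need to be confirmed against the package source.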