PolMine / LinkTools

R package with tools for data linkage
GNU General Public License v3.0

Use functionality of DBpedia Spotlight #13

Open ablaette opened 1 year ago

ablaette commented 1 year ago

See this as an entry point: https://github.com/dbpedia-spotlight/spotlight-docker

Alternative: https://opentapioca.org/ (without docker)

ablaette commented 1 year ago

The prebuilt Docker images are not available for all architectures, in particular not for Apple M1 (arm64). We therefore need to build the image from the Dockerfile as follows.

git clone https://github.com/dbpedia-spotlight/spotlight-docker.git
cd spotlight-docker
docker build -t dbpedia/dbpedia-spotlight:latest .

Then run the image as follows.

docker run -tid --restart unless-stopped --name dbpedia-spotlight.de --mount source=spotlight-model,target=/opt/spotlight -p 2222:80 dbpedia/dbpedia-spotlight spotlight.sh de

ablaette commented 1 year ago

This is a snippet to pass data to DBpedia Spotlight using classes from the NLP package. Two alerts:

library(polmineR)
use("polmineR")

merkel_speeches <- corpus("GERMAPARLMINI") %>% 
  subset(speaker == "Angela Dorothea Merkel") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date")

doc <- as(merkel_speeches[[2]], "AnnotatedPlainTextDocument")

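# Form-encoded POST, mirroring curl's --data-urlencode "text=..." --data "confidence=0.35".
# httr::GET() has no body argument, so the text and confidence go into a POST body.
y <- httr::POST(
  url = "http://localhost:2222/rest/annotate",
  body = list(
    text = as.character(doc[["content"]]),
    confidence = 0.35
  ),
  encode = "form",
  httr::accept_json()
)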
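
For reference, `--data-urlencode` and `--data` are curl command-line flags, not parameter names of the service; curl turns them into a form-encoded request body. The following is a minimal base R sketch of what such a body looks like. The sample text is made up, and `utils::URLencode()` writes spaces as `%20` rather than `+`, which a form decoder should accept as well.

```r
# Made-up sample text; not taken from the corpus.
sample_text <- "Angela Merkel & Co"

# Percent-encode the text (including reserved characters such as "&"),
# then join the key=value pairs with "&" as in a form-encoded body.
form_body <- paste0(
  "text=", utils::URLencode(sample_text, reserved = TRUE),
  "&confidence=0.35"
)

print(form_body)
```

Passing a named list together with `encode = "form"` to `httr::POST()` leaves this encoding step to httr.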
ablaette commented 1 year ago

I would have hoped that offset positions of input and output correspond, but that does not seem to be the case:

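# Continues the session above (polmineR is already loaded).
library(jsonlite)
library(dplyr)   # mutate(), select()
library(purrr)   # pluck()
library(tibble)  # as_tibble()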

merkel_speeches <- corpus("GERMAPARLMINI") %>% 
  subset(speaker == "Angela Dorothea Merkel") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date")

doc <- as(merkel_speeches[[2]], "AnnotatedPlainTextDocument")

request <- httr::GET(
  url = "http://localhost:2222/rest/annotate",
  query = list(
    text = substr(doc[["content"]], 1, 990),
    confidence = 0.35
  ),
  httr::add_headers('Accept' = 'application/json')
)

# Output
httr::content(request, as = "text") %>%
  jsonlite::fromJSON() %>%
  pluck("Resources") %>%
  head() %>%
  .[, c("@surfaceForm", "@offset")]

# Input
as.data.frame(doc[["annotation"]]) %>% 
  as_tibble() %>%
  mutate(word = sapply(features, `[[`, "word")) %>%
  mutate(pos = sapply(features, `[[`, "pos")) %>%
  select(-features) %>%
  head()
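
One thing that may contribute to the mismatch is the indexing convention: annotations from the NLP package use 1-based, inclusive character spans, whereas the `@offset` values returned by Spotlight look like 0-based character offsets. The base R sketch below, on made-up toy data, shows how the two could be aligned under that assumption. The helper `align_to_tokens()` and the columns `token_first` and `token_last` are hypothetical names, and whether Spotlight counts characters exactly as R does (for example for non-ASCII text) is not checked here.

```r
# Toy input standing in for doc[["content"]] (made-up sentence, ASCII only).
text <- "Angela Merkel spricht in Berlin."

# Tokenize and record 1-based, inclusive character spans, as NLP annotations do.
m <- gregexpr("[[:alnum:]]+|[[:punct:]]", text)
tokens <- data.frame(
  word  = regmatches(text, m)[[1]],
  start = as.integer(m[[1]]),
  end   = as.integer(m[[1]]) + attr(m[[1]], "match.length") - 1L,
  stringsAsFactors = FALSE
)

# Toy stand-in for the "Resources" table; offsets are ASSUMED to be 0-based.
entities <- data.frame(
  surfaceForm = c("Angela Merkel", "Berlin"),
  offset      = c("0", "25"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert 0-based offsets to 1-based spans and look up
# the tokens whose boundaries coincide with the entity boundaries.
align_to_tokens <- function(entities, tokens) {
  ent_start <- as.integer(entities$offset) + 1L
  ent_end   <- ent_start + nchar(entities$surfaceForm) - 1L
  entities$token_first <- match(ent_start, tokens$start)
  entities$token_last  <- match(ent_end, tokens$end)
  entities
}

aligned <- align_to_tokens(entities, tokens)
print(aligned)
```

An `NA` in `token_first` or `token_last` would flag an entity whose boundaries do not coincide with token boundaries, which is a quick way to see whether the shift by one explains the differences or whether something else is going on.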