
R package ‘dbpedia’ - wrapper for DBpedia Spotlight

License: GPL v3 | Lifecycle: maturing

About

Functionality for Entity Linking from R: Get DBpedia URIs for entities in a corpus using DBpedia Spotlight.

Motivation

The method of Entity Linking is used to disambiguate entities such as persons, organizations and locations in running text and link them to entries in an external knowledge graph. At its core, the aim of the dbpedia R package is to integrate Entity Linking with the tool “DBpedia Spotlight” (https://www.dbpedia-spotlight.org) into common workflows for text analysis in R.

For the examples in this README, we also load some additional packages.

library(kableExtra)
library(dplyr)
library(quanteda)

First Look

The main motivation is to lower the barriers to linking textual data with other resources in social science research. As such, the package aims to provide a focused way to interact with DBpedia Spotlight from within R. In its most basic application, it can be used to query DBpedia Spotlight as follows:

library(dbpedia)

doc <- "Berlin is the capital city of Germany."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint")
)

DBpedia Spotlight identifies which parts of the text represent entities and decides to which resources in the DBpedia knowledge graph they correspond. The return value of the method is a data.table containing the identified entities along with their respective DBpedia URIs and their starting positions in the document.

start  text     dbpedia_uri
1      Berlin   http://de.dbpedia.org/resource/Berlin
31     Germany  http://de.dbpedia.org/resource/Deutschland
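
Since kableExtra and dplyr have been loaded above, such a table can be rendered nicely, e.g. in an R Markdown document. A minimal sketch, assuming uri_table has the columns shown above:

uri_table |>
  dplyr::select(start, text, dbpedia_uri) |>  # keep the columns shown above
  kableExtra::kbl() |>
  kableExtra::kable_styling()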

Installation and Setup

At this stage, the dbpedia R package is a GitHub-only package. Install it as follows:

devtools::install_github("PolMine/dbpedia", ref = "main")

In a nutshell, the package prepares queries, sends them to an external tool and parses the returned results. This tool, DBpedia Spotlight, runs as a web service, either remotely or locally. The developers of DBpedia Spotlight currently maintain a public endpoint for the service, which the dbpedia package selects by default.

Running DBpedia Spotlight locally - Docker Setup

As an alternative to the public endpoint, it is possible to run the service locally. This can be sensible for reasons of performance, the rate limits of the public endpoint and other considerations. The easiest way to do so is to run the tool within a Docker container prepared by the maintainers of DBpedia Spotlight. The setup is described in some detail in the corresponding GitHub repository: https://github.com/dbpedia-spotlight/spotlight-docker.

As described on the GitHub page, with Docker running, the quick-start command to be used in the terminal to load and run a DBpedia Spotlight model is as follows:

docker run -tid \
  --restart unless-stopped \
  --name dbpedia-spotlight.de \
  --mount source=spotlight-model,target=/opt/spotlight \
  -p 2222:80  \
  dbpedia/dbpedia-spotlight spotlight.sh de

This initializes the German DBpedia Spotlight model. Other available languages are listed in the GitHub repository as well.

Note: In our tests, we noticed that the DBpedia Spotlight Docker images are not available for all architectures, in particular Apple silicon. In this case, build the container from the Dockerfile as follows before loading the model:

git clone https://github.com/dbpedia-spotlight/spotlight-docker.git
cd spotlight-docker
docker build -t dbpedia/dbpedia-spotlight:latest .

Note: When run for the first time, the script downloads the language model. Depending on the language, this download and the subsequent initialization of the model can take some time, and the progress is not necessarily obvious from the terminal output. If the container is queried before the language model is fully initialized, the download or initialization may be interrupted, which will cause errors in later queries. It is thus advisable to wait until the container is idle before querying the service for the first time in a session.
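
To guard against querying too early, the service can be polled until it responds. The following is a minimal sketch using the httr package; the URL matches the port mapping of the Docker command above, and wait_for_spotlight() is a hypothetical helper, not part of the dbpedia package:

library(httr)

# hypothetical helper: poll the local Spotlight endpoint until it answers
wait_for_spotlight <- function(url = "http://localhost:2222/rest/annotate",
                               timeout = 600L) {
  t0 <- Sys.time()
  repeat {
    ready <- tryCatch(
      status_code(GET(url, query = list(text = "Berlin"), accept("application/json"))) == 200L,
      error = function(e) FALSE
    )
    if (ready) return(invisible(TRUE))
    if (difftime(Sys.time(), t0, units = "secs") > timeout)
      stop("DBpedia Spotlight did not become available in time")
    Sys.sleep(5)  # wait before trying again
  }
}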

Using the package - A Very Quick Walkthrough with quanteda corpora

This README uses the common quanteda corpus format as input to provide a quick step-by-step overview of the functionality provided by the package. A brief second example illustrates how the extracted Uniform Resource Identifiers can be mapped back onto the input, using a Corpus Workbench corpus as an example.

Setup - Loading the package

Upon loading the dbpedia package, a start-up message reports whether a local DBpedia Spotlight service or a public endpoint is used. In addition, the language of the model and the corresponding list of stop words are shown.

library(dbpedia)

This information is available throughout the R session and is used by default by the get_dbpedia_uris() method.

getOption("dbpedia.endpoint")
getOption("dbpedia.lang")

Data

For the following example, we use the “US presidential inaugural address texts” corpus from the quanteda R package. For illustrative purposes, only speeches after 1970 are used. To create useful chunks of text, we split the corpus into paragraphs.

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year > 1970) |>
  corpus_reshape(to = "paragraphs")
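
To verify the reshape, the number of resulting paragraph documents can be counted with quanteda's ndoc():

ndoc(inaugural_paragraphs)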

Entity Linking with get_dbpedia_uris()

Using a local endpoint for the DBpedia Spotlight service and the sample corpus from quanteda, identifying and disambiguating entities in documents can be realized with the main worker method of the package: get_dbpedia_uris().

The method accepts the data in different input formats - character vectors, quanteda corpora, Corpus Workbench format, XML - as well as additional parameters, some of which are discussed in more detail in the package’s vignette.

uritab_paragraphs <- get_dbpedia_uris(
  x = inaugural_paragraphs,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  api = getOption("dbpedia.endpoint"),
  verbose = FALSE,
  progress = FALSE
)

In this case, the text of each document in the corpus is extracted and passed to the DBpedia Spotlight service. The results are then parsed by the method. The return value is a data.table containing the document name as well as the extracted entities along with their starting position in the text and, most importantly, their respective URI in the DBpedia Knowledge Graph (only the first five entities are shown here and the column containing the types of the entities is omitted):

doc           start  text           dbpedia_uri
1973-Nixon.1  1      Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  21     Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  25     Speaker        http://de.dbpedia.org/resource/Speaker
1973-Nixon.1  34     Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  38     Chief Justice  http://de.dbpedia.org/resource/Chief_Justice_of_the_United_States
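
As the return value behaves like a regular data frame, it can be summarized with dplyr, for instance to see which resources are linked most frequently. A minimal sketch, assuming the columns shown above:

uritab_paragraphs |>
  dplyr::count(dbpedia_uri, sort = TRUE) |>  # tally linked resources
  head(5)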

The package’s vignette provides more detail on the approach and its parameters.

Token-Level Annotation with the Corpus Workbench

While approaches that enrich documents with entities are very useful, another important aspect of entity linking is the ability to assign URIs to precise spans within the text and write them back to the corpus. This can be crucial if extracted URIs are to be used in subsequent tasks when working with textual data. The Corpus Workbench data format makes it possible to map annotated entities onto the continuous text of the initial corpus. The following quick example illustrates this.

Data

For this example, we use a single newswire of the REUTERS corpus. The corpus is provided as a Corpus Workbench sample corpus in the RcppCWB R package. To work with CWB corpora in R, the R package polmineR is used. Both RcppCWB and polmineR are dependencies of dbpedia.

library(polmineR)
use("RcppCWB")

To extract an illustrative part of the REUTERS corpus, we create a subcorpus comprising a single document. To do so, we use polmineR’s subset() method for CWB corpus objects.

reuters_newswire <- corpus("REUTERS") |>
  subset(id == 144)
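
To confirm that the subset comprises the intended document, polmineR's size() method reports the number of tokens in the subcorpus:

size(reuters_newswire)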

Entity Linking with get_dbpedia_uris()

Like before, we perform Entity Linking with get_dbpedia_uris(). In addition, we map entity types returned by DBpedia Spotlight to a number of entity classes (see the vignette for a more comprehensive explanation).

mapping_vector <- c(
  "PERSON" = "DBpedia:Person",
  "ORGANIZATION" = "DBpedia:Organisation",
  "LOCATION" = "DBpedia:Place"
)

reuters_newswire_annotation <- reuters_newswire |>
  get_dbpedia_uris(verbose = FALSE) |>
  map_types_to_class(mapping_vector = mapping_vector)
## ℹ mapping values in column `types` to new column `class`

This results in the following annotations (only the first five entities are shown here and the column of types is omitted):

cpos_left  cpos_right  dbpedia_uri                                                              text     class
92         92          http://de.dbpedia.org/resource/Organisation_erdölexportierender_Länder  OPEC     LOCATION|ORGANIZATION|PERSON
93         93          http://de.dbpedia.org/resource/Brian_May                                 may      LOCATION|ORGANIZATION|PERSON
101        101         http://de.dbpedia.org/resource/June_Carter_Cash                          June     LOCATION|ORGANIZATION|PERSON
102        102         http://de.dbpedia.org/resource/Session_(Schweiz)                         session  LOCATION|ORGANIZATION|PERSON
105        105         http://de.dbpedia.org/resource/Integrated_Truss_Structure                its      LOCATION|ORGANIZATION|PERSON
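
Some of these matches are obviously spurious (for instance, “may” linked to Brian May). One way to suppress weak matches is a stricter confidence threshold than the 0.5 used in the quanteda example above; the value of 0.7 below is merely illustrative, and we assume the confidence argument is available for CWB input as it is for quanteda corpora:

reuters_newswire_strict <- reuters_newswire |>
  get_dbpedia_uris(confidence = 0.7, verbose = FALSE) |>  # stricter threshold
  map_types_to_class(mapping_vector = mapping_vector)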

Mapping the Results to the Corpus

This leaves us with output similar to before. As explained in more detail in the vignette, the output of get_dbpedia_uris() for CWB objects additionally contains the corpus positions of entities within the continuous text. This allows us to map the annotations back to the corpus.

polmineR’s read() method allows us to visualize this mapping interactively, using the classes of the entities to provide some visual clues as well.

read(reuters_newswire,
     annotation = as_subcorpus(reuters_newswire_annotation, highlight_by = "class"))

Advanced Scenarios

This README only offers a first look into the functions of the dbpedia package. Specific parameters as well as other scenarios are discussed in more detail in the vignette of the package. These scenarios include the integration of SPARQL queries in the workflow to further enrich disambiguated entities with additional data from the DBpedia and Wikidata knowledge graphs.
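
As a flavor of what such enrichment can look like, the following sketch sends a plain SPARQL request to the public DBpedia endpoint with httr; this is not the package's own API, just an illustration of the kind of query involved:

library(httr)

# retrieve the English abstract for a disambiguated resource
sparql_query <- '
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}'

resp <- GET(
  "https://dbpedia.org/sparql",
  query = list(query = sparql_query, format = "application/sparql-results+json")
)
content(resp, as = "parsed")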

Related work

Acknowledgements

We gratefully acknowledge funding from the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur / NFDI). Developing the dbpedia package is part of the measure “Linking Textual Data” as part of the consortium KonsortSWD (project number 442494171).