CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0

CERMINE is a Java library and a web service (cermine.ceon.pl) for extracting metadata and content from PDF files containing academic publications. It is developed at the Centre for Open Science, part of the Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw.

The code is licensed under GNU Affero General Public License version 3.

How to cite CERMINE:

Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. 
CERMINE: automatic extraction of structured metadata from scientific literature. 
In International Journal on Document Analysis and Recognition (IJDAR), 2015, 
vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.

DOI of CERMINE release 1.13:

DOI

Using CERMINE

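CERMINE can be used for extracting metadata and content from PDF files, extracting metadata from reference strings, and extracting metadata from affiliation strings.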

In all tasks the default output format is NLM JATS.

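There are three ways of using CERMINE, depending on the user's needs: as a standalone application, as a Maven dependency in a Java project, or through the REST service.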

Refer to one of the sections below for details.

Standalone application

The easiest way to process files on a laptop/server is using CERMINE as a standalone application. All you will need is a single JAR file containing all the tools, external libraries and learned models. The latest release can be downloaded from the repository (look for a file called cermine-impl-<VERSION>-jar-with-dependencies.jar). The current version is 1.13.

Processing PDF documents

The basic command for processing PDF files is the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/directory/with/pdfs/

The additional argument -outputs can be used to specify the types of the outputs. The value should be a comma-separated list of one or more of the following:

Processing references

To extract metadata from a reference string use the following:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.bibref.CRFBibReferenceParser -reference "the text of the reference"

Processing affiliations

To extract metadata from an affiliation string use:

$ java -cp cermine-impl-<VERSION>-jar-with-dependencies.jar pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser -affiliation "the text of the affiliation"
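
The three commands above differ only in the main class and the argument name, so they can also be assembled and launched from Java code. The following is a minimal sketch, not part of CERMINE: the class CermineCommands and its method names are hypothetical, and it assumes Java 9 or later and a java executable on the PATH. Only the main class names and the -path, -reference and -affiliation flags are taken from the commands above.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper (not part of CERMINE) that assembles the command lines shown above. */
final class CermineCommands {
    private CermineCommands() {}

    /** Command for processing every PDF file in a directory. */
    static List<String> extractFromPdfs(String jarPath, String pdfDirectory) {
        return List.of("java", "-cp", jarPath,
                "pl.edu.icm.cermine.ContentExtractor", "-path", pdfDirectory);
    }

    /** Command for parsing a single reference string. */
    static List<String> parseReference(String jarPath, String referenceText) {
        return List.of("java", "-cp", jarPath,
                "pl.edu.icm.cermine.bibref.CRFBibReferenceParser", "-reference", referenceText);
    }

    /** Command for parsing a single affiliation string. */
    static List<String> parseAffiliation(String jarPath, String affiliationText) {
        return List.of("java", "-cp", jarPath,
                "pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser",
                "-affiliation", affiliationText);
    }

    /** Runs a command, forwards its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

A call such as CermineCommands.run(CermineCommands.parseReference(jarPath, text)) would then launch the parser. Passing the arguments as a list rather than as one shell string means that reference or affiliation text containing spaces or quotes needs no extra escaping.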

(Optional) To build the executable JAR yourself, clone the project and run:

$ cd CERMINE/cermine-impl
$ mvn compile assembly:single

This produces the file cermine-impl-<VERSION>-jar-with-dependencies.jar in the cermine-impl/target directory.
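
Because the file name contains the version, code that launches the JAR may want to look it up rather than hard-code it. The following is a small sketch under that assumption. The class CermineJarLocator is hypothetical and not part of CERMINE, and the glob pattern is simply the file name above with the version replaced by a wildcard.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper (not part of CERMINE) that finds the assembled JAR in a directory. */
final class CermineJarLocator {
    /** File name pattern from the text above, with <VERSION> replaced by a wildcard. */
    static final String JAR_GLOB = "cermine-impl-*-jar-with-dependencies.jar";

    private CermineJarLocator() {}

    /** Returns one matching JAR in the directory, or an empty Optional if none exists. */
    static Optional<Path> find(Path directory) throws IOException {
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(directory, JAR_GLOB)) {
            for (Path match : matches) {
                return Optional.of(match);
            }
        }
        return Optional.empty();
    }
}
```

For a local build, the directory to search would be cermine-impl/target. If several versions are present, this sketch returns whichever one the file system lists first.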

Maven dependency

CERMINE can be used in Java projects by adding the following dependency and repository to the project's pom.xml file:

<dependency>
    <groupId>pl.edu.icm.cermine</groupId>
    <artifactId>cermine-impl</artifactId>
    <version>${cermine.version}</version>
</dependency>

<repository>
    <id>icm</id>
    <name>ICM repository</name>
    <url>http://maven.icm.edu.pl/artifactory/repo</url>
</repository>

Example code to extract the content from a PDF file:

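// Create the extractor and hand it the PDF file as a stream.
ContentExtractor extractor = new ContentExtractor();
InputStream inputStream = new FileInputStream("path/to/pdf/file");
extractor.setPDF(inputStream);
// Run the extraction; the result holds the document in NLM JATS format.
Element result = extractor.getContentAsNLM();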

Example code to extract metadata from a reference string:

CRFBibReferenceParser parser = CRFBibReferenceParser.getInstance();
BibEntry reference = parser.parseBibReference(referenceText);

Example code to extract metadata from an affiliation string:

CRFAffiliationParser parser = new CRFAffiliationParser();
Element affiliation = parser.parse(affiliationText);

REST service

The third possibility is to use CERMINE's REST service, for example with the cURL tool. Note, however, that this should only be used for small amounts of data, as the server has limited resources. Moreover, the web application might not run the latest version of the code. In most cases using the executable JAR is a better choice.

To extract the content from a PDF file:

$ curl -X POST --data-binary @article.pdf \
  --header "Content-Type: application/binary" \
  http://cermine.ceon.pl/extract.do
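
The same request can be sent from Java without cURL. The following is a minimal sketch using the java.net.http client from the standard library (Java 11 or later). The class CermineExtractRequest and its methods are hypothetical. Only the URL, the POST method and the Content-Type header are taken from the cURL command above, and the response is read as plain text without assuming anything about its structure.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical client sketch (not part of CERMINE) for the extract.do endpoint. */
final class CermineExtractRequest {
    static final URI EXTRACT_URI = URI.create("http://cermine.ceon.pl/extract.do");

    private CermineExtractRequest() {}

    /** Builds the POST request; the PDF bytes are sent unmodified, like curl --data-binary. */
    static HttpRequest build(byte[] pdfBytes) {
        return HttpRequest.newBuilder(EXTRACT_URI)
                .header("Content-Type", "application/binary")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    /** Reads the PDF, sends it to the service, and returns the response body as text. */
    static String send(Path pdf) throws IOException, InterruptedException {
        HttpRequest request = build(Files.readAllBytes(pdf));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling CermineExtractRequest.send(Path.of("article.pdf")) corresponds to the cURL command. Building the request separately from sending it lets the request be inspected without contacting the server.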

To extract metadata from a reference string:

$ curl -X POST --data "reference=the text of the reference" \
  http://cermine.ceon.pl/parse.do

To extract metadata from an affiliation string:

$ curl -X POST --data "affiliation=the text of the affiliation" \
  http://cermine.ceon.pl/parse.do
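
Both parse.do requests above send a single form field, so one Java helper can cover them. The following is a minimal sketch using java.net.http (Java 11 or later). The class CermineParseRequest and its methods are hypothetical. The URL and the field names reference and affiliation come from the cURL commands above. Percent-encoding the text is an addition in this sketch (cURL's --data option sends the text as given), made so that characters such as & or = in the text are not read as form separators.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch (not part of CERMINE) for the parse.do endpoint. */
final class CermineParseRequest {
    static final URI PARSE_URI = URI.create("http://cermine.ceon.pl/parse.do");

    private CermineParseRequest() {}

    /** Encodes one form field, e.g. field "reference" or "affiliation". */
    static String formBody(String field, String text) {
        return field + "=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    /** Builds the POST request with a form-encoded body, the content type curl --data uses. */
    static HttpRequest build(String field, String text) {
        return HttpRequest.newBuilder(PARSE_URI)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(field, text)))
                .build();
    }
}
```

A request built with CermineParseRequest.build("reference", "the text of the reference") can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the affiliation variant differs only in the field name.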