clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0


pdf2txt

The pdf2txt project combines interfaces to a number of PDF to text converters with text preprocessors that refine the converted text for use in further NLP applications.

Contents

  1. Library
  2. Executable
  3. PDF Converters
  4. Preprocessors
  5. Language Models
  6. Command Line Syntax
  7. Memory

Library

This project has been published to Maven Central and can be used as a library dependency by sbt and other build tools. Include a line like this in build.sbt to incorporate the main project along with all the subprojects:

libraryDependencies += "org.clulab" %% "pdf2txt" % "1.1.2"

Executable

The main Pdf2txtApp can be run directly from the pre-built jar file. The only prerequisite is Java. Startup is significantly quicker than when it runs via sbt.

PDF Converters

The PDF converters fall into two categories: some work locally, with no network connection needed, while others depend on remote servers to perform the conversion. The default is the local tika converter.

Preprocessors

Preprocessors can be configured on (true) and off (false) as shown later, and by default they are applied in the order given here. That order can be changed if the project is used as a library, since it is an (ordered) array of preprocessors that gets passed around. Because the actions of one preprocessor can affect how the next might work or how the previous might have worked, the list is traversed multiple times until the output no longer changes.
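The repeat-until-stable traversal described above can be sketched as follows. The `Preprocessor` trait and the two sample preprocessors here are hypothetical stand-ins for illustration, not the project's actual classes:

```scala
object PreprocessorLoop {
  // Hypothetical interface: each preprocessor rewrites the text.
  trait Preprocessor {
    def preprocess(text: String): String
  }

  // Collapse runs of spaces into a single space.
  val collapseSpaces: Preprocessor = text => text.replaceAll(" {2,}", " ")

  // Join words hyphenated across a line break, e.g. "con-\nverted" -> "converted".
  val joinLineBreakHyphens: Preprocessor = text => text.replaceAll("-\n", "")

  // Apply the ordered preprocessors repeatedly until the output no longer changes,
  // since one preprocessor's edits can expose work for another.
  def preprocessToFixedPoint(preprocessors: Seq[Preprocessor], text: String): String = {
    val next = preprocessors.foldLeft(text)((t, p) => p.preprocess(t))
    if (next == text) text else preprocessToFixedPoint(preprocessors, next)
  }

  def main(args: Array[String]): Unit = {
    val input = "refine  the con-\nverted   text"
    // prints "refine the converted text"
    println(preprocessToFixedPoint(Seq(collapseSpaces, joinLineBreakHyphens), input))
  }
}
```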

The preprocessor unit tests include illustrative examples of transformations.

Language Models

The primary responsibility of the language models is to determine whether word "parts" should be joined so that a word is whole again. The parts may have resulted from spaces or hyphens having been inserted between the characters of a word. The programming interface looks like this:

def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean

It decides whether a sentence starting "Wordone wordtwo left right" is OK or should have been "Wordone wordtwo leftright". This might be calculated based on something like

P(Wordone wordtwo leftright | Wordone wordtwo) > P(Wordone wordtwo left | Wordone wordtwo)

or even

P(leftright) > P(left)
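A toy model implementing the `shouldJoin` interface with the simpler `P(leftright) > P(left)` rule might look like this. The word counts and the decision rule are purely illustrative; the project's real models are far larger:

```scala
object ToyLanguageModel {
  // Illustrative unigram counts standing in for a real frequency dictionary.
  val counts: Map[String, Int] = Map(
    "deep" -> 50, "learning" -> 40, "deeplearning" -> 0,
    "pre" -> 5, "processing" -> 30, "preprocessing" -> 60
  )

  // Unigram probability of a word under the toy counts.
  def p(word: String): Double = {
    val total = counts.values.sum.toDouble
    counts.getOrElse(word.toLowerCase, 0) / total
  }

  // Join the parts when the joined word is more probable than the left part
  // alone, i.e., the P(leftright) > P(left) approximation from the text.
  // A real model could also condition on prevWords; it is unused here.
  def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean =
    p(left + right) > p(left)

  def main(args: Array[String]): Unit = {
    println(shouldJoin("pre", "processing", Seq.empty))  // prints true: "preprocessing" is more frequent
    println(shouldJoin("deep", "learning", Seq.empty))   // prints false: "deeplearning" is unattested
  }
}
```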

The language models below are currently available. Both the gigaword and glove models not only use vocabulary from their respective dictionaries, but also dynamically add words from the document they are currently processing. A novel word, such as a product or brand name, that is seen without a hyphen in a document can be used to de-hyphenate other instances in the same document.
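The dynamic-vocabulary idea can be sketched as below: words seen whole in the document are added to the dictionary and then used to repair hyphenated instances. This is hypothetical illustration code, not the project's implementation:

```scala
object DynamicVocab {
  // De-hyphenate "left-right" pairs whose joined form is a known word,
  // where "known" includes words seen whole in this very document.
  def dehyphenate(document: String, dictionary: Set[String]): String = {
    // Dynamically add every unhyphenated word in the document to the vocabulary.
    val documentWords = "[A-Za-z]+".r.findAllIn(document).map(_.toLowerCase).toSet
    val vocabulary = dictionary ++ documentWords
    // Join each hyphenated pair whenever the joined form is in the vocabulary.
    "([A-Za-z]+)-([A-Za-z]+)".r.replaceAllIn(document, m => {
      val joined = m.group(1) + m.group(2)
      if (vocabulary.contains(joined.toLowerCase)) joined else m.matched
    })
  }

  def main(args: Array[String]): Unit = {
    // "Fooberry" is a made-up brand name absent from the base dictionary,
    // but it occurs whole later in the document, so "Foo-berry" is repaired.
    val doc = "The Foo-berry device works. Fooberry is new."
    // prints "The Fooberry device works. Fooberry is new."
    println(dehyphenate(doc, Set("the", "device", "works")))
  }
}
```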

A HuggingFace language model is also anticipated.

Command Line Syntax

Although this project is intended primarily as a library, several command line applications are included. Many read all the PDF files in an input directory, convert them to text, preprocess them for potential use with other NLP projects, and then write the results to an output directory. They differ mainly in which component converts the PDF to text. Pdf2txtApp should be noted in particular, since it is the most encompassing. Here are highlights from its help text.

Syntax

From the command line with sbt and having the git repo, use

sbt "run <arguments>"

or from the command line, after having run "sbt assembly" and changed to the target/scala-2.12 directory, or after having downloaded the jar file,

java -jar pdf2txt.jar <arguments>

Examples

<no_arguments>

converts all PDFs in the current directory to text files.

-in ./pdfs -out ./txts

converts all PDFs in ./pdfs to text files in ./txts.

-converter pdftotext -wordBreakBySpace false -in doc.pdf -out doc.txt

converts doc.pdf to doc.txt using pdftotext without the wordBreakBySpace preprocessor.

-converter text -in file.txt -out file.out.txt

preprocesses file.txt, resulting in file.out.txt.

To get the full help text, use -h, -help, or --help.

Memory

This software uses a large amount of memory for multiple large neural network models and dictionaries. It may not run on machines with less than 16GB of memory, particularly with ScienceParse, and even then, settings may need to be adjusted so that the available memory can actually be used. If you encounter errors indicating memory exhaustion, such as

[error] ## Exception when compiling 44 sources to /clulab/pdf2txt-project/pdf2txt/target/scala-2.11/classes
[error] java.lang.OutOfMemoryError: Java heap space

or

Exception in thread "ModelLoaderThread" java.lang.OutOfMemoryError: Java heap space

then here are some tips to try:

In each case adjust the number before the g (gigabytes) as needed.
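As a generic starting point, the JVM heap limit can be raised with the standard -Xmx option. These are ordinary Java and sbt command lines, not project-specific recommendations; adjust the number before the g to suit your machine:

```shell
# Run the pre-built jar with a 10 GB heap.
java -Xmx10g -jar pdf2txt.jar -in ./pdfs -out ./txts

# Pass the same heap setting through to sbt's JVM.
sbt -J-Xmx10g "run -in ./pdfs -out ./txts"
```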

Please note that the startup messages from fatdynet that are printed to stderr like the ones below are normal and not indicative of a problem.

[error] [dynet] Checking /home/user/pwd for libdynet_swig.so...
[error] [dynet] Checking /home/user for libdynet_swig.so...
[error] [dynet] Extracting resource libdynet_swig.so to /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] Loading DyNet from /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] random seed: 2522620396
[error] [dynet] allocating memory: 512MB
[error] [dynet] memory allocation done.