A CLI (command line interface) to extract text from PDF files. Use it from your terminal to dump a PDF file's text to standard output. Options exist to write to an output file, choose a page range, etc.
Usage: TextExtraction.exe filepath <option(s)>
filepath - pdf file path
Options:
-s, --start <d>                    start text extraction from this page index; use negative numbers to count back from the page count
-e, --end <d>                      end text extraction at this page index; use negative numbers to count back from the page count
-b, --bidi <RTL|LTR>               use the BiDi algorithm to convert visual order to logical order; provide the default direction per the document's writing direction
-p, --spacing <BOTH|HOR|VER|NONE>  add spaces between pieces of text based on their relative positions; default is BOTH
-t, --tables                       extract tables instead of text; each table is output as CSV
-o, --output /path/to/file         write the result to an output file (or files, for tables export)
-q, --quiet                        quiet run; only show errors and warnings
-h, --help                         show this help message
-d, --debug /path/to/file          create a debug output file
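A few example invocations (sample.pdf and the output paths are placeholders):

```shell
# dump all text from a PDF to standard output
TextExtraction.exe sample.pdf

# write the text to a file instead
TextExtraction.exe sample.pdf --output out.txt

# extract tables as CSV
TextExtraction.exe sample.pdf --tables --output tables.csv
```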
New with 1.1.5: binaries are available for download in the Releases section of the repo.
New: it is now also possible to use this CLI to extract tables. This is still experimental, since tables may be represented in many different ways, but with enough samples the code can be made more capable. When asking for table extraction, only tables are output, as CSV: standard output will show the CSV content of the PDF's tables. When outputting to files, each file contains a single table. The given output file name is used for the first table; later tables use it as a base name with an ordinal appended (starting from 1).
This is a C++ project using CMake as its build system. To build/develop you will need:
Once you have installed the prerequisites, you can build the project.
Start by creating project files in a "build" folder via the cmake configuration step, like this:
mkdir build
cd build
cmake ..
Note that at this point the process will look for the PDFHummus package. If it is not found locally, it will be downloaded from its repo, so an internet connection is required here.
Once you have the project files, you can build the project. If you generated IDE files, you can build from your IDE. Alternatively, you can build from the command line, again using cmake.
The following builds the project from its root folder:
cmake --build build --config release
This will build the project inside the build folder. You will find the resulting executable where your build environment normally puts it; for example, on Windows, the TextExtractionCLI/Release folder will contain the executable, named TextExtraction.
The project builds both the CLI executable and a dependency lib. The lib can be used in another project for PDF text extraction, and the CLI code is a good example of how to use it.
If you want, you can use the "install" verb of cmake to install a built product. Use the --prefix parameter to specify where you want the result installed:
cmake --install ./build --prefix ./etc/install --config release --component executables
This will install the TextExtraction executable in ./etc/install. To install the CLI together with its dependent libs, drop the --component executables part.
If you do not have cmake --install as an option, you can run a regular build with the install target instead, specifying the install prefix at configuration stage, like this:
cd build
cmake .. -DCMAKE_INSTALL_PREFIX="../etc/install"
cd ..
cmake --build build/TextExtractionCLI --config release --target install
This project uses ctest for running tests. ctest is part of cmake and should be installed as part of the cmake installation. To run the project tests (after having created the project files in ./build), run:
ctest --test-dir build -C release
This should scan the folders for tests and run them.
The cmake project defines TextExtraction as a package. There are two targets in this package:
TextExtraction is a lib that you can use in your own project to extract text. You can read the CLI code as a useful example of how to use the lib.
TextExtractionCLI is the CLI part, which you can use as a target as well.
In your project's CMake file you can import the project like a regular package:
find_package (TextExtraction)
target_link_libraries(MyTarget TextExtraction::TextExtraction)
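A minimal consuming CMakeLists.txt might look like this (project and target names here are placeholders):

```cmake
cmake_minimum_required(VERSION 3.15)
project(MyApp)

find_package(TextExtraction REQUIRED)

add_executable(MyTarget main.cpp)
target_link_libraries(MyTarget TextExtraction::TextExtraction)
```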
If you are developing this project using VS Code, here are some suggestions to help you:
This should help you enable testing and debugging in VS Code. Specifically, you can debug the TextExtraction CLI with the (lldb) launch debug target, and the tests are debuggable as well.
The end result is an executable, so just run it from the command line (it's a regular CLI).
The minimal run requires a file path to a PDF from which you would like to read the text, say on Windows:
etc\install\bin\TextExtraction.exe sample.pdf
PDF files contain text as drawing instructions. As a result, what gets parsed follows the visual order of the text. This doesn't matter much if your text is Latin, or wholly left-to-right. However, when the PDF has right-to-left text, either by itself or combined with left-to-right text or even numbers, the parsed text will appear reversed, or otherwise disorganized. To take care of this there is support for the BiDi reversal algorithm. This algorithm is implemented in the ICU library, and this executable will use it if instructed to, and if the ICU library is available.
BiDi conversion is turned off by default, as it carries some performance cost; however, you can enable it with the USE_BIDI configuration variable. When calling cmake for configuration, add -DUSE_BIDI=1, like this:
# only if you didn't create the build folder yet
mkdir build
# then...
cd build
cmake .. -DUSE_BIDI=1
The module code does not come with the ICU library pre-bundled, so it will attempt to install it; if successful, BiDi conversion will be supported. You can tell that BiDi conversion is supported by checking the help text of TextExtraction: if it shows the -b, --bidi <RTL|LTR> option, then it is available.
The ICU library installation process will try the following:
brew install icu4c
See ../TextExtraction/CMakeLists.txt to try and make it work; there are pointers there for more info.

When parsing for tables, the final output is CSV. CSV can't express split cells (normally found in the header: a single cell spanning multiple columns, with an internal split providing the individual column header names), so parsing the internal columns/rows of a cell is not strictly needed. However, for the sake of exercise, and in case anyone wants to output this to Excel/Google Sheets/Numbers, where split cells are a reality, internal cell parsing for table structure is implemented and provides the relevant info. It's off by default, and you can use the SHOULD_PARSE_INTERNAL_TABLES configuration variable to turn it on. This means the CellInRow struct might have a non-null internalTable, when one such exists. When calling cmake for configuration, add -DSHOULD_PARSE_INTERNAL_TABLES=1 to get this parsing going.
If you want to use the text extraction capabilities in your own software, skip extract-text-cli.cpp and use the TextExtraction class directly. You provide it with a file path in ExtractText(), and later you can pick up the results with GetResultsAsText(). Modify it to your needs if you desire other forms of output. The internal structure textsForPages allows you to be more flexible in what you do with the text, and you can use GetResultsAsText as a reference implementation.
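As a rough sketch of that flow (the header name, return handling, and exact signatures here are assumptions; extract-text-cli.cpp in the repo is the authoritative reference):

```cpp
// Hypothetical usage sketch - header name and signatures are assumptions.
#include "TextExtraction.h"

#include <iostream>

int main() {
    TextExtraction extraction;
    extraction.ExtractText("sample.pdf");       // provide the input file path
    std::cout << extraction.GetResultsAsText(); // pick up the extracted text
    return 0;
}
```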
As for table extraction, the class TableExtraction might be of use. Its ExtractTables() method gets the same params as the text extraction's ExtractText(), and the results are placed in the tablesForPages data structure. To get CSV output you can either use GetAllAsCSVText, which returns a single string of all the tables' CSV representations concatenated, or the more useful GetTableAsCSVText, which gets a single Table construct from tablesForPages and returns a CSV representation of it.
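As an illustrative sketch (header name and exact signatures are assumptions; the CLI source shows the authoritative usage):

```cpp
// Hypothetical usage sketch - header name and signatures are assumptions.
#include "TableExtraction.h"

#include <iostream>

int main() {
    TableExtraction extraction;
    extraction.ExtractTables("sample.pdf");     // same params as ExtractText()
    std::cout << extraction.GetAllAsCSVText();  // all tables' CSVs, concatenated
    return 0;
}
```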
You are also welcome to use PDFRecursiveInterpreter directly for any content interpretation needs you may have.
The license is Apache 2.0, and is provided here.
This text extraction algorithm is based on a previous JavaScript-based implementation described here - https://pdfhummus.com/post/156548561656/extracting-text-from-pdf-files. Most limitations stated there hold for this implementation as well:
This implementation has a few enhancements on top of the original:
Table parsing is based on the very few samples I tried, so it's probably quite limited at this point. It reuses the text interpretation of the base text extraction algorithm, and also attempts to locate vertical and horizontal lines to determine tables from them. Vertical and horizontal lines are grouped into tables based on whether they have intersection relationships (direct, or indirect by intersecting with lines that in turn intersect, etc.), accounting for lines that only split cells and are not full column/row lines. It then attempts to determine rows, and cells in those rows. Finally, based on the text placement locations, it places the texts in their proper cells.
This implementation is based on the hummus PDF library. Specifically, it uses hummus's parsing capabilities to interpret page content and understand things like lines and texts.
Both TextExtraction and TableExtraction run through interpretation of the pages' content to extract the relevant placements: glyphs, or glyphs and lines, respectively. Then each attempts to derive texts from the glyphs and the parsed font data. For tables, lines are also inspected to determine the horizontal and vertical lines that form tables.
PDFRecursiveInterpreter is used for the very basic interpretation of PDF content. It is named recursive because it recurses into forms placed in the page content fed to it for interpretation. The interpreter fires an event to its handler every time it encounters a content drawing operator, providing the handler with both the operator and its operands. PDFRecursiveInterpreter can be used as-is in many possible implementations involving PDF content interpretation, such as extracting content (text, images, etc.) or even rendering.
The operators and operands are fed to the GraphicContentInterpreter. This class understands specific operators and what they do. At this point it understands anything that has to do with paths and texts, enough to support this code's needs, but more code can be added to it to understand much more, based on the desired implementation. In its current form it fires an event to its handler for every placed text element and for every placed path.
The TextInterpreter code converts the text placements provided by the interpreter into actual text. The text placements only contain glyph information and local graphic state; TextInterpreter adds font data to determine texts from the glyphs and their position on the page. Upon finishing translating a text placement, it fires its own text-complete event to provide its handler with the translated and positioned text element (there's a certain nuance here with respect to PDF text elements versus actual text placements, which this description will skip).
The TableComposer code builds tables from collections of vertical and horizontal lines and text. Normally used at the page level, it can figure out which lines map to which tables (in case there are multiple tables on the page) and which texts go into which cells. Its output is a list of tables, each defining rows, cells, and the texts in those cells. There's quite a bit of heuristics in the whole table construction process, which is why it's more experimental than the older text extraction part.
For tables there's also TableCSVExport, which exports a single Table object built by the TableComposer to a CSV string.