VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

Marker

Marker converts PDF to markdown quickly and accurately.

How it works

Marker is a pipeline of deep learning models:

It only uses models where necessary, which improves speed and accuracy.

Examples

| PDF | Type | Marker | Nougat |
|-----|------|--------|--------|
| Think Python | Textbook | View | View |
| Think OS | Textbook | View | View |
| Switch Transformers | arXiv paper | View | View |
| Multi-column CNN | arXiv paper | View | View |

Performance

[Figure: overall benchmark results]

The results above were produced with marker and nougat each set up to use ~4GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Hosted API

There's a hosted API for marker available here:

Community

Discord is where we discuss future development.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

Installation

You'll need Python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install marker-pdf
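Before installing, it can help to confirm the two prerequisites named above. The following is a minimal sketch, not part of marker: `check_environment` is a hypothetical helper that checks the interpreter version and whether a `torch` package can be found, without importing it.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the installation notes


def check_environment(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(map(str, sys.version_info[:3]))
        needed = ".".join(map(str, min_python))
        problems.append(f"Python {needed}+ required, found {found}")
    # find_spec locates the package without importing it (importing torch is slow).
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch (torch) is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result only means the prerequisites are present; it does not check that torch can see your GPU.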

Optional: OCRMyPDF

Only needed if you want to use the optional ocrmypdf as the ocr backend. Note that ocrmypdf includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.

See the instructions here

Usage

First, some configuration:

Interactive App

I've included a streamlit app that lets you interactively try marker with some basic options. Run it with:

pip install streamlit
marker_gui

Convert a single file

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 

The list of supported languages for surya OCR is here. If you need more languages, you can use any language supported by Tesseract if you set OCR_ENGINE to ocrmypdf. If you don't need OCR, marker can work with any language.

Convert multiple files

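marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10

An optional JSON metadata file can set the languages for each PDF, keyed by file name. Its format is:

{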
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}

You can use language names or codes. The exact codes depend on the OCR engine. See here for a full list for surya codes, and here for tesseract.
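If you have many PDFs, the metadata file can be generated rather than written by hand. This is a rough sketch under the assumption that the file is a flat mapping from PDF file name to a `languages` list, as in the example above; `write_metadata` and its parameters are hypothetical names, not part of marker.

```python
import json
from pathlib import Path


def write_metadata(pdf_dir, out_path, default_languages, overrides=None):
    """Write {filename: {"languages": [...]}} for every *.pdf in pdf_dir."""
    overrides = overrides or {}
    meta = {
        # Use the per-file override if present, otherwise the default list.
        pdf.name: {"languages": overrides.get(pdf.name, default_languages)}
        for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
    }
    Path(out_path).write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

For example, `write_metadata("../pdf_in", "../pdf_meta.json", ["English"], {"pdf2.pdf": ["Spanish", "Russian"]})` would mark every PDF as English except `pdf2.pdf`.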

Convert multiple files on multiple GPUs

METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out

Note that the env variables above are specific to this script, and cannot be set in local.env.
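Because these variables must be set in the process environment, launching the script from Python means passing them explicitly. A minimal sketch, assuming `marker_chunk_convert` is on your PATH; the helper names are hypothetical, and only the variable names and arguments shown in the command above are used.

```python
import os
import subprocess


def chunk_convert_command(in_dir, out_dir, num_devices, num_workers, metadata_file=None):
    """Build (argv, env) for marker_chunk_convert without running it."""
    env = dict(os.environ)  # inherit PATH and everything else
    env["NUM_DEVICES"] = str(num_devices)
    env["NUM_WORKERS"] = str(num_workers)
    if metadata_file is not None:
        env["METADATA_FILE"] = str(metadata_file)
    return ["marker_chunk_convert", str(in_dir), str(out_dir)], env


def run_chunk_convert(*args, **kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    argv, env = chunk_convert_command(*args, **kwargs)
    return subprocess.run(argv, env=env, check=True)
```

Calling `run_chunk_convert("../pdf_in", "../md_out", 4, 15, "../pdf_meta.json")` should be equivalent to the shell command above.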

Use from python

See the convert_single_pdf function for additional arguments that can be passed.

from marker.convert import convert_single_pdf
from marker.models import load_all_models

fpath = "FILEPATH"
model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf(fpath, model_lst)
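The three return values can then be written to disk. The sketch below is one possible way to do that, not marker's own output code. It assumes `full_text` is a string, `out_meta` is JSON-serializable, and `images` maps file names to objects with a `save(path)` method (as PIL images have); `save_conversion` and `stem` are hypothetical names.

```python
import json
from pathlib import Path


def save_conversion(full_text, images, out_meta, out_dir, stem="output"):
    """Write markdown, metadata JSON, and images into out_dir; return the markdown path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    md_path = out_dir / f"{stem}.md"
    md_path.write_text(full_text, encoding="utf-8")

    meta_path = out_dir / f"{stem}_meta.json"
    meta_path.write_text(json.dumps(out_meta, indent=4), encoding="utf-8")

    # ASSUMPTION: each value exposes save(path), as PIL.Image.Image does.
    for filename, image in images.items():
        image.save(str(out_dir / filename))
    return md_path
```

With the variables from the snippet above, `save_conversion(full_text, images, out_meta, "out", stem="mydoc")` would produce `out/mydoc.md`, `out/mydoc_meta.json`, and one file per extracted image.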

Output format

The output will be a markdown file, but there will also be a metadata json file that gives information about the conversion process. It has these fields:

{
    "languages": null, // any languages that were passed in
    "filetype": "pdf", // type of the file
    "pdf_toc": [], // the table of contents from the pdf
    "computed_toc": [], // the computed table of contents
    "pages": 10, // page count
    "ocr_stats": {
        "ocr_pages": 0, // number of pages OCRed
        "ocr_failed": 0, // number of pages where OCR failed
        "ocr_success": 0,
        "ocr_engine": "none"
    },
    "block_stats": {
        "header_footer": 0,
        "code": 0, // number of code blocks
        "table": 2, // number of tables
        "equations": {
            "successful_ocr": 0,
            "unsuccessful_ocr": 0,
            "equations": 0
        }
    }
}
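These fields are handy for spotting documents that needed heavy OCR or where OCR failed. The following sketch derives a few numbers from a metadata dict with the field names listed above; `summarize_metadata` is a hypothetical helper, and the sample values are copied from the listing (in practice you would load the dict from the metadata file with `json.load`).

```python
def summarize_metadata(meta):
    """Return a few derived numbers from a marker metadata dict."""
    pages = meta.get("pages") or 0
    ocr = meta.get("ocr_stats", {})
    blocks = meta.get("block_stats", {})
    ocr_pages = ocr.get("ocr_pages", 0)
    return {
        "pages": pages,
        # Guard against division by zero for empty documents.
        "ocr_fraction": ocr_pages / pages if pages else 0.0,
        "ocr_failed": ocr.get("ocr_failed", 0),
        "tables": blocks.get("table", 0),
        "code_blocks": blocks.get("code", 0),
    }


sample = {
    "languages": None,
    "filetype": "pdf",
    "pages": 10,
    "ocr_stats": {"ocr_pages": 0, "ocr_failed": 0, "ocr_success": 0, "ocr_engine": "none"},
    "block_stats": {"header_footer": 0, "code": 0, "table": 2},
}
print(summarize_metadata(sample))
```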

API server

There is a very simple API server you can run like this:

pip install -U uvicorn fastapi python-multipart
marker_server --port 8001

This will start a fastapi server that you can access at localhost:8001. You can go to localhost:8001/docs to see the endpoint options.

Note that this is not a very robust API, and is only intended for small-scale use. If you want to use this server, but want a more robust conversion option, you can run against the hosted Datalab API. You'll need to register and get an API key, then run:

marker_server --port 8001 --api_key API_KEY

Note: This is not the recommended way to use the Datalab API. It's only provided as a convenience for people wrapping the marker repo. The recommended way is to make a POST request to the endpoint directly from your code rather than proxying through this server.

You can send requests like this:

import requests
import json

post_data = {
    'filepath': 'FILEPATH',
    # Add other params here
}

requests.post("http://localhost:8001/marker", data=json.dumps(post_data)).json()

Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

Debugging

Set DEBUG=true to save data to the debug subfolder in the marker root directory. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

Useful settings

These settings can improve/change output quality:

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
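To make the idea of comparing a reference to extracted text concrete, here is a toy scorer. It is not the scoring used in the marker benchmarks: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and `normalize` and `similarity_score` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace so layout differences are not penalized."""
    return " ".join(text.split())


def similarity_score(reference, extracted):
    """Return a 0..1 similarity between reference text and extracted text."""
    ref, out = normalize(reference), normalize(extracted)
    if not ref and not out:
        return 1.0  # two empty texts are trivially identical
    # autojunk=False disables the popularity heuristic, which distorts long texts.
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()
```

`SequenceMatcher` is slow on long inputs, so a real benchmark would likely score documents in chunks or use a faster fuzzy-matching library.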

Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

| Method | Average Score | Time per page | Time per document |
|--------|---------------|---------------|-------------------|
| marker | 0.613721 | 0.631991 | 58.1432 |
| nougat | 0.406603 | 2.59702 | 238.926 |
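The "4x faster" figure can be checked directly from the table. This short calculation uses only the values above; the per-page and per-document ratios both come out at roughly 4.1x, and dividing the two marker timings gives the implied average document length in pages.

```python
# Values copied from the speed table above.
speed = {
    "marker": {"per_page": 0.631991, "per_doc": 58.1432},
    "nougat": {"per_page": 2.59702, "per_doc": 238.926},
}

page_speedup = speed["nougat"]["per_page"] / speed["marker"]["per_page"]
doc_speedup = speed["nougat"]["per_doc"] / speed["marker"]["per_doc"]
# Time per document divided by time per page gives pages per document.
pages_per_doc = speed["marker"]["per_doc"] / speed["marker"]["per_page"]

print(f"per-page speedup: {page_speedup:.2f}x")
print(f"per-document speedup: {doc_speedup:.2f}x")
print(f"implied average pages per document: {pages_per_doc:.0f}")
```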

Accuracy

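thinkpython.pdf, thinkos.pdf, and thinkdsp.pdf are non-arXiv books; multicolcnn.pdf, switch_trans.pdf, and crowd.pdf are arXiv papers.

| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
| marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
| nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663 |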
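Grouping the per-file scores shows where the accuracy gap comes from. The sketch below uses the values from the table and assumes the three `think*.pdf` files are the books and the remaining three are the arXiv papers. The overall means reproduce the Average Score column of the speed table.

```python
from statistics import mean

# Scores copied from the accuracy table above.
scores = {
    "marker": {
        "multicolcnn.pdf": 0.536176, "switch_trans.pdf": 0.516833,
        "thinkpython.pdf": 0.70515, "thinkos.pdf": 0.710657,
        "thinkdsp.pdf": 0.690042, "crowd.pdf": 0.523467,
    },
    "nougat": {
        "multicolcnn.pdf": 0.44009, "switch_trans.pdf": 0.588973,
        "thinkpython.pdf": 0.322706, "thinkos.pdf": 0.401342,
        "thinkdsp.pdf": 0.160842, "crowd.pdf": 0.525663,
    },
}

# ASSUMPTION: think*.pdf are the books; everything else is an arXiv paper.
books = [name for name in scores["marker"] if name.startswith("think")]
papers = [name for name in scores["marker"] if not name.startswith("think")]


def group_mean(method, names):
    """Average score of one method over the given files."""
    return mean(scores[method][name] for name in names)


for method in scores:
    print(
        method,
        f"books={group_mean(method, books):.3f}",
        f"arxiv={group_mean(method, papers):.3f}",
        f"overall={mean(scores[method].values()):.6f}",
    )
```

Under that grouping, marker leads by a wide margin on the books, while the two methods are close on the arXiv papers.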

Peak GPU memory usage during the benchmark is 4.2GB for nougat, and 4.1GB for marker. Benchmarks were run on an A6000 Ada.

Throughput

Marker takes about 4GB of VRAM on average per task, so on an A6000 (48GB of VRAM) you can convert 12 documents in parallel.
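The same arithmetic gives a starting point for other GPUs. A minimal sketch, assuming 48GB of VRAM for the A6000 and the ~4GB per task average quoted above; `max_parallel_tasks` is a hypothetical helper, and real usage varies by document, so leave some headroom.

```python
def max_parallel_tasks(total_vram_gb, per_task_gb):
    """Number of tasks that fit in VRAM, rounded down."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    return int(total_vram_gb // per_task_gb)


print(max_parallel_tasks(48, 4))  # A6000: 48GB total, ~4GB per task
```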

[Figure: benchmark results]

Running your own benchmarks

You can benchmark the performance of marker on your machine. Install marker manually with:

git clone https://github.com/VikParuchuri/marker.git
cd marker
poetry install

Download the benchmark data here and unzip. Then run the overall benchmark like this:

python benchmarks/overall.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

Thank you to the authors of these models and datasets for making them available to the community!