Thanks for pointing this out. Adding support for PDFs is among our future plans. In the meantime, you could use the pdf2image
library to first convert the PDF into page images and perform layout detection on them. You might want to go to their GitHub homepage for detailed installation instructions: https://github.com/Belval/pdf2image .
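In case it helps, a minimal sketch of that route could look like the following (the file name and model config are just placeholders; the PubLayNet config is one of the pre-trained models from the layoutparser model zoo):

import layoutparser as lp
from pdf2image import convert_from_path

# Render each PDF page to a PIL.Image (default dpi is 200)
pages = convert_from_path("document.pdf", dpi=200)

# Example pre-trained layout detection model from the model zoo
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

for page_image in pages:
    layout = model.detect(page_image)  # lp.Layout of detected blocks
    text_blocks = lp.Layout([b for b in layout if b.type == "Text"])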
@lolipopshock thanks for the response! I thought about this workflow as well; however, I assume this approach would ignore the fact that the PDF files have already been run through OCR beforehand (i.e. they are readable). I find it somewhat cumbersome to first convert a readable PDF to an image, only to then re-apply OCR...
You might want to check the pdfplumber library, and here is some starter code for you:
import pdfplumber
import layoutparser as lp
from typing import List, Union, Dict, Any, Tuple

def obtain_word_tokens(cur_page: pdfplumber.page.Page) -> lp.Layout:
    # Extract word-level tokens (with font information) from a pdfplumber page
    words = cur_page.extract_words(
        x_tolerance=1.5,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=True,
        horizontal_ltr=True,
        vertical_ttb=True,
        extra_attrs=["fontname", "size"],
    )
    # Wrap each word in a layoutparser TextBlock so the page can be handled as a Layout
    return lp.Layout([
        lp.TextBlock(
            lp.Rectangle(float(ele['x0']), float(ele['top']),
                         float(ele['x1']), float(ele['bottom'])),
            text=ele['text'],
            id=idx,
        ) for idx, ele in enumerate(words)
    ])

plumber_pdf_object = pdfplumber.open(pdf_path)  # pdf_path points to your PDF file
all_pages_tokens = []
for page_id in range(len(plumber_pdf_object.pages)):
    cur_page = plumber_pdf_object.pages[page_id]
    tokens = obtain_word_tokens(cur_page)
    all_pages_tokens.append(tokens)
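As a small illustration (not part of the original snippet), each entry of all_pages_tokens is an lp.Layout of word-level TextBlocks, so you can for example join a page back into plain text:

for page_tokens in all_pages_tokens:
    # words come in pdfplumber's text-flow order
    page_text = " ".join(tok.text for tok in page_tokens)
    print(page_text[:200])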
I am thinking of combining Poppler and this layout parser to extract structured paragraphs from readable PDFs, so that the advantages of both can be leveraged. Has anyone done that?
Hm, indeed, that would help to extract the actual text, token by token. However, if I understand correctly, this approach would still require something like the pdf2image library to convert the PDF to images in order to detect the actual contiguous text blocks, and then matching the word-level text blocks to the Layout object, i.e. the page-level TextBlocks detected by one of the layout detection models (so as not to incorporate words/tokens that belong to figures or tables). I am still not quite sure about the optimal pipeline. Do you expect to implement the PDF functionality in the near future? Otherwise, I would proceed with one of the "dirtier" workflows proposed in this thread.
EDIT: I started implementing the following workflow:
1. Convert the PDF to page images using pdf2image.
2. Detect the page-level text blocks on those images using layoutparser.
3. Extract the word-level tokens using pdfplumber.
4. Match (with some soft_margin) the word-level tokens from 3. to the text_blocks detected in 2., then sort the tokens within each text_block according to their id to produce a contiguous string (see the sketch below).
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by pdfplumber are on a different scale compared to the coordinates detected in 2., which in turn depend on the dpi used with convert_from_path in the pdf2image library, I assume...
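For what it's worth, a rough sketch of step 4 could look like this. Here tokens is the pdfplumber-derived lp.Layout for one page (in 72-dpi PDF points) and detected_blocks is the layout detected on the corresponding 200-dpi pdf2image render; both names are placeholders, and the soft_margin keys should be double-checked against the layoutparser docs:

scale = 72 / 200  # image pixels (200 dpi) -> PDF points (72 dpi)

block_texts = []
for block in detected_blocks:
    block_pdf = block.scale(scale)  # bring the detected block into pdfplumber's coordinate space
    words_in_block = [
        w for w in tokens
        if w.is_in(block_pdf, soft_margin={"left": 2, "right": 2, "top": 2, "bottom": 2})
    ]
    words_in_block.sort(key=lambda w: w.id)  # restore reading order via the token id
    block_texts.append(" ".join(w.text for w in words_in_block))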
Hi there, you should probably add an additional step between 2. and 3. to check whether there is any text embedded inside the boundaries of the text_block, because you may have to fall back to OCR when there is no embedded text. I am currently working on a project where I need a structured output for a handful of PDFs of varying quality; some of them contain embedded text, while others are just filled with images from scanned documents.
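A possible shape for that extra check, using pdfplumber's within_bbox()/extract_text() on the (already rescaled) block coordinates; the helper name is made up and the OCR fallback is only indicated as a comment:

def block_has_embedded_text(plumber_page, block_pdf):
    """Return True if the PDF contains selectable text inside the given block (in PDF points)."""
    x1, y1, x2, y2 = block_pdf.coordinates
    cropped = plumber_page.within_bbox((x1, y1, x2, y2))
    text = cropped.extract_text()
    return bool(text and text.strip())

# if not block_has_embedded_text(cur_page, block_pdf):
#     # fall back to OCR for this block, e.g. by cropping the page image and running Tesseract
#     ...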
Something like tabula ( https://github.com/tabulapdf/tabula ) might be helpful? Although it is geared toward tables, it does leverage the structure of the PDF to pull out tables.
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by pdfplumber are on a different scale compared to the coordinates detected in 2., which in turn depend on the dpi used with convert_from_path in the pdf2image library, I assume...
May I ask how you solved this scale issue? I also get different x,y coordinates for the text from pdfplumber and the layout boxes from LayoutParser.
UPDATE: It looks like the default dpi of pdf2image is 200, while pdfplumber uses 72 dpi by default. So if you multiply (or divide, depending on your perspective) the coordinates by 200/72, it seems to solve the issue. The thing is, should you always rely on that 72 dpi for every PDF? I couldn't find out how to enforce 200 dpi for pdfplumber, but that is of course off topic for this issue.
Hi @gevezex
May I ask how you solved this scale issue.
I must admit that I simply dropped the project for now. I found the workflow way too tedious and hope the team will add support for already readable PDF files in the future.
Solved it like this with PyMuPDF (pip install pymupdf). I hope it can help someone with the same issue. Also check the PyMuPDF utilities for retrieving text from given box coordinates.
# Function for rescaling LayoutParser xy coordinates (200 dpi image space) to PDF points (72 dpi)
def scale_xy(textblock, scale=72/200):
x1 = textblock.block.x_1 * scale
y1 = textblock.block.y_1 * scale
x2 = textblock.block.x_2 * scale
y2 = textblock.block.y_2 * scale
return (x1,y1,x2,y2)
# Using PyMuPdf for retrieving text in a bounding box
import fitz # this is pymupdf
# Function for retrieving the tokens (words). See pymupdf utilities
def make_text(words):
"""Return textstring output of get_text("words").
Word items are sorted for reading sequence left to right,
top to bottom.
"""
line_dict = {} # key: vertical coordinate, value: list of words
words.sort(key=lambda w: w[0]) # sort by horizontal coordinate
for w in words: # fill the line dictionary
y1 = round(w[3], 1) # bottom of a word: don't be too picky!
word = w[4] # the text of the word
line = line_dict.get(y1, []) # read current line content
line.append(word) # append new word
line_dict[y1] = line # write back to dict
lines = list(line_dict.items())
lines.sort() # sort vertically
return "\n".join([" ".join(line[1]) for line in lines])
# Open your pdf in pymupdf
pdf_doc = fitz.open('/location/to/your/file.pdf')
pdf_page4 = pdf_doc[3] # pages are zero-indexed, so this will retrieve page 4
words = pdf_page4.get_text("words")
# Get one of the TextBlocks detected by your LayoutParser model
# In the docs the detection result was called "layout", so we will use that name
# First detected bounding box: layout[0]
# For my PDF, printing layout[0] gives output like this:
>>> TextBlock(block=Rectangle(x_1=104.882, y_1=133.696, x_2=124.79, y_2=147.696), text=Het, id=0, type=None, parent=None, next=None, score=None)
# Rescale the coordinates
new_coordinates = scale_xy(layout[0])
# Create a Rect object for fitz (similar to TextBlock for the bounding box coordinates)
rect = fitz.Rect(*new_coordinates)
# Now we can find and print all the tokens in the bounding box:
mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
print("\nSelect the words intersecting the rectangle")
print("-------------------------------------------")
print(make_text(mywords))
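As a follow-up sketch, you could extend this to a whole page (assuming layout is the LayoutParser result for the same page as pdf_page4) by looping over all detected blocks:

structured_page = []
for block in layout:
    rect = fitz.Rect(*scale_xy(block))  # rescale and wrap as a fitz.Rect
    block_words = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
    structured_page.append({"id": block.id, "type": block.type, "text": make_text(block_words)})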
Sorry for the confusion of terminologies. I am still learning PDF-related stuff.
You might also want to refer to the PDF parsers that I've implemented in another project recently -- https://github.com/allenai/VILA/blob/master/src/vila/pdftools/pdfplumber_extractor.py @ https://github.com/allenai/VILA/pull/6 . They provide similar functionality and are readily applicable to the layout-parser library as well. I will merge the PDF parsers into the layout-parser library soon.
See #71 and #72
Hi there, from the docs I infer that detect() operates, for example, on PIL.Image objects. Is there a way to operate directly on already readable PDF files (which would also obviate the need to apply OCR)? Greetings