Thanks for pointing this out. Adding support for PDFs is among our future plans. In the meantime, you could use the pdf2image
library to first convert the PDF into page images and perform layout detection on them. You might want to go to their GitHub homepage for detailed installation instructions: https://github.com/Belval/pdf2image .
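In case it helps, a minimal sketch of that route could look like the following (the file name and model config are just placeholders; the PubLayNet config is one of the pre-trained models from the layoutparser model zoo):

import layoutparser as lp
from pdf2image import convert_from_path

# Render each PDF page to a PIL.Image (default dpi is 200)
pages = convert_from_path("document.pdf", dpi=200)

# Example pre-trained layout detection model from the model zoo
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

for page_image in pages:
    layout = model.detect(page_image)  # lp.Layout of detected blocks
    text_blocks = lp.Layout([b for b in layout if b.type == "Text"])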
@lolipopshock thanks for the response! I thought about this workflow as well; however, I assume this approach would ignore the fact that the PDF files have already been run through OCR beforehand (i.e. they are readable). I find it somewhat cumbersome to first convert a readable PDF to an image, only to then re-apply OCR...
You might want to check the pdfplumber library, and here is some starter code for you:
import pdfplumber
import layoutparser as lp
from typing import List, Union, Dict, Any, Tuple

def obtain_word_tokens(cur_page: pdfplumber.page.Page) -> lp.Layout:
    # Extract word-level tokens (with font information) from a pdfplumber page
    words = cur_page.extract_words(
        x_tolerance=1.5,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=True,
        horizontal_ltr=True,
        vertical_ttb=True,
        extra_attrs=["fontname", "size"],
    )
    # Wrap each word in a layoutparser TextBlock so the page can be handled as a Layout
    return lp.Layout([
        lp.TextBlock(
            lp.Rectangle(float(ele['x0']), float(ele['top']),
                         float(ele['x1']), float(ele['bottom'])),
            text=ele['text'],
            id=idx,
        ) for idx, ele in enumerate(words)
    ])

plumber_pdf_object = pdfplumber.open(pdf_path)  # pdf_path points to your PDF file
all_pages_tokens = []
for page_id in range(len(plumber_pdf_object.pages)):
    cur_page = plumber_pdf_object.pages[page_id]
    tokens = obtain_word_tokens(cur_page)
    all_pages_tokens.append(tokens)
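As a small illustration (not part of the original snippet), each entry of all_pages_tokens is an lp.Layout of word-level TextBlocks, so you can for example join a page back into plain text:

for page_tokens in all_pages_tokens:
    # words come in pdfplumber's text-flow order
    page_text = " ".join(tok.text for tok in page_tokens)
    print(page_text[:200])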
I am thinking of combining Poppler and this layout parser to extract structured paragraphs from readable PDFs, so that the advantages of both can be leveraged. Has anyone done that?
Hm, indeed, that would help to extract the actual text, token by token. However, if I understand correctly, this approach would still require something like the pdf2image library to convert the PDF to images in order to detect the actual contiguous text blocks, and then matching the word-level text blocks to the Layout object, i.e. the page-level TextBlocks detected by one of the layout detection models (so as not to incorporate words/tokens that belong to figures or tables). I am still not quite sure about the optimal pipeline. Do you expect to implement the PDF functionality in the near future? Otherwise, I would proceed with one of the "dirtier" workflows proposed in this thread.
EDIT: I started implementing the following workflow:
1. Convert the PDF to page images using pdf2image.
2. Detect the page-level text blocks on those images using layoutparser.
3. Extract the word-level tokens using pdfplumber.
4. Match (with some soft_margin) the word-level tokens from 3. to the text_blocks detected in 2., then sort the tokens within each text_block according to their id to produce a contiguous string (see the sketch below).
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by pdfplumber are on a different scale compared to the coordinates detected in 2., which in turn depend on the dpi used with convert_from_path in the pdf2image library, I assume...
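For what it's worth, a rough sketch of step 4 could look like this. Here tokens is the pdfplumber-derived lp.Layout for one page (in 72-dpi PDF points) and detected_blocks is the layout detected on the corresponding 200-dpi pdf2image render; both names are placeholders, and the soft_margin keys should be double-checked against the layoutparser docs:

scale = 72 / 200  # image pixels (200 dpi) -> PDF points (72 dpi)

block_texts = []
for block in detected_blocks:
    block_pdf = block.scale(scale)  # bring the detected block into pdfplumber's coordinate space
    words_in_block = [
        w for w in tokens
        if w.is_in(block_pdf, soft_margin={"left": 2, "right": 2, "top": 2, "bottom": 2})
    ]
    words_in_block.sort(key=lambda w: w.id)  # restore reading order via the token id
    block_texts.append(" ".join(w.text for w in words_in_block))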
Hi there, you should probably add an additional step between 2. and 3. to check whether there is any text embedded inside the boundaries of the text_block, because you may have to fall back to OCR when there is no embedded text. I am currently working on a project where I need a structured output for a handful of PDFs of varying quality; some of them contain embedded text, while others are just filled with images from scanned documents.
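A possible shape for that extra check, using pdfplumber's within_bbox()/extract_text() on the (already rescaled) block coordinates; the helper name is made up and the OCR fallback is only indicated as a comment:

def block_has_embedded_text(plumber_page, block_pdf):
    """Return True if the PDF contains selectable text inside the given block (in PDF points)."""
    x1, y1, x2, y2 = block_pdf.coordinates
    cropped = plumber_page.within_bbox((x1, y1, x2, y2))
    text = cropped.extract_text()
    return bool(text and text.strip())

# if not block_has_embedded_text(cur_page, block_pdf):
#     # fall back to OCR for this block, e.g. by cropping the page image and running Tesseract
#     ...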
Something like tabula ( https://github.com/tabulapdf/tabula ) might be helpful? Although it is geared toward tables, it does leverage the structure of the PDF to pull out tables.
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by pdfplumber are on a different scale compared to the coordinates detected in 2., which in turn depend on the dpi used with convert_from_path in the pdf2image library, I assume...
May I ask how you solved this scale issue? I also get different x,y coordinates for the text from pdfplumber and the layout boxes from LayoutParser.
UPDATE: It looks like the default dpi of pdf2image is 200, while pdfplumber uses 72 dpi by default. So if you multiply (or divide, depending on your perspective) the coordinates by 200/72, it seems to solve the issue. The thing is, should you always rely on that 72 dpi for every PDF? I couldn't find out how to enforce 200 dpi for pdfplumber, but that is of course off topic for this issue.
Hi @gevezex
May I ask how you solved this scale issue.
I must admit that I simply dropped the project for now. I found the workflow way too tedious and hope the team will add support for already readable PDF files in the future.
Solved it like this with PyMuPDF (pip install pymupdf). I hope it can help someone with the same issue. Also check the PyMuPDF utilities for retrieving text from given box coordinates.
# Function for rescaling LayoutParser xy coordinates (200 dpi image space) to PDF points (72 dpi)
def scale_xy(textblock, scale=72/200):
x1 = textblock.block.x_1 * scale
y1 = textblock.block.y_1 * scale
x2 = textblock.block.x_2 * scale
y2 = textblock.block.y_2 * scale
return (x1,y1,x2,y2)
# Using PyMuPdf for retrieving text in a bounding box
import fitz # this is pymupdf
# Function for retrieving the tokens (words). See pymupdf utilities
def make_text(words):
"""Return textstring output of get_text("words").
Word items are sorted for reading sequence left to right,
top to bottom.
"""
line_dict = {} # key: vertical coordinate, value: list of words
words.sort(key=lambda w: w[0]) # sort by horizontal coordinate
for w in words: # fill the line dictionary
y1 = round(w[3], 1) # bottom of a word: don't be too picky!
word = w[4] # the text of the word
line = line_dict.get(y1, []) # read current line content
line.append(word) # append new word
line_dict[y1] = line # write back to dict
lines = list(line_dict.items())
lines.sort() # sort vertically
return "\n".join([" ".join(line[1]) for line in lines])
# Open your pdf in pymupdf
pdf_doc = fitz.open('/location/to/your/file.pdf')
pdf_page4 = pdf_doc[3] # pages are zero-indexed, so this will retrieve page 4
words = pdf_page4.get_text("words")
# Get one of the TextBlocks detected by your LayoutParser model
# In the docs the detection result was called "layout", so we will use that name
# First detected bounding box: layout[0]
# For my PDF, printing layout[0] gives output like this:
>>> TextBlock(block=Rectangle(x_1=104.882, y_1=133.696, x_2=124.79, y_2=147.696), text=Het, id=0, type=None, parent=None, next=None, score=None)
# Rescale the coordinates
new_coordinates = scale_xy(layout[0])
# Create a Rect object for fitz (similar to TextBlock for the bounding box coordinates)
rect = fitz.Rect(*new_coordinates)
# Now we can find and print all the tokens in the bounding box:
mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
print("\nSelect the words intersecting the rectangle")
print("-------------------------------------------")
print(make_text(mywords))
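As a follow-up sketch, you could extend this to a whole page (assuming layout is the LayoutParser result for the same page as pdf_page4) by looping over all detected blocks:

structured_page = []
for block in layout:
    rect = fitz.Rect(*scale_xy(block))  # rescale and wrap as a fitz.Rect
    block_words = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
    structured_page.append({"id": block.id, "type": block.type, "text": make_text(block_words)})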
Sorry for the confusion of terminologies. I am still learning PDF-related stuff.
You might also want to refer to the PDF parsers that I've implemented in another project recently -- https://github.com/allenai/VILA/blob/master/src/vila/pdftools/pdfplumber_extractor.py @ https://github.com/allenai/VILA/pull/6 . They provide similar functionality and are readily applicable to the layout-parser library as well. I will merge the PDF parsers into the layout-parser library soon.
See #71 and #72
Hi there, from the docs I infer that detect() operates, for example, on PIL.Image objects. Is there a way to operate directly on already readable PDF files (which would also obviate the need to apply OCR)? Greetings