Hi Team,
I am using layoutparser and detectron2 to detect everything (text, tables, titles, and lists, but not figures) from a PDF, which I converted into images using pdf2image. I then want to extract the detected text, titles, tables, and lists in .txt format.
Issues:
1) It seems the model is not recognizing all of the text data properly.
2) While extracting the data in .txt format:
a) I am not able to print the text data in the sequence in which it appears on the PDF.
b) I am not able to extract the table data in tabular format.
Can you please suggest how I can resolve the above issues? Thank you!
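To make issue 2(b) concrete, the transformation I am aiming for would be something like this dependency-free sketch, which assumes the columns in Tesseract's raw output for a simple table are separated by runs of two or more spaces (the sample text and function name are made up for illustration):

```python
import re

def ocr_text_to_tsv(ocr_text):
    """Convert raw OCR output of a simple table into tab-separated rows.

    Assumes columns in the OCR text are separated by runs of two or
    more spaces, which is often (but not always) true for Tesseract
    output of plain tables.
    """
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        cells = re.split(r'\s{2,}', line)
        rows.append('\t'.join(cells))
    return '\n'.join(rows)

sample = "Part      Qty   Price\nBolt      10    0.25\nWasher    40    0.05"
print(ocr_text_to_tsv(sample))
# → Part\tQty\tPrice (and so on, one tab-separated row per table row)
```

This is only a heuristic; it breaks down when cells themselves contain double spaces or when Tesseract merges columns.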
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

# Define the PDF path
pdf_file = '7050X_Q_A.pdf'

# Define your output file name here
output_file = 'output.txt'

# Load the layout model and OCR agent once, outside the page loop,
# so they are not re-initialized for every page
model3 = lp.models.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)
ocr_agent = lp.TesseractAgent(languages='eng')

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        # Detect layout blocks and drop figures
        layout_result3 = model3.detect(img)
        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        # Split blocks into left and right columns, then sort each top-to-bottom
        h, w = img.shape[:2]
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)
        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])
        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])
        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])

        # Visualize the detected blocks with their reading-order ids
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)  # notebook built-in; use IPython.display.display in a script

        # OCR each block and attach the recognized text to it
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))
            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
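The column-splitting-and-sorting step above is the part that controls reading order. Stripped of the libraries, the ordering I am trying to get can be sketched with hypothetical (x0, y0, x1, y1) boxes on a 1000-px-wide page:

```python
def reading_order(blocks, page_width):
    """Order boxes as: left-column blocks top-to-bottom, then right-column
    blocks top-to-bottom. Each block is a hypothetical (x0, y0, x1, y1) tuple.
    """
    left, right = [], []
    for box in blocks:
        x0, y0, x1, y1 = box
        centre_x = (x0 + x1) / 2
        # Same 5% tolerance past the page midline as the lp.Interval above
        (left if centre_x <= page_width / 2 * 1.05 else right).append(box)
    left.sort(key=lambda b: b[1])   # sort by top edge (y0)
    right.sort(key=lambda b: b[1])
    return left + right

blocks = [(520, 100, 900, 200),   # right column, top
          (50, 400, 450, 500),    # left column, bottom
          (50, 100, 450, 200)]    # left column, top
print(reading_order(blocks, 1000))
# → [(50, 100, 450, 200), (50, 400, 450, 500), (520, 100, 900, 200)]
```

This only works for pages that really are two-column; a single-column page with a wide block whose centre falls right of the midline would be mis-ordered.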
Code: Install necessary libraries
Install detectron2:
!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
Install layoutparser:
!pip install layoutparser
!pip install layoutparser[ocr]
Install opencv, numpy, matplotlib, pdf2image, poppler, google-cloud-vision, and Tesseract:
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract
Environment
!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
Thanks,
Reema Jain