facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Not able to fetch all text data & Not able to extract text, table data in proper format #5209

Open reema93jain opened 7 months ago

reema93jain commented 7 months ago

Hi Team,

I am using layoutparser with Detectron2 to detect everything (text, tables, titles, lists) except figures in a PDF, which I first convert to images with pdf2image. I then want to extract the detected text, titles, tables, and lists to a .txt file.

Issues:
1) The model does not seem to recognize all of the text on the page.
2) When writing the results to the .txt file:
   a) the text blocks do not come out in the order in which they appear on the PDF;
   b) table data is not extracted in tabular format.

Can you please suggest how I can resolve the above issues? Thank you!
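For issue 2(a), would a plain top-to-bottom ordering of the detected blocks, along the lines of the rough sketch below, be the right direction? This refers to the text_blocks from my code below and assumes the page is effectively single-column; the ordered_blocks name is only illustrative.

# Rough sketch: sort all non-figure blocks by their top y coordinate, then by x,
# and renumber them in that reading order (single-column assumption;
# this would replace the left/right column split used further down).
ordered_blocks = sorted(text_blocks, key=lambda b: (b.coordinates[1], b.coordinates[0]))
text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(ordered_blocks)])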

Code:

Install necessary libraries

install detectron2:

!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'

install layoutparser

!pip install layoutparser
!pip install layoutparser[ocr]

install opencv, numpy, matplotlib

!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
import shutil
import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

# Define PDF path
pdf_file = '7050X_Q_A.pdf'

# Define your output file name here
output_file = 'output.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )

        layout_result3 = model3.detect(img)

        # Keep every detected block except figures
        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        # Split the page into left and right halves and sort each column top to bottom
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])

        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])

        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])

        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        # OCR each detected block with Tesseract
        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            # print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)

Environment

1. Windows
2. layoutparser & layoutparser[ocr] version 0.3.4
3. PyTorch version: 2.1.0+cu121 (installed via !pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121)
4. google-cloud-vision 3.5.0
5. google-api-core version: 2.11.1
6. Python 3.10.6

Thanks Reema Jain

github-actions[bot] commented 7 months ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

reema93jain commented 7 months ago

Hi Team,

I tried running the code on another PDF, 'Free_Test_Data_1MB_PDF.pdf', since I can't share the original PDF. The code below still does not fetch the text in sequence and does not reflect the table data (in tabular form) at all.

Instructions To Reproduce the Issue:

Code:

Install necessary libraries

install detectron2:

!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'

install layoutparser

!pip install layoutparser
!pip install layoutparser[ocr]

install opencv, numpy, matplotlib

!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
import shutil
import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

# Define PDF path
pdf_file = 'Free_Test_Data_1MB_PDF.pdf'

# Define your output file name here
output_file = 'output.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )

        layout_result3 = model3.detect(img)

        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])

        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])

        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])

        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)

Free_Test_Data_1MB_PDF.pdf

output (1).txt

Could you please advise on the above query?

Thank you Reema Jain

IsNeron commented 2 months ago

Same here: detectron2's faster_rcnn_R_50_FPN_3x and mask_rcnn_X_101_32x8d_FPN_3x simply ignore huge amounts of text.
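One thing that may be worth checking (just a guess, not a confirmed fix): MODEL.ROI_HEADS.SCORE_THRESH_TEST controls which detections are kept, so with the 0.5 used above, lower-confidence text regions are silently dropped. Lowering the threshold keeps more of them, at the cost of more spurious boxes. A minimal sketch, reusing the config from the code above (0.2 is an arbitrary example value and model_low_thresh is just an illustrative name):

# Same PubLayNet model as above, but with a lower detection-score threshold.
model_low_thresh = lp.models.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.2],  # lower than the 0.5 used earlier
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)
layout = model_low_thresh.detect(img)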