Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
Apache License 2.0

Not able to fetch all text data & Not able to extract text, table data in proper format #205

Open reema93jain opened 5 months ago

reema93jain commented 5 months ago

Hi Team,

I am using layoutparser and Detectron2 to detect everything, i.e. text, tables, titles, and lists (but not figures), from a PDF that I converted into images using pdf2image. I then want to extract the detected text, titles, tables, and lists to a .txt file.

Issues:
1) It seems the model is not recognizing all of the text data properly.
2) While extracting data to .txt format:
   a) I am not able to print the text in the sequence in which it appears in the PDF.
   b) I am not able to extract table data in tabular format.

Can you please suggest how I can resolve the above issues? Thank you!
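For clarity, the reading order I am after in issue 2a can be sketched in plain Python; the `(x_center, y_top)` tuples below are hypothetical stand-ins for the layoutparser blocks:

```python
# Sketch of two-column reading order (pure Python, hypothetical
# (x_center, y_top) tuples instead of layoutparser blocks).
def reading_order(blocks, page_width):
    # Blocks whose center falls left of the page midline are read first,
    # top to bottom, then the right column, top to bottom.
    left = sorted((b for b in blocks if b[0] < page_width / 2), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= page_width / 2), key=lambda b: b[1])
    return left + right

blocks = [(900, 50), (100, 300), (100, 50), (900, 400)]
print(reading_order(blocks, 1200))
# [(100, 50), (100, 300), (900, 50), (900, 400)]
```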

Code:

Install necessary libraries

install detectron2:

!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'

install layoutparser

!pip install layoutparser
!pip install layoutparser[ocr]

install opencv, numpy, matplotlib

!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
import shutil
import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

Define PDF path

pdf_file='7050X_Q_A.pdf'

Define your output file name here

output_file = 'output.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )

        layout_result3 = model3.detect(img)

        # Keep everything except figures
        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        # Split the page into left/right columns and sort each top-to-bottom
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])

        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])

        # Re-number blocks in reading order: left column first, then right
        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        # OCR each detected block
        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
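One idea I am considering for issue 2b is to rebuild the table's rows from per-word OCR boxes (e.g. the word boxes that pytesseract's `image_to_data` returns for the cropped table region) instead of OCRing the whole table block as free text. A minimal sketch, using hypothetical word boxes:

```python
# Sketch: rebuilding tabular rows from per-word OCR boxes. The sample
# boxes below are hypothetical; real ones could come from
# pytesseract.image_to_data on the cropped table image.
words = [
    {"text": "Qty",   "left": 40,  "top": 10},
    {"text": "Price", "left": 200, "top": 12},
    {"text": "2",     "left": 42,  "top": 50},
    {"text": "9.99",  "left": 205, "top": 51},
]

def to_rows(words, row_tol=10):
    # Words whose 'top' coordinates differ by at most row_tol pixels are
    # grouped into one row; each row is then ordered left-to-right and
    # joined with tabs to keep a tabular .txt layout.
    words = sorted(words, key=lambda w: w["top"])
    rows, current = [], [words[0]]
    for w in words[1:]:
        if abs(w["top"] - current[-1]["top"]) <= row_tol:
            current.append(w)
        else:
            rows.append(current)
            current = [w]
    rows.append(current)
    return ["\t".join(w["text"] for w in sorted(r, key=lambda w: w["left"]))
            for r in rows]

print(to_rows(words))
# ['Qty\tPrice', '2\t9.99']
```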

Environment

  1. Windows
  2. Layout Parser & layoutparser[ocr] version 0.3.4
  3. PyTorch version: 2.1.0+cu121
    !pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  4. google-cloud-vision-3.5.0
  5. google-api-core version: 2.11.1
  6. Python 3.10.6

Thanks Reema Jain

reema93jain commented 5 months ago

Hi Team,

Can someone please help with resolving the above issue?

Thank you for the help! Reema Jain