Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast
41 stars 109 forks

Enhancing Table Parsing with DETR and Pytesseract Integration #295

Open 35C4n0r opened 4 months ago

35C4n0r commented 4 months ago

Description

We have observed that the current implementation using Table Transformer is not achieving satisfactory performance in accurately detecting rows and columns within tables, particularly in the context of parsing Hindi tables from PDFs. To address this, we propose a new approach that integrates Detection Transformer (DETR) models with Pytesseract for improved detection of text objects within tables.

The objective is to develop a method where DETR models are used in conjunction with Pytesseract's OCR capabilities to enhance the accuracy of text detection and bounding box identification within table cells. This approach aims to provide a more robust solution for parsing tables by leveraging the strengths of both DETR models for object detection and Pytesseract for optical character recognition.

Proposed Workflow

Input

The input to the system will be PDFs or images containing tables, alongside the specification of the DETR model to be used and the language setting for Pytesseract.

DETR Model Processing

Use the specified DETR model to detect text objects within the tables. DETR models, known for their efficiency in object detection tasks, will help identify text blocks or cells within the complex structure of tables.

Pytesseract OCR

Apply Pytesseract with the specified language setting to the detected text objects to recognize the text within each cell.
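The steps above (input, DETR detection, Pytesseract OCR) could be sketched roughly as follows. This is a minimal sketch under assumptions, not the project's implementation: the helper names `pad_and_clamp`, `detect_boxes` and `ocr_boxes` are hypothetical, `model_name` stands for whichever DETR checkpoint is specified, the 0.7 score threshold, 2-pixel padding and `--psm 6` page segmentation mode are illustrative defaults, and `"hin"` is the Tesseract language code for Hindi. It assumes a PIL image as input; PDF pages would first need to be rendered to images by a separate tool.

```python
import math


def pad_and_clamp(box, width, height, pad=2):
    """Convert a float (left, top, right, bottom) box into an integer crop box.

    The box is grown by `pad` pixels on every side and clamped to the image
    bounds so that the crop never reaches outside the page.
    """
    left, top, right, bottom = box
    return (
        max(0, math.floor(left) - pad),
        max(0, math.floor(top) - pad),
        min(width, math.ceil(right) + pad),
        min(height, math.ceil(bottom) + pad),
    )


def detect_boxes(image, model_name, threshold=0.7):
    """Run an object-detection checkpoint and return boxes in pixel coordinates."""
    # Imported lazily so pad_and_clamp can be used without these packages.
    import torch
    from transformers import AutoImageProcessor, AutoModelForObjectDetection

    processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModelForObjectDetection.from_pretrained(model_name)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # target_sizes expects (height, width); PIL's image.size is (width, height).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [tuple(box.tolist()) for box in result["boxes"]]


def ocr_boxes(image, boxes, lang="hin", psm=6):
    """Crop each detected box and recognize its text with Pytesseract."""
    import pytesseract

    width, height = image.size
    results = []
    for box in boxes:
        crop_box = pad_and_clamp(box, width, height)
        text = pytesseract.image_to_string(
            image.crop(crop_box), lang=lang, config=f"--psm {psm}"
        )
        results.append((crop_box, text.strip()))
    return results
```

The language and page segmentation mode are kept as parameters because the best values are likely to depend on the table layout and script, and would need to be tuned on real samples.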

Output Mapping

The output will be a structured mapping of each word detected to its corresponding location within the table (e.g., row1/column1/cell1/table1). This includes combining words that belong to the same cell or object for a comprehensive representation of the table's content.
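One possible shape for this mapping is sketched below. It is an illustration under assumptions rather than a specified format: the `Box` and `Word` types, the `map_words_to_cells` function, the use of row and column pixel spans as input, and the row-major cell numbering are all hypothetical. Each word is assigned to the cell containing its center point, and words that land in the same cell are joined into one string.

```python
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


class Word(NamedTuple):
    text: str
    box: Box


def index_of_span(value, spans):
    """Return the 1-based index of the (start, end) span containing value."""
    for i, (start, end) in enumerate(spans, start=1):
        if start <= value < end:
            return i
    return None


def map_words_to_cells(words, row_spans, column_spans, table_id=1):
    """Group words by the table cell that contains their center point.

    row_spans holds (top, bottom) pixel ranges and column_spans holds
    (left, right) pixel ranges. Words outside every span are skipped.
    """
    cells = defaultdict(list)
    # Simplification: reading order is approximated by (top, left).
    for word in sorted(words, key=lambda w: (w.box.top, w.box.left)):
        center_x = (word.box.left + word.box.right) / 2
        center_y = (word.box.top + word.box.bottom) / 2
        row = index_of_span(center_y, row_spans)
        column = index_of_span(center_x, column_spans)
        if row is None or column is None:
            continue
        # Assumed numbering: cells are counted row by row, left to right.
        cell = (row - 1) * len(column_spans) + column
        key = f"row{row}/column{column}/cell{cell}/table{table_id}"
        cells[key].append(word.text)
    return {key: " ".join(texts) for key, texts in cells.items()}


example_words = [
    Word("Kumar", Box(45, 10, 90, 30)),
    Word("42", Box(120, 10, 140, 30)),
    Word("Ram", Box(10, 10, 40, 30)),
    Word("Sita", Box(10, 50, 40, 70)),
    Word("stray", Box(300, 10, 340, 30)),  # outside every column, so dropped
]
example_mapping = map_words_to_cells(
    example_words,
    row_spans=[(0, 40), (40, 80)],
    column_spans=[(0, 100), (100, 200)],
)
```

Sorting by top edge and then left edge is only a rough stand-in for reading order; cells with several text lines of uneven height would need a proper line-grouping step.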

Expected Outcome:

35C4n0r commented 4 months ago

cc: @GautamR-Samagra

basedsaksham commented 3 months ago

hi @GautamR-Samagra I'd like to work on this problem. Please assign me this.

basedsaksham commented 3 months ago

Greetings of the day, Samagra-Development/ai-tools. I have started work on improving the NER issue. I have already prepared code to detect phone numbers, emails, times, rates and units, and to calculate dates given as "next monday, agle somvar". If possible, may I be assigned to this issue and get access to the crop, seeds and pests datasets so I can proceed further with it?

On Fri, 15 Mar 2024, 09:00 Gautam wrote:

Assigned #295 https://github.com/Samagra-Development/ai-tools/issues/295 to @basedsaksham https://github.com/basedsaksham.


GautamR-Samagra commented 3 months ago


You are probably referring to the wrong ticket here

basedsaksham commented 3 months ago

[Screenshot: result of row and column extraction using DETR]

I got this result after extracting rows and columns using DETR. I will proceed to work on recognizing the text using OCR with Pytesseract. Kindly let me know if this example output is satisfactory.

basedsaksham commented 3 months ago

Hey @35C4n0r, can you please explain which Pytesseract settings and configs can be used to achieve the best output?