Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Comparison Test Between Surya and Pytesseract OCR Capabilities #303

Open Gautam-Rajeev opened 8 months ago

Gautam-Rajeev commented 8 months ago

Goal

Conduct a comprehensive comparison between the OCR (Optical Character Recognition) capabilities of Surya and Pytesseract. The aim is to determine which tool performs better under various conditions and to evaluate if either tool offers unique functionalities not covered by the other. A well-curated test set will be developed to facilitate this comparison.

Description

The objective is to systematically compare Surya and Pytesseract, two leading OCR tools, to understand their strengths and weaknesses in processing different types of text. The comparison should cover various aspects such as accuracy, speed, handling of different languages, and the ability to recognize text in complex backgrounds or with various fonts and sizes. The test set should include a diverse range of images that reflect real-world use cases where OCR might be applied.
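As a rough illustration of how the accuracy and speed aspects above could be measured, here is a minimal sketch in Python. It is not part of the issue's plan: character error rate (CER) is assumed here as the accuracy metric, and `benchmark`, `char_error_rate`, and `edit_distance` are hypothetical helper names. Each OCR engine is passed in as a plain function from an image path to text, so the harness itself depends on neither library.

```python
import time
from typing import Callable, Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length; 0.0 means a perfect match."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def benchmark(engine: Callable[[str], str],
              samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run one OCR engine over (image_path, ground_truth) pairs.

    `engine` is any callable mapping an image path to recognized text
    (an assumption of this sketch, not an API of either tool).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    total_cer = 0.0
    start = time.perf_counter()
    for image_path, truth in samples:
        total_cer += char_error_rate(truth, engine(image_path))
    elapsed = time.perf_counter() - start
    n = len(samples)
    return {"mean_cer": total_cer / n, "seconds_per_image": elapsed / n}
```

For Pytesseract, the engine could be something like `lambda path: pytesseract.image_to_string(Image.open(path))`. Surya would need a similar thin wrapper, whose exact shape depends on the installed version. Running `benchmark` once per engine over the same sample list gives directly comparable numbers.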

Key comparison metrics include:

Implementation Details

To effectively compare Surya and Pytesseract, the following steps will be taken:

Collaboration Opportunities: This project is open for anyone to contribute. Discussions, preliminary findings, and progress updates are encouraged in the comments section. The project may be assigned based on the contribution level and the quality of insights provided.

Product Name

pdfparsing

Organization Name

Samagra

Domain

OCR / Text Recognition

Tech Skills Needed

Category

Research and Development

Feature

PDF parsing

Mentor(s)

@ChakshuGautam

Complexity

Medium

kabirrajsingh commented 8 months ago

Hi @ChakshuGautam. I am planning to work on this issue. Could you please confirm that we are aiming for a dataset that mainly focuses on text in documents? (That is, not other cases such as text extraction from a random picture.)

Gautam-Rajeev commented 8 months ago

correct. extraction of text from documents