Comparison Test Between Surya and Pytesseract OCR Capabilities

Gautam-Rajeev commented 8 months ago

Goal

Conduct a comprehensive comparison between the OCR (Optical Character Recognition) capabilities of Surya and Pytesseract. The aim is to determine which tool performs better under various conditions and to evaluate if either tool offers unique functionalities not covered by the other. A well-curated test set will be developed to facilitate this comparison.

Description

The objective is to systematically compare Surya and Pytesseract, two leading OCR tools, to understand their strengths and weaknesses in processing different types of text. The comparison should cover various aspects such as accuracy, speed, handling of different languages, and the ability to recognize text in complex backgrounds or with various fonts and sizes. The test set should include a diverse range of images that reflect real-world use cases where OCR might be applied.

Key comparison metrics include:

Text recognition accuracy
Processing speed
Robustness across different image qualities
Support for multiple languages, with a focus on English, Hindi, and Oriya
Ability to recognize text in complex layouts such as tables, footnotes, and charts
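
As a rough illustration of the accuracy metric, character and word error rates can be computed from the edit distance between a ground-truth transcription and the OCR output. The sketch below is only one possible way to score results, not an agreed method for this issue; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and it assumes both strings have already been normalized the same way (whitespace, Unicode form).

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from ref
                curr[j - 1] + 1,     # insert an item from hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumption: an empty reference scores 0 only if the output is also empty.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: the same idea applied to whitespace-separated tokens."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Note that `cer` here counts Unicode code points. For Hindi and Oriya, where one visible character is often several code points, counting grapheme clusters instead may be the fairer choice; that decision is left open.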

Implementation Details

To effectively compare Surya and Pytesseract, the following steps will be taken:

Developing a Test Set: Collect and/or create a diverse set of images that include plain text, text over images, handwritten notes, and texts in various fonts and sizes. Ensure the test set covers multiple languages and text orientations.
Benchmarking Criteria: Define clear metrics for comparison, including accuracy (measured by character and word recognition rates), speed (time taken to process images of varying sizes), and error rates across different languages and fonts.
Comparative Analysis: Run both Surya and Pytesseract on the test set, documenting their performance based on the predefined criteria.
Functionality Check: List and compare the features and functionalities offered by both tools, noting any unique capabilities or limitations.
Documentation and Reporting: Compile the results into a detailed report, highlighting which tool performs better under specific conditions and providing insights into the potential use cases for each tool.
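
The comparative-analysis step above could be organized around a small harness that runs each engine over the same image list and records outputs and wall-clock time. This is a minimal sketch under stated assumptions: `benchmark` and the `engines` mapping are hypothetical names, and each engine is assumed to be wrapped in a function that takes an image path and returns the recognized text.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(
    engines: Dict[str, Callable[[str], str]],
    image_paths: List[str],
) -> Dict[str, dict]:
    """Run every OCR wrapper on every image and collect text plus timings."""
    results = {}
    for name, ocr_fn in engines.items():
        timings = []
        texts = {}
        for path in image_paths:
            start = time.perf_counter()
            texts[path] = ocr_fn(path)  # wrapper returns the recognized text
            timings.append(time.perf_counter() - start)
        results[name] = {
            "texts": texts,
            "total_seconds": sum(timings),
            "mean_seconds": mean(timings) if timings else 0.0,
        }
    return results


# Example wiring (commented out because it needs pytesseract and Pillow):
# import pytesseract
# from PIL import Image
# engines = {
#     "pytesseract": lambda p: pytesseract.image_to_string(Image.open(p), lang="eng"),
# }
# report = benchmark(engines, ["page1.png", "page2.png"])  # hypothetical file names
```

A Surya wrapper would be added to `engines` in the same way; it is not shown because its Python API differs between releases. Model loading should happen outside the timed wrapper so that one-off startup cost is not counted against per-image speed.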

Collaboration Opportunities: This project is open for anyone to contribute. Discussions, preliminary findings, and progress updates are encouraged in the comments section. The project may be assigned based on the contribution level and the quality of insights provided.

Product Name

pdfparsing

Organization Name

Samagra

Domain

OCR / Text Recognition

Tech Skills Needed

Python
OCR technologies
Image processing

Feature

PDF parsing

Mentor(s)

@ChakshuGautam

Complexity

Medium

kabirrajsingh commented 8 months ago

Hi @ChakshuGautam. I am planning to work on this issue. Could you please confirm that we are aiming for a dataset that mainly focuses on text in documents? (That is, not considering other cases such as text extraction from a random picture.)

Gautam-Rajeev commented 8 months ago

Correct. The focus is on extraction of text from documents.

Samagra-Development / ai-tools