Open Gautam-Rajeev opened 8 months ago
Hi @ChakshuGautam. I am planning to work on this issue. Could you please confirm that we are aiming for a dataset focused mainly on text in documents? (That is, not covering other cases such as text extraction from an arbitrary picture.)
Correct. The focus is extraction of text from documents.
Goal:
Conduct a comprehensive comparison between the OCR (Optical Character Recognition) capabilities of Surya and Pytesseract. The aim is to determine which tool performs better under various conditions and to evaluate if either tool offers unique functionalities not covered by the other. A well-curated test set will be developed to facilitate this comparison.
Description
The objective is to systematically compare Surya and Pytesseract, two leading OCR tools, to understand their strengths and weaknesses in processing different types of text. The comparison should cover various aspects such as accuracy, speed, handling of different languages, and the ability to recognize text in complex backgrounds or with various fonts and sizes. The test set should include a diverse range of images that reflect real-world use cases where OCR might be applied.
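As a rough illustration of how the accuracy and speed aspects could be scored, here is a minimal sketch. It assumes character error rate (CER, edit distance divided by reference length) as the accuracy measure, which the issue does not prescribe. The names `edit_distance`, `char_error_rate`, and `evaluate` are hypothetical helpers, and `ocr_fn` stands in for whatever wrapper is written around each tool; no Surya or Pytesseract calls are made here.

```python
import time
from typing import Callable, Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, not from the issue)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def evaluate(ocr_fn: Callable[[str], str],
             samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run ocr_fn over (document, ground_truth) pairs; report mean CER and time per document."""
    errors = []
    start = time.perf_counter()
    for document, truth in samples:
        errors.append(char_error_rate(truth, ocr_fn(document)))
    elapsed = time.perf_counter() - start
    n = len(samples)
    return {
        "mean_cer": sum(errors) / n if n else 0.0,
        "seconds_per_doc": elapsed / n if n else 0.0,
    }
```

In practice `ocr_fn` would wrap each tool, for example a function that opens an image path and returns the output of `pytesseract.image_to_string`, with an equivalent wrapper for Surya. Running both wrappers through the same `evaluate` call on the same test set keeps the comparison like for like.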
Key comparison metrics include:
Implementation Details
To effectively compare Surya and Pytesseract, the following steps will be taken:
Collaboration Opportunities: This project is open for anyone to contribute. Discussions, preliminary findings, and progress updates are encouraged in the comments section. The project may be assigned based on the contribution level and the quality of insights provided.
Product Name
pdfparsing
Organization Name
Samagra
Domain
OCR / Text Recognition
Tech Skills Needed
Category
Research and Development
Feature
PDF parsing
Mentor(s)
@ChakshuGautam
Complexity
Medium