OCR and Voice Recognition Module

Description

The OCR and Voice Recognition Module extracts and processes text from PDF documents, images, and audio files. It combines multiple OCR engines with voice recognition, and improves output quality through error correction with Large Language Models (LLMs), math formula processing, and document structure identification. The module is configured through a .env file and supports GPU acceleration, which makes it suitable for applications ranging from document digitization to voice-controlled systems.

Installation
Usage
Features
Contributing
License
Acknowledgements
FAQs
Contact
Roadmap
Changelog
Demo

Installation

Prerequisites

Python 3.8 or higher
Git
Virtual environment tool (e.g., venv or virtualenv)
Tesseract OCR installed on your system

Steps

Clone the Repository

```
git clone https://github.com/PStarH/ocr-voice-recognition-module.git
cd ocr-voice-recognition-module
```

Create a Virtual Environment

```
python3 -m venv venv
source venv/bin/activate
```

Install Dependencies
```
pip install -r requirements.txt
```
Install Tesseract OCR
- Ubuntu
```
sudo apt-get update
sudo apt-get install tesseract-ocr
```
- macOS
```
brew install tesseract
```
- Windows
  - Download the installer from Tesseract OCR and follow the installation instructions.
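
Once the tools are installed, it can help to confirm they are visible from the shell before continuing. The following is a minimal sketch, not part of the module: `check_prerequisites` is a hypothetical helper that uses only the Python standard library to check the interpreter version and look for `git` and `tesseract` on the PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper for illustration; the module does not ship it.
    """
    return {
        # The prerequisites list asks for Python 3.8 or higher.
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A `MISSING` entry for `tesseract` usually means the install step above did not complete or the binary is not on the PATH.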
Download Additional Models

Ensure that the required models for EAST, CRAFT, and the LLMs are downloaded and placed in the directories specified in the configuration.

Configure Environment Variables

Create a .env file in the root directory with the following structure:

```
USE_LOCAL_LLM=True
API_PROVIDER=OLLAMA
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL_NAME=ggml-gpt4all-j-v1.3-groovy
CLAUDE_MODEL_STRING=claude-3-haiku-20240307
MATH_OCR_API_KEY=your_math_ocr_api_key
MATH_OCR_ENDPOINT=your_math_ocr_endpoint
LLM_ERROR_CORRECTION_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
LLM_LAYOUT_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
PREPROCESSING_ENABLED=True
PROGRESS_TRACKING_ENABLED=True
OCR_ENGINE=pytesseract
PADDLEOCR_ENABLED=True
PADDLEOCR_LANGUAGE=en
PADDLEOCR_USE_GPU=False
TEXT_DETECTION_MODEL=EAST
TEXT_DETECTION_THRESHOLD=0.5
```

Note: Replace placeholder values with your actual configuration details.
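
Values in a .env file are plain strings, so flags such as `PADDLEOCR_USE_GPU=False` need explicit conversion before use. How the module itself loads these settings is not shown here; the following is a standard-library-only sketch in which `parse_env_text` and `as_bool` are hypothetical helpers.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def as_bool(value):
    """Interpret common truthy spellings such as 'True', 'true', or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


# A few keys taken from the .env structure above.
sample = """
USE_LOCAL_LLM=True
OCR_ENGINE=pytesseract
PADDLEOCR_USE_GPU=False
TEXT_DETECTION_THRESHOLD=0.5
"""

settings = parse_env_text(sample)
use_local_llm = as_bool(settings["USE_LOCAL_LLM"])
use_gpu = as_bool(settings["PADDLEOCR_USE_GPU"])
threshold = float(settings["TEXT_DETECTION_THRESHOLD"])
```

Testing the raw string directly (`if settings["PADDLEOCR_USE_GPU"]:`) would treat `"False"` as true, which is why the explicit conversion matters.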

Usage

Run the OCR and Voice Recognition Workflow

```
python OCR.py
```

Parameters

Input PDF File: Specify the path to the PDF file you want to process by updating the input_pdf_file_path variable in the main function.
Reformat as Markdown: Set reformat_as_markdown to True to convert the extracted text into Markdown format.
Suppress Headers and Page Numbers: Set suppress_headers_and_page_numbers to True to remove headers and page numbers from the final output.

Example

```
input_pdf_file_path = 'path/to/your/document.pdf'
max_test_pages = 0  # Set to 0 to process all pages
skip_first_n_pages = 0  # Set to skip initial pages if needed
reformat_as_markdown = True
suppress_headers_and_page_numbers = True
```
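
The interaction between `max_test_pages` and `skip_first_n_pages` can be sketched as follows. This assumes that pages are skipped first and the page limit is applied to what remains, with 0 meaning no limit; the actual logic in OCR.py may differ, and `select_page_indices` is a hypothetical helper.

```python
def select_page_indices(total_pages, max_test_pages=0, skip_first_n_pages=0):
    """Return the zero-based page indices that would be processed.

    Assumed semantics: skip the first N pages, then keep at most
    max_test_pages of the remainder (0 keeps all of them).
    """
    # Clamp the start so skipping more pages than exist yields an empty list.
    start = min(max(skip_first_n_pages, 0), total_pages)
    remaining = range(start, total_pages)
    if max_test_pages > 0:
        remaining = remaining[:max_test_pages]
    return list(remaining)


# A 10-page document, skipping 2 pages and testing on 3.
pages = select_page_indices(10, max_test_pages=3, skip_first_n_pages=2)
```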

Voice Recognition Usage

```
python Voice-Recognition.py
```

Configure the input audio file path and other settings in the main function as needed.

Features

PDF to Image Conversion: Converts PDF files to images for OCR processing.
Multiple OCR Engines: Supports pytesseract, EasyOCR, and PaddleOCR as primary and backup OCR engines.
Text Detection Models: Uses the EAST and CRAFT text detection models to locate text regions before recognition.
Error Correction: Integrates with LLMs to correct OCR and voice recognition errors, enhancing text quality.
Math Formula Processing: Detects and processes mathematical formulas using specialized OCR tools.
Document Structure Identification: Analyzes and formats the extracted text into structured Markdown.
Voice Recognition: Implements advanced voice recognition with multiple ASR engines and validation mechanisms.
GPU Acceleration: Supports GPU usage for faster processing with compatible models.
Asynchronous Processing: Implements asynchronous operations for efficient handling of large documents and audio files.
Progress Tracking: Provides progress indicators during OCR and processing tasks.
Language Support: Configurable to support multiple languages for OCR and voice recognition.
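
The primary and backup engine arrangement listed above can be illustrated with a small fallback loop. This is a sketch of the general pattern, not the module's implementation: `run_with_fallback` is a hypothetical helper, and the three stand-in engines only simulate the behavior of real OCR calls.

```python
def run_with_fallback(image, engines):
    """Try each (name, callable) pair in order; return the first non-empty text."""
    errors = []
    for name, recognize in engines:
        try:
            text = recognize(image)
        except Exception as exc:  # any engine failure triggers the next backup
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))


# Stand-in engines for demonstration only. Real ones would wrap
# pytesseract, EasyOCR, or PaddleOCR calls.
def failing_engine(image):
    raise ValueError("engine unavailable")


def empty_engine(image):
    return "   "


def working_engine(image):
    return "Hello, world"


engine_name, text = run_with_fallback(
    b"fake-image-bytes",
    [
        ("primary", failing_engine),
        ("backup-1", empty_engine),
        ("backup-2", working_engine),
    ],
)
```

Treating an empty or whitespace-only result as a failure lets a backup engine take over when the primary engine runs but recognizes nothing.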

Contributing

Contributions are welcome! Please follow these steps:

Fork the Repository
Create a Feature Branch
```
git checkout -b feature/YourFeature
```
Commit Your Changes
```
git commit -m "Add your feature"
```
Push to the Branch
```
git push origin feature/YourFeature
```
Open a Pull Request

Please ensure that your code follows the project's coding standards and includes appropriate documentation.

Code of Conduct

Please read and follow our Code of Conduct to ensure a welcoming and respectful environment for all contributors.

License

This project is licensed under the GPL-3.0 License.

Acknowledgements

FAQs

1. How do I switch between different OCR engines?

Update the OCR_ENGINE variable in your .env file to pytesseract, easyocr, or paddleocr based on your preference.
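
As an illustration, a setting like this is typically read and validated once at startup. The sketch below assumes the value is exposed as an environment variable; `resolve_ocr_engine` is a hypothetical helper, and the module's own lookup may work differently.

```python
import os

# The three engine names accepted for OCR_ENGINE.
SUPPORTED_ENGINES = ("pytesseract", "easyocr", "paddleocr")


def resolve_ocr_engine(environ=None, default="pytesseract"):
    """Return the configured engine name, rejecting unknown values early."""
    environ = os.environ if environ is None else environ
    name = environ.get("OCR_ENGINE", default).strip().lower()
    if name not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported OCR_ENGINE {name!r}; expected one of {SUPPORTED_ENGINES}"
        )
    return name


# Passing a dict instead of os.environ keeps the example self-contained.
engine = resolve_ocr_engine({"OCR_ENGINE": "EasyOCR"})
```

Normalizing case means `EasyOCR` and `easyocr` resolve to the same engine, and a typo fails immediately instead of partway through a long document.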

2. Can I use this module without GPU?

Yes, the module is fully functional on CPU. However, GPU acceleration is available and recommended for faster processing if your system supports it.

3. How do I add support for additional languages?

Ensure that the required language packs are installed for your chosen OCR engines and update the SUPPORTED_LANGUAGES configuration in the .env file.

4. What should I do if I encounter an error during installation?

Check the error logs for specific issues, ensure all prerequisites are met, and verify that all dependencies are correctly installed. Feel free to open an issue on the repository for further assistance.

5. Is there a way to contribute feedback on OCR accuracy?

Yes, the module includes a feedback mechanism. Refer to the collect_user_feedback function in the code for details on how to provide feedback.

Contact

For support or inquiries, please reach out via GitHub Issues.

Roadmap

Voice Recognition Integration: Enhance voice recognition features for improved accessibility and additional input methods.
Enhanced Error Handling: Expand error handling mechanisms to cover more edge cases and provide detailed logging.
Support for Additional OCR Engines: Integrate more OCR engines to increase flexibility and accuracy.
Web Interface: Develop a web-based interface for easier interaction and processing of documents.
Real-Time Processing: Enable real-time OCR and voice recognition processing for live document feeds and audio streams.
Multilingual Support: Expand OCR and voice recognition capabilities to support multiple languages beyond English.
User Authentication: Add authentication mechanisms for secure access to OCR and voice recognition functionalities in shared environments.
Cloud Deployment: Adapt the module for deployment on cloud platforms to leverage scalable resources.
API Development: Create a RESTful API to allow other applications to interact with the OCR and Voice Recognition module programmatically.
Performance Optimization: Continuously optimize the module for faster processing times and reduced resource consumption.

Changelog

v1.0.0

Initial release with OCR and Voice Recognition capabilities.
Supported OCR engines: pytesseract, EasyOCR, PaddleOCR.
Integrated text detection models: EAST and CRAFT.
Implemented error correction using LLMs.
Added math formula processing.
Configured GPU acceleration support.

v1.1.0

Enhanced error handling and logging mechanisms.
Added support for additional languages.
Improved performance optimizations for faster processing.

v1.2.0

Integrated new OCR engines and updated existing ones.
Added real-time processing features.
Expanded Contributing and Acknowledgements sections.
