The OCR and Voice Recognition Module extracts and processes text from PDF documents, images, and audio files. It combines multiple OCR engines with voice recognition, and adds error correction using Large Language Models (LLMs), math formula processing, and document structure identification. It is configurable through environment variables and supports optional GPU acceleration, which suits applications ranging from document digitization to voice-controlled systems.
A virtual environment tool (venv or virtualenv)
Clone the Repository
git clone https://github.com/PStarH/ocr-voice-recognition-module.git
cd ocr-voice-recognition-module
Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate
Install Dependencies
pip install -r requirements.txt
Install Tesseract OCR
On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
On macOS (Homebrew):
brew install tesseract
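To confirm that the Tesseract binary is reachable before running the module, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the helper name `find_tesseract` is hypothetical and not part of the module.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path to the tesseract executable, or None if it is not on PATH."""
    # Hypothetical helper for illustration; not part of the module.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it with apt-get or brew")
    else:
        # `tesseract --version` reports the installed version.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        print((result.stdout or result.stderr).strip())
```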
Download Additional Models
Ensure that the required models for EAST, CRAFT, and LLMs are downloaded and placed in the appropriate directories as specified in the configuration.
Configure Environment Variables
Create a .env file in the root directory with the following structure:
USE_LOCAL_LLM=True
API_PROVIDER=OLLAMA
OLLAMA_API_URL=http://localhost:11434
OLLAMA_MODEL_NAME=ggml-gpt4all-j-v1.3-groovy
CLAUDE_MODEL_STRING=claude-3-haiku-20240307
MATH_OCR_API_KEY=your_math_ocr_api_key
MATH_OCR_ENDPOINT=your_math_ocr_endpoint
LLM_ERROR_CORRECTION_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
LLM_LAYOUT_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
PREPROCESSING_ENABLED=True
PROGRESS_TRACKING_ENABLED=True
OCR_ENGINE=pytesseract
PADDLEOCR_ENABLED=True
PADDLEOCR_LANGUAGE=en
PADDLEOCR_USE_GPU=False
TEXT_DETECTION_MODEL=EAST
TEXT_DETECTION_THRESHOLD=0.5
Note: Replace placeholder values with your actual configuration details.
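For illustration, the settings above could be read from Python roughly as follows. This is a sketch that parses simple `KEY=VALUE` lines with the standard library only; the helper names `load_env_file` and `as_bool` are hypothetical, and the module itself may load its configuration differently.

```python
import os


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict, skipping blanks and comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, so URLs and other values stay intact.
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def as_bool(value, default=False):
    """Interpret strings such as 'True' or 'False' (case-insensitive) as booleans."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    if os.path.exists(".env"):
        config = load_env_file(".env")
        use_gpu = as_bool(config.get("PADDLEOCR_USE_GPU"))
        threshold = float(config.get("TEXT_DETECTION_THRESHOLD", "0.5"))
        print(config.get("OCR_ENGINE"), use_gpu, threshold)
```

Values arrive as strings, so boolean and numeric settings such as `PADDLEOCR_USE_GPU` and `TEXT_DETECTION_THRESHOLD` need an explicit conversion before use.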
python OCR.py
Set the input_pdf_file_path variable in the main function.
Set reformat_as_markdown to True to convert the extracted text into Markdown format.
Set suppress_headers_and_page_numbers to True to remove headers and page numbers from the final output.
input_pdf_file_path = 'path/to/your/document.pdf'
max_test_pages = 0 # Set to 0 to process all pages
skip_first_n_pages = 0 # Set to skip initial pages if needed
reformat_as_markdown = True
suppress_headers_and_page_numbers = True
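One plausible reading of how `skip_first_n_pages` and `max_test_pages` combine is sketched below: skip the leading pages first, then cap the number of pages processed, with 0 meaning no cap. The function `select_page_indices` is hypothetical and the ordering is an assumption; check OCR.py for the actual behavior.

```python
def select_page_indices(total_pages, skip_first_n_pages=0, max_test_pages=0):
    """Return zero-based page indices to process (hypothetical helper).

    Assumption: pages are skipped first, then the page limit is applied.
    A max_test_pages of 0 means "process all remaining pages".
    """
    start = min(max(skip_first_n_pages, 0), total_pages)
    if max_test_pages > 0:
        end = min(start + max_test_pages, total_pages)
    else:
        end = total_pages
    return list(range(start, end))


if __name__ == "__main__":
    # A 10-page document, skipping 2 pages and processing at most 3.
    print(select_page_indices(10, skip_first_n_pages=2, max_test_pages=3))
```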
python Voice-Recognition.py
Configure the input audio file path and other settings in the main function as needed.
Uses pytesseract, EasyOCR, and PaddleOCR as primary and backup OCR engines.
Contributions are welcome! Please follow these steps:
git checkout -b feature/YourFeature
git commit -m "Add your feature"
git push origin feature/YourFeature
Please ensure that your code follows the project's coding standards and includes appropriate documentation.
Please read and follow our Code of Conduct to ensure a welcoming and respectful environment for all contributors.
This project is licensed under the GPL-3.0 License.
Update the OCR_ENGINE variable in your .env file to pytesseract, easyocr, or paddleocr based on your preference.
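The idea of a preferred engine with the others as backups can be sketched as follows. This is not the module's actual implementation: the names `engine_order` and `run_with_fallback` are hypothetical, and the engines are passed in as plain callables so that the sketch stays independent of any OCR library.

```python
import os

SUPPORTED_ENGINES = ("pytesseract", "easyocr", "paddleocr")


def engine_order(preferred=None):
    """Return the engines to try: the preferred one first, the rest as backups."""
    name = (preferred or os.environ.get("OCR_ENGINE", "pytesseract")).strip().lower()
    if name not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported OCR_ENGINE: {name!r}")
    return [name] + [e for e in SUPPORTED_ENGINES if e != name]


def run_with_fallback(image, engines, order):
    """Try each engine callable in order; return (name, text) for the first non-empty result.

    `engines` maps an engine name to a callable taking the image and returning text.
    """
    errors = {}
    for name in order:
        func = engines.get(name)
        if func is None:
            continue  # engine not installed or not configured
        try:
            text = func(image)
        except Exception as exc:  # record the failure and move on to the next engine
            errors[name] = exc
            continue
        if text and text.strip():
            return name, text
    raise RuntimeError(f"all OCR engines failed or returned no text: {errors}")
```

In practice each callable would wrap one library, for example a small function around `pytesseract.image_to_string`.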
Yes, the module is fully functional on CPU. However, GPU acceleration is available and recommended for faster processing if your system supports it.
Ensure that the required language packs are installed for your chosen OCR engines and update the SUPPORTED_LANGUAGES configuration in the .env file.
Check the error logs for specific issues, ensure all prerequisites are met, and verify that all dependencies are correctly installed. Feel free to open an issue on the repository for further assistance.
Yes, the module includes a feedback mechanism. Refer to the collect_user_feedback function in the code for details on how to provide feedback.
For support or inquiries, please reach out via GitHub Issues.