DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
10.48k stars 507 forks

feat(ocr): Integrating PaddleOCR in Docling #392

Closed Swaymaw closed 2 days ago

Swaymaw commented 2 days ago

Add a description of the changes:

  1. Files changed:
     a. datamodel/pipeline_options.py - Added options for the PaddleOCR framework.
     b. models/paddle_ocr_model.py - Added the processing steps that compute and post-process PaddleOCR results within the existing Docling flow.
     c. pipeline/standard_pdf_pipeline.py - Added a condition for PaddleOCROptions in the get_ocr_model function, so the PaddleOCR model is returned whenever it is requested through the pipeline options.
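The dispatch described in item (c) could look roughly like the sketch below. It is a self-contained illustration, not the code from this PR: only the names PaddleOCROptions and get_ocr_model come from the description above, while the base class, the `lang` field, the second engine, and the model classes are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OcrOptions:
    """Hypothetical base class shared by all OCR engine options."""
    lang: List[str] = field(default_factory=lambda: ["en"])


@dataclass
class OtherOcrOptions(OcrOptions):
    """Stand-in for an OCR engine that is already supported."""


@dataclass
class PaddleOCROptions(OcrOptions):
    """Stand-in for the options added in pipeline_options.py (fields assumed)."""


class OtherOcrModel:
    """Placeholder model; a real one would run OCR on page images."""

    def __init__(self, options: OcrOptions) -> None:
        self.options = options


class PaddleOcrModel:
    """Placeholder for the model in paddle_ocr_model.py."""

    def __init__(self, options: OcrOptions) -> None:
        self.options = options


def get_ocr_model(options: OcrOptions):
    """Pick the OCR model that matches the type of the options object."""
    if isinstance(options, OtherOcrOptions):
        return OtherOcrModel(options)
    if isinstance(options, PaddleOCROptions):
        return PaddleOcrModel(options)
    raise ValueError(f"Unsupported OCR options: {type(options).__name__}")


model = get_ocr_model(PaddleOCROptions(lang=["en"]))
print(type(model).__name__)
```

Selecting the model by the type of the options object keeps the pipeline itself unchanged: supporting another engine only needs a new options class and one more branch.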

Checklist:

mergify[bot] commented 2 days ago

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

- [X] `title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:`
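To see why this PR's title passes, the pattern from the rule can be tried locally. The snippet below is a minimal sketch using Python's `re` module; the helper name `is_conventional_title` is made up for illustration, and Mergify's own matching may differ in details.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with `type:` or `type(scope):`."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional_title("feat(ocr): Integrating PaddleOCR in Docling"))
print(is_conventional_title("Integrating PaddleOCR in Docling"))
```

The first call checks the title of this PR; the second drops the `feat(ocr):` prefix, so the pattern finds nothing to match at the start of the string.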