Using paddle ocr backend instead of tesseract for hi_res models

jashdalvi commented 1 year ago

Hello,

Is there a way to use the paddle OCR backend instead of tesseract? I want to speed up table extraction and noticed that the main bottleneck is OCR. I have a GPU and want to maximise its usage. What is the recommended approach?

yuming-long commented 1 year ago

Hi there! We currently don't support paddle in our hosted API, but you can run the API with paddle locally if you are not using an Apple M1/M2 chip (paddle doesn't support GPU on the M1 architecture).

First you need to install paddle inside your environment:

pip install paddlepaddle-gpu
pip install "unstructured.PaddleOCR"

Then you can run the API locally with:

export TABLE_OCR=paddle
make run-web-app
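
Before starting the API, it can help to confirm that paddle is actually importable from the same Python environment. The following is a minimal sketch, assuming the import name is `paddle`; the helper `is_importable` is a hypothetical name, not part of unstructured.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `paddle` is the assumed import name of the paddlepaddle package.
    print("paddle importable:", is_importable("paddle"))
```

If this prints `False`, the install most likely went into a different environment than the one that runs the API.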

jashdalvi commented 1 year ago

Thank you, I will try this out. What would you recommend for speeding things up when table extraction and hi_res models are always required? We have a GPU and can use it.

yuming-long commented 1 year ago

For speed, tesseract is actually the best option we have. I did some OCR speed comparisons and paddle on GPU is still 5x slower than tesseract, though the quality might be better with paddle.
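
A comparison like this is easy to rerun on your own documents. Below is a minimal timing sketch; the backends are placeholder callables you would supply yourself (for example, thin wrappers around tesseract and paddle), and `time_backends` is a hypothetical helper, not an unstructured API.

```python
import time
from typing import Callable, Dict, Sequence


def time_backends(
    backends: Dict[str, Callable[[object], object]],
    pages: Sequence[object],
    repeats: int = 1,
) -> Dict[str, float]:
    """Return the mean seconds per page for each named OCR callable."""
    if not pages or repeats < 1:
        raise ValueError("need at least one page and one repeat")
    results: Dict[str, float] = {}
    for name, run_ocr in backends.items():
        start = time.perf_counter()
        for _ in range(repeats):
            for page in pages:
                run_ocr(page)  # placeholder: call the real OCR backend here
        results[name] = (time.perf_counter() - start) / (repeats * len(pages))
    return results
```

Running it over both a short and a long document would show whether the ranking changes with document size.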

jashdalvi commented 1 year ago

Interesting, I also tested a 75-page PDF with the OCR backends and paddle was 5-6x faster. Also, there is this error when setting ENTIRE_PAGE_OCR to paddle:

AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

I tried setting the ocr_languages param to 'en', but that didn't work.
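
The error suggests a Tesseract-style code (`eng`) is reaching paddle unchanged. A minimal sketch of the kind of translation that would avoid it is shown below. The set of accepted codes is copied from the AssertionError above; the mapping table and the function `to_paddle_lang` are hypothetical illustrations, not the fix that was applied in unstructured.

```python
# Language codes accepted by paddle, copied from the AssertionError above.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical, partial mapping from Tesseract codes to paddle codes.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
}


def to_paddle_lang(code: str, default: str = "en") -> str:
    """Translate a language code to one paddle accepts, else fall back."""
    if code in PADDLE_LANGS:
        return code  # already a valid paddle code
    return TESSERACT_TO_PADDLE.get(code, default)
```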

yuming-long commented 1 year ago

> Interesting, I also tested a 75 page pdf with ocr backends and paddle was 5-6x faster.

Oh, thanks for letting me know! I didn't investigate speed with large PDF files, so I will take a look in the future.

> Tried setting the ocr_languages param to 'en' but didn't work

Yep, sorry about that; the lang parameter needs to be fixed for paddle. Here is the PR with a quick workaround: https://github.com/Unstructured-IO/unstructured-inference/pull/226

jashdalvi commented 1 year ago

Thanks for the quick fix! Appreciate it

jashdalvi commented 1 year ago

Hey @yuming-long, I tried your fork and I am getting this error now. Any help would be appreciated.

OSError: image file is truncated (4699 bytes not processed)

2023-09-22 19:13:12,423 unstructured_api ERROR image file is truncated (4699 bytes not processed)

yuming-long commented 1 year ago

Thanks for flagging this, I will try to reproduce it. In the meantime, are you comfortable sharing the file you are testing with? I did a quick search and it seems the cause could be that the image is too large when we convert the PDF file to images for OCR.

jashdalvi commented 1 year ago

I won't be able to share the PDF due to compliance, but I was able to track down the bug. I think unstructured-inference creates images on disk in /tmp and doesn't clean them up if the request fails, so the disk quickly filled up after many failed requests. After manually deleting the .ppm files, it worked. Thanks.
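
For anyone hitting the same thing, the manual cleanup can be scripted. This is a minimal sketch, assuming the leftover `.ppm` files sit directly in the temp directory (it does not look into subdirectories); `remove_stale_ppm` is a hypothetical helper name and the one-hour threshold is arbitrary.

```python
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_ppm(
    directory: Optional[str] = None, max_age_seconds: float = 3600.0
) -> List[Path]:
    """Delete *.ppm files older than max_age_seconds; return what was removed."""
    root = Path(directory or tempfile.gettempdir())
    cutoff = time.time() - max_age_seconds
    removed: List[Path] = []
    for path in root.glob("*.ppm"):
        try:
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
        except OSError:
            # The file vanished or is not ours to delete; skip it.
            continue
    return removed
```

The age threshold is there so that images belonging to requests still in flight are left alone.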

yuming-long commented 1 year ago

Gotcha, thanks for sharing this! I will make a follow-up ticket to address the tmp disk issue.

jashdalvi commented 1 year ago

Hi @yuming-long, just wanted to ask a follow-up question. I am currently working on improving inference where table extraction is necessary for large PDFs. I noticed that when we specify strategy as hi_res and pdf_infer_table_structure as true, it OCRs the whole PDF instead of using pdfminer on pages that do not contain tables. Do you think a first-stage classifier could be included to identify pages of interest (those that contain tables, figures, etc.) and pages that are not? We could then split the PDF and speed up inference a lot.

LaverdeS commented 1 year ago

Hi @jashdalvi. We have thought about similar ideas, such as running some pages in hi_res only when elements like tables are present. However, we need to explore the tradeoff between adding an extra processing block to the pipeline for this logic (like the classifier) and parsing the whole page content in hi_res. For your purpose, you could add the classifier idea to the pipeline yourself and run each page with the corresponding strategy, fast or hi_res. This could perhaps be an optional parameter to the partition.auto method (we need to think it through). We would be very interested in a PR profiling such a change (time per page, support for classifying multiple pages in parallel, the model's dependencies, ideally proposals for deployment, etc.).
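
The routing part of that idea can be sketched independently of any particular classifier. In the sketch below, `needs_hi_res` stands in for a page classifier that does not exist in unstructured today, and `plan_strategies` is a hypothetical helper; it only groups consecutive pages by the strategy they would be run with.

```python
from itertools import groupby
from typing import Callable, Iterable, List, Tuple


def plan_strategies(
    page_numbers: Iterable[int],
    needs_hi_res: Callable[[int], bool],
) -> List[Tuple[str, List[int]]]:
    """Group consecutive pages into runs that share a strategy."""
    # Label each page with the strategy the (hypothetical) classifier picks.
    labelled = (
        ("hi_res" if needs_hi_res(page) else "fast", page)
        for page in page_numbers
    )
    return [
        (strategy, [page for _, page in run])
        for strategy, run in groupby(labelled, key=lambda item: item[0])
    ]


if __name__ == "__main__":
    table_pages = {3, 4}  # pretend the classifier flagged these pages
    print(plan_strategies(range(1, 7), lambda p: p in table_pages))
```

Each run could then be split out of the PDF and partitioned with its own strategy, which is also where the profiling mentioned above would need to happen.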

jashdalvi commented 1 year ago

Thanks, @LaverdeS for the response. I will definitely take a look at this.

crapthings commented 11 months ago

> Hi there! We currently don't support paddle in our hosted API but you can run API with paddle locally if you are not using Apple M1/M2 chip (paddle don't support gpu on M1 arch)
>
> First you need to install paddle inside your environment:
> pip install paddlepaddle-gpu
> pip install "unstructured.PaddleOCR"
> then you can run api locally with
> export TABLE_OCR=paddle
> make run-web-app

I've tried this, but it still asks me to provide tesseract:

pip install paddlepaddle

{
    "detail": "tesseract is not installed or it's not in your PATH. See README file for more information."
}

crapthings commented 11 months ago

@yuming-long

Setting OCR_AGENT to paddle seems to work, but the scan doesn't recognize Chinese.

export OCR_AGENT=paddle

uvicorn prepline_general.api.app:app \
    --log-config logger_config.yaml \
    --host 0.0.0.0

[
    {
        "type": "Title",
        "element_id": "773ef9304b1ee1c86f364509168b904e",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "PSXHEI"
    },
    {
        "type": "Title",
        "element_id": "8ec3a0daaaae3adab5f3389ce6f481b6",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "Ctrl+N Ctrl+O Ctrl+W Ctrl+Alt+C"
    },
    {
        "type": "Title",
        "element_id": "7533d693bbcc387df4ab73bb3bbc1d86",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "Ctrl+S Ctrl+Shift+C Ctrl+Q Ctrl+Alt+I"
    },
    {
        "type": "Title",
        "element_id": "6775b9287a22fea0565de69e11603ec9",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "iRH"
    },
    {
        "type": "Table",
        "element_id": "c2eb50055c1fa94047651abb32e4e466",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "parent_id": "6775b9287a22fea0565de69e11603ec9",
            "page_number": 1
        },
        "text": "Ctrl+X ggtj] Ctrt+J Ctrl+Z 6R- Ctrl+Enter Ctrl+Shift+Z #* Ctrl+D Ctrt+T Ctrl+Alt+Z Shift+F5 Ctrl+B Shift+F6 Ctrl+G Ctrl+Shift+I Ctrt+Shift+G Ctrt+E Alt+Delete Ctrl+Delete Ctrt+; Ctrl+R RT/TR Ctr+Shift+U Ctrt+\" Ctrl+Tab Ctrl+U itHX Ctr+Alt+Shift+S 7web1It Ctrl+L EPT Ctr+Alt+Shift+EE"
    },
    {
        "type": "Title",
        "element_id": "e6c26b08351a084655419ddf723ab894",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "Ctrl+M"
    },
    {
        "type": "Title",
        "element_id": "fedb545e8656e30bad07e0d298036410",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "Ctrl+1"
    },
    {
        "type": "UncategorizedText",
        "element_id": "334359b90efed75da5f0ada1d5e6b256",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "parent_id": "fedb545e8656e30bad07e0d298036410",
            "page_number": 1
        },
        "text": "#"
    },
    {
        "type": "Title",
        "element_id": "6c4dbc3cc01d72eb484648bdbc558181",
        "metadata": {
            "filename": "x.jpg",
            "filetype": "image/jpeg",
            "page_number": 1
        },
        "text": "Ps Photoshop"
    }
]

What lang code should I pass?

log

 1500}
2023-11-10 18:44:39,453 unstructured_inference INFO Reading image file: /tmp/tmpu055u8ru ...
2023-11-10 18:44:39,460 unstructured_inference INFO Detecting page elements ...
2023-11-10 18:44:40,153 unstructured INFO Processing entire page OCR with paddle...
2023-11-10 18:44:40,519 192.168.50.162:54143 POST /general/v0/general HTTP/1.1 - 200 OK
2023-11-10 18:48:17,299 unstructured_api DEBUG pipeline_api input params: {"filename": "x.jpg", "response_type": "application/json", "m_coordinates": [], "m_encoding": [], "m_hi_res_model_name": [], "m_include_page_breaks": [], "m_ocr_languages": null, "m_pdf_infer_table_structure": [], "m_skip_infer_table_types": [], "m_strategy": [], "m_xml_keep_tags": [], "languages": null, "m_chunking_strategy": [], "m_multipage_sections": [], "m_combine_under_n_chars": [], "new_after_n_chars": [], "m_max_characters": []}
2023-11-10 18:48:17,299 unstructured_api DEBUG filetype: image/jpeg
2023-11-10 18:48:17,299 unstructured_api DEBUG partition input data: {"content_type": "image/jpeg", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": false, "include_page_breaks": false, "encoding": null, "model_name": null, "xml_keep_tags": false, "skip_infer_table_types": ["pdf", "jpg", "png"], "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": 500, "new_after_n_chars": 1500, "max_characters": 1500}
2023-11-10 18:48:17,300 unstructured_inference INFO Reading image file: /tmp/tmpxz1v_0hi ...
2023-11-10 18:48:17,307 unstructured_inference INFO Detecting page elements ...
2023-11-10 18:48:18,002 unstructured INFO Processing entire page OCR with paddle...
2023-11-10 18:48:18,381 192.168.50.162:54299 POST /general/v0/general HTTP/1.1 - 200 OK

Update: this is the official demo result.

yuming-long commented 11 months ago

Hi there!

We currently don't support language mapping for paddle, and the only paddle language option we use is en (English) from paddle's list of supported languages, so you won't be able to pass the language code for Chinese to paddle in your case.

However, I can show you a temporary trick to pass the language parameter to paddle:

Find the location of the unstructured package. You can find it with:
```
import unstructured
unstructured.__file__
```
Find the file that sets the default lang parameter for paddle: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/utils/ocr_models/paddle_ocr.py#L7
In your case you can change the en default to ch, which covers Chinese and English.

Hope this helps. I am sorry for the inconvenience; in the meantime, I will raise the need for a paddle language parameter with the team.
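
To go straight to the file to edit, the two lookup steps can be combined. This is a minimal sketch: `locate_in_package` is a hypothetical helper, and the relative path is taken from the GitHub URL above, so it may differ in the version you have installed.

```python
import importlib.util
from pathlib import Path


def locate_in_package(package: str, relative_path: str) -> Path:
    """Return the on-disk path of a file inside an installed package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate package {package!r}")
    # spec.origin points at the package's __init__.py; its parent is the package dir.
    return Path(spec.origin).parent / relative_path


if __name__ == "__main__":
    # Path assumed from the URL above; check that it exists in your install.
    target = locate_in_package(
        "unstructured", "partition/utils/ocr_models/paddle_ocr.py"
    )
    print(target, "exists:", target.is_file())
```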

nadirvishun commented 2 months ago

If you use Docker Compose, you can do it like this:

Create a shell script named paddle.sh:

#!/bin/sh
set -e

# Check whether paddle is installed. The Docker image is Wolfi-based and uses Python 3.11.
if ! python3.11 -c "import paddle" > /dev/null 2>&1; then
  echo "Installing paddle..."
  # Installing unstructured.paddleocr requires build-base and python-3.11-dev.
  apk update && apk add --no-cache build-base python-3.11-dev
  pip install paddlepaddle
  pip install unstructured.paddleocr
else
  echo "paddle is already installed."
fi

# Run the Dockerfile's original entrypoint.
sh scripts/app-start.sh

Then change docker-compose.yml:

services:
  unstructured:
    image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
    restart: always
    ports:
      - "8000:8000"
    environment:
      #- HF_ENDPOINT=https://hf-mirror.com # if you need a Hugging Face mirror
      - OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle # newer versions expect this class path instead of `paddle`
    volumes:
      - ./volumes/unstructured/data:/app/data
      - ./volumes/unstructured/paddle.sh:/paddle.sh # mount paddle.sh into the container
    entrypoint: ["/bin/sh", "/paddle.sh"] # run paddle.sh
    user: root # `apk add` needs root permission

Links:
- OCR_AGENT: see "Set the OCR agent"
- The paddle language is specified the same way as for tesseract; see PYTESSERACT_TO_PADDLE_LANG_CODE_MAP

Unstructured-IO / unstructured-api

Using paddle ocr backend instead of tesseract for hi_res models #247