Hi there! We currently don't support paddle in our hosted API, but you can run the API with paddle locally if you are not using an Apple M1/M2 chip (paddle doesn't support GPU on the M1 architecture):
pip install paddlepaddle-gpu
pip install "unstructured.PaddleOCR"
export TABLE_OCR=paddle
make run-web-app
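As a quick sanity check once the server is up, a request like the following should return the parsed elements. This is a minimal sketch only: it assumes the server from make run-web-app is listening on localhost:8000, sample.pdf is a placeholder file, and the form field names follow the /general/v0/general endpoint seen later in this thread.

import requests

# Hypothetical local test against the unstructured API started above
resp = requests.post(
    "http://localhost:8000/general/v0/general",
    files={"files": open("sample.pdf", "rb")},                    # placeholder file
    data={"strategy": "hi_res", "pdf_infer_table_structure": "true"},
)
resp.raise_for_status()
print(resp.json()[:3])  # first few parsed elements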
Thank you, I will try this out. What would you recommend for speeding things up when table extraction is required and the hi_res strategy is always used? We actually have a GPU and can use it.
For speed, tesseract is actually the best option we have. I did some OCR speed comparisons and paddle on GPU is still 5x slower than tesseract, though the output quality might be better with paddle.
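For anyone wanting to reproduce this comparison locally, here is a rough timing sketch. The file name is a placeholder, and the ENTIRE_PAGE_OCR / TABLE_OCR variables are the ones used in this thread; they may be read at import time, so run one backend per process.

import os, sys, time

# Backend is taken from the command line so each run is a fresh process:
#   python time_ocr.py tesseract   |   python time_ocr.py paddle
backend = sys.argv[1] if len(sys.argv) > 1 else "tesseract"
os.environ["ENTIRE_PAGE_OCR"] = backend
os.environ["TABLE_OCR"] = backend

from unstructured.partition.pdf import partition_pdf

start = time.perf_counter()
elements = partition_pdf(
    "large_scanned.pdf",          # placeholder file
    strategy="hi_res",
    infer_table_structure=True,
)
print(backend, len(elements), f"{time.perf_counter() - start:.1f}s")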
Interesting, I also tested a 75-page PDF with both OCR backends and paddle was 5-6x faster. Also, there is this error when setting ENTIRE_PAGE_OCR to paddle:
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
I tried setting the ocr_languages param to 'en' but it didn't work.
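For context, the assertion above comes from PaddleOCR itself, which uses its own short language codes ("en", "ch", ...) rather than tesseract's three-letter codes. A standalone check against the upstream paddleocr package looks roughly like this; the image path is a placeholder and the result structure varies slightly between paddleocr versions.

from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")          # passing "eng" (tesseract-style) raises the AssertionError above
result = ocr.ocr("page_image.png")  # placeholder image path
for box, (text, confidence) in result[0]:
    print(text, confidence)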
> Interesting, I also tested a 75-page PDF with both OCR backends and paddle was 5-6x faster.
Oh, thanks for letting me know! I didn't investigate speed with large PDF files, so I will take a look in the future.
> I tried setting the ocr_languages param to 'en' but it didn't work.
Yep, sorry about that, the lang parameter should be fixed for paddle. Here is the PR for a quick workaround: https://github.com/Unstructured-IO/unstructured-inference/pull/226
Thanks for the quick fix! Appreciate it
Hey @yuming-long, I tried your fork and I am getting this error now. Any help would be appreciated.
OSError: image file is truncated (4699 bytes not processed)
2023-09-22 19:13:12,423 unstructured_api ERROR image file is truncated (4699 bytes not processed)
Thanks for flagging this, I will try to reproduce it. In the meantime, would you be comfortable sharing the file you are testing with? I did a quick search and it seems like the reason could be that the image is too large when we convert the PDF file to images for OCR.
I won't be able to share the PDF due to compliance, but I was able to track down the bug. I think unstructured-inference creates images on disk in /tmp and doesn't clean them up if the request fails, so the space quickly filled up from the many failed requests. After manually deleting the .ppm files myself, it worked. Thanks.
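A small cleanup along the lines of what I did manually, as a sketch; the one-hour cutoff and the .ppm glob are assumptions, adjust to your setup.

import glob, os, time

# Delete leftover .ppm page images in /tmp older than an hour,
# e.g. from failed hi_res requests as described above.
cutoff = time.time() - 3600
for path in glob.glob("/tmp/*.ppm"):
    if os.path.getmtime(path) < cutoff:
        os.remove(path)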
Gotcha, thanks for sharing this! I will make a follow-up ticket to address the tmp disk issue.
Hi @yuming-long, just wanted to ask a follow-up question. I am currently working on improving inference where table extraction is necessary for large PDFs. I noticed that when we specify strategy as hi_res and pdf_infer_table_structure as true, it OCRs the whole PDF instead of using pdfminer on the pages which do not contain tables. Do you think a first-stage classifier could be included to identify pages of interest (which contain tables, figures, etc.) and pages that do not? We could then split the PDF and speed up inference a lot.
Hi @jashdalvi. We have thought about similar ideas to run some pages in hi_res when elements like tables are present. Nevertheless, we need to explore further the tradeoff between adding an extra processing block to the pipeline for this logic (like the classifier) and parsing the whole page content in hi_res. For your purpose, you could easily add the classifier idea to the pipeline and run each page with the corresponding strategy, fast or hi_res, as sketched below. This could perhaps be an optional parameter to the partition.auto method (we need to think it through). We would be very interested indeed in a PR profiling such a change (how much time per page, integration to classify multiple pages in parallel, the dependencies the model needs, and hopefully proposals for deployment, etc.).
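The rough sketch mentioned above, assuming pypdf for the per-page split and a placeholder page_has_table classifier; both are illustrative and not part of unstructured.

from io import BytesIO
from pypdf import PdfReader, PdfWriter
from unstructured.partition.pdf import partition_pdf

def page_has_table(page) -> bool:
    # Placeholder classifier: replace with a cheap layout/vision model in practice.
    return "Table" in (page.extract_text() or "")

def partition_routed(pdf_path):
    elements = []
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # Write the single page into an in-memory PDF.
        writer = PdfWriter()
        writer.add_page(page)
        buf = BytesIO()
        writer.write(buf)
        buf.seek(0)
        # Route to hi_res only when the classifier flags the page.
        strategy = "hi_res" if page_has_table(page) else "fast"
        elements.extend(partition_pdf(
            file=buf,
            strategy=strategy,
            infer_table_structure=(strategy == "hi_res"),
        ))
    # Note: page_number metadata restarts at 1 for each single-page split.
    return elements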
Thanks, @LaverdeS for the response. I will definitely take a look at this.
> Hi there! We currently don't support paddle in our hosted API, but you can run the API with paddle locally if you are not using an Apple M1/M2 chip (paddle doesn't support GPU on the M1 architecture)
> First you need to install paddle inside your environment:
> pip install paddlepaddle-gpu
> pip install "unstructured.PaddleOCR"
> Then you can run the API locally with:
> export TABLE_OCR=paddle
> make run-web-app
I've tried this, but it still asks me to provide tesseract. I installed paddle with:
pip install paddlepaddle
{
"detail": "tesseract is not installed or it's not in your PATH. See README file for more information."
}
@yuming-long
Setting OCR_AGENT to paddle seems to work, but the scan doesn't recognize Chinese:
export OCR_AGENT=paddle
uvicorn prepline_general.api.app:app \
--log-config logger_config.yaml \
--host 0.0.0.0
[
{
"type": "Title",
"element_id": "773ef9304b1ee1c86f364509168b904e",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "PSXHEI"
},
{
"type": "Title",
"element_id": "8ec3a0daaaae3adab5f3389ce6f481b6",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "Ctrl+N Ctrl+O Ctrl+W Ctrl+Alt+C"
},
{
"type": "Title",
"element_id": "7533d693bbcc387df4ab73bb3bbc1d86",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "Ctrl+S Ctrl+Shift+C Ctrl+Q Ctrl+Alt+I"
},
{
"type": "Title",
"element_id": "6775b9287a22fea0565de69e11603ec9",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "iRH"
},
{
"type": "Table",
"element_id": "c2eb50055c1fa94047651abb32e4e466",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"parent_id": "6775b9287a22fea0565de69e11603ec9",
"page_number": 1
},
"text": "Ctrl+X ggtj] Ctrt+J Ctrl+Z 6R- Ctrl+Enter Ctrl+Shift+Z #* Ctrl+D Ctrt+T Ctrl+Alt+Z Shift+F5 Ctrl+B Shift+F6 Ctrl+G Ctrl+Shift+I Ctrt+Shift+G Ctrt+E Alt+Delete Ctrl+Delete Ctrt+; Ctrl+R RT/TR Ctr+Shift+U Ctrt+\" Ctrl+Tab Ctrl+U itHX Ctr+Alt+Shift+S 7web1It Ctrl+L EPT Ctr+Alt+Shift+EE"
},
{
"type": "Title",
"element_id": "e6c26b08351a084655419ddf723ab894",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "Ctrl+M"
},
{
"type": "Title",
"element_id": "fedb545e8656e30bad07e0d298036410",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "Ctrl+1"
},
{
"type": "UncategorizedText",
"element_id": "334359b90efed75da5f0ada1d5e6b256",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"parent_id": "fedb545e8656e30bad07e0d298036410",
"page_number": 1
},
"text": "#"
},
{
"type": "Title",
"element_id": "6c4dbc3cc01d72eb484648bdbc558181",
"metadata": {
"filename": "x.jpg",
"filetype": "image/jpeg",
"page_number": 1
},
"text": "Ps Photoshop"
}
]
What is the lang code to pass?
Log:
2023-11-10 18:44:39,453 unstructured_inference INFO Reading image file: /tmp/tmpu055u8ru ...
2023-11-10 18:44:39,460 unstructured_inference INFO Detecting page elements ...
2023-11-10 18:44:40,153 unstructured INFO Processing entire page OCR with paddle...
2023-11-10 18:44:40,519 192.168.50.162:54143 POST /general/v0/general HTTP/1.1 - 200 OK
2023-11-10 18:48:17,299 unstructured_api DEBUG pipeline_api input params: {"filename": "x.jpg", "response_type": "application/json", "m_coordinates": [], "m_encoding": [], "m_hi_res_model_name": [], "m_include_page_breaks": [], "m_ocr_languages": null, "m_pdf_infer_table_structure": [], "m_skip_infer_table_types": [], "m_strategy": [], "m_xml_keep_tags": [], "languages": null, "m_chunking_strategy": [], "m_multipage_sections": [], "m_combine_under_n_chars": [], "new_after_n_chars": [], "m_max_characters": []}
2023-11-10 18:48:17,299 unstructured_api DEBUG filetype: image/jpeg
2023-11-10 18:48:17,299 unstructured_api DEBUG partition input data: {"content_type": "image/jpeg", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": false, "include_page_breaks": false, "encoding": null, "model_name": null, "xml_keep_tags": false, "skip_infer_table_types": ["pdf", "jpg", "png"], "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": 500, "new_after_n_chars": 1500, "max_characters": 1500}
2023-11-10 18:48:17,300 unstructured_inference INFO Reading image file: /tmp/tmpxz1v_0hi ...
2023-11-10 18:48:17,307 unstructured_inference INFO Detecting page elements ...
2023-11-10 18:48:18,002 unstructured INFO Processing entire page OCR with paddle...
2023-11-10 18:48:18,381 192.168.50.162:54299 POST /general/v0/general HTTP/1.1 - 200 OK
Update: this is the official demo result.
Hi there!
We currently don't support language mapping for paddle, and the only option for the paddle language is en (English) from the list of supported languages for paddle, so we won't be able to pass the language code for Chinese to paddle in your case.
However, I can show you a temporary trick to pass the language parameter to paddle: go to where the unstructured package is installed (you can find it with import unstructured; unstructured.__file__) and change the hard-coded paddle language from the default en to ch, which covers both Chinese and English. Hope this helps. I am sorry for the inconvenience, and in the meantime I will raise the need for a paddle language parameter to the team.
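A small helper to locate candidate files to edit, as a sketch; the lang="en" search string is an assumption about how the paddle call looks in your installed version, so adjust it as needed.

import pathlib
import unstructured

# Locate the installed package, as suggested above, then grep it for the
# hard-coded paddle language.
pkg_dir = pathlib.Path(unstructured.__file__).parent
for path in pkg_dir.rglob("*.py"):
    text = path.read_text(errors="ignore")
    if "paddle" in text.lower() and 'lang="en"' in text:
        print("candidate file:", path)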
If you use docker compose, you can do it like this. Create a shell script named paddle.sh:
#!/bin/sh
set -e
# Check whether paddle is installed; the docker image is Wolfi-based and the Python version is 3.11
if ! python3.11 -c "import paddle" > /dev/null 2>&1; then
    echo "Installing paddle..."
    # Installing unstructured.paddleocr needs build-base and python-3.11-dev
    apk update && apk add --no-cache build-base python-3.11-dev
    pip install paddlepaddle
    pip install unstructured.paddleocr
else
    echo "paddle is already installed."
fi
# Run the Dockerfile's entrypoint
sh scripts/app-start.sh
Then change docker-compose.yml:
services:
  unstructured:
    image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
    restart: always
    ports:
      - "8000:8000"
    environment:
      #- HF_ENDPOINT=https://hf-mirror.com # if you need a huggingface mirror
      - OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle # newer versions need the full class path, not just `paddle`
    volumes:
      - ./volumes/unstructured/data:/app/data
      - ./volumes/unstructured/paddle.sh:/paddle.sh # mount paddle.sh
    entrypoint: ["/bin/sh", "/paddle.sh"] # run paddle.sh
    user: root # root permission is needed to run `apk add`
Links:
- OCR_AGENT: see "Set the OCR agent" in the docs
- The paddle language codes are mapped from the tesseract ones, see PYTESSERACT_TO_PADDLE_LANG_CODE_MAP
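An illustrative sketch only of what such a mapping does; the real PYTESSERACT_TO_PADDLE_LANG_CODE_MAP lives in the unstructured codebase and may contain different entries.

# Hypothetical subset of the tesseract-to-paddle language code mapping.
TESSERACT_TO_PADDLE = {
    "eng": "en",      # English
    "chi_sim": "ch",  # Simplified Chinese (paddle's "ch" model also covers English)
}

def to_paddle_lang(tesseract_code: str) -> str:
    # Fall back to English if the code has no paddle equivalent.
    return TESSERACT_TO_PADDLE.get(tesseract_code, "en")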
Hello,
Is there a way to use the paddle OCR backend instead of tesseract? I actually want to speed things up for table extraction and noticed that the main bottleneck is OCR. I have a GPU and want to maximise its usage. What is the recommended way to approach this?
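For reference, the hi_res table-extraction call being discussed looks roughly like this; the file name is a placeholder and the torch GPU check assumes a CUDA build of torch (and paddlepaddle-gpu, if paddle is the OCR backend) is installed.

import torch
from unstructured.partition.pdf import partition_pdf

# Quick check that a GPU is visible to the layout/OCR models.
print("CUDA available:", torch.cuda.is_available())

elements = partition_pdf(
    "tables.pdf",                # placeholder path
    strategy="hi_res",
    infer_table_structure=True,  # populates metadata.text_as_html for Table elements
)
tables = [el for el in elements if el.category == "Table"]
print(len(tables), "tables found")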