PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Can ch_PP-OCRv4_rec_server_infer's support for English be put into the documentation? #11715

Closed ryx2 closed 4 months ago

ryx2 commented 7 months ago

I notice that if I call

from paddleocr import PaddleOCR

ocr = PaddleOCR(
    det_model_dir="ch_PP-OCRv4_det_server_infer",
    rec_model_dir="ch_PP-OCRv4_rec_infer",
    lang="en",
)
...
result = ocr.ocr(my_image)

this works fine. However, if I set the rec model to the server version as well (ch_PP-OCRv4_rec_server_infer), I get the following error:

  File "/opt/conda/lib/python3.10/site-packages/paddleocr/paddleocr.py", line 661, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls)
  File "/opt/conda/lib/python3.10/site-packages/paddleocr/tools/infer/predict_system.py", line 105, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "/opt/conda/lib/python3.10/site-packages/paddleocr/tools/infer/predict_rec.py", line 628, in __call__
    rec_result = self.postprocess_op(preds)
  File "/opt/conda/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 121, in __call__
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
  File "/opt/conda/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 83, in decode
    char_list = [
  File "/opt/conda/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 84, in <listcomp>
    self.character[text_id]
IndexError: list index out of range

I'm guessing this happens because the model is trying to output Chinese, which has a dictionary of about 8000 characters, whereas the English dictionary has only 90 or so. Since the model list says the server model supports English (https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/models_list.md), is it possible to get the PP-OCRv4 server model to output English successfully?
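The guess above can be illustrated with a toy lookup. This is only a sketch: `decode`, `small_dict`, and `large_dict` are made-up names and data, not PaddleOCR's real decoder or dictionaries. It shows how indices produced for a larger character set run past the end of a smaller one, which is the same kind of IndexError as in the traceback.

```python
# Toy illustration only: the dictionaries and indices are invented,
# they are not PaddleOCR's actual dictionaries or model outputs.

def decode(indices, charset):
    """Map predicted class indices to characters by list lookup."""
    return "".join(charset[i] for i in indices)

# Stands in for a short English-only dictionary.
small_dict = ["a", "b", "c"]
# Stands in for a longer Chinese + English dictionary.
large_dict = ["a", "b", "c", "中", "文", "字"]

# Indices that a model trained against the larger dictionary might emit.
preds = [0, 4, 5]

# Decoding with the matching dictionary succeeds.
print(decode(preds, large_dict))

# Decoding with the smaller dictionary fails, because 4 and 5 are out of range.
try:
    decode(preds, small_dict)
except IndexError as exc:
    print("IndexError:", exc)
```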

TingquanGao commented 7 months ago

Please try setting lang to 'ch', or use the en_PP-OCRv4_rec model.

ryx2 commented 7 months ago

So what you're saying is that ch_PP-OCRv4_rec_server_infer doesn't actually support English?

tink2123 commented 7 months ago

ch_PP-OCRv4_rec_server_infer supports Chinese + English, whereas lang='en' supports pure English. 'en' and 'ch' correspond to different dictionaries:

https://github.com/PaddlePaddle/PaddleOCR/blob/69832ab5326c6db614af6fb74b530aeae1c9b80e/paddleocr.py#L93-L96

https://github.com/PaddlePaddle/PaddleOCR/blob/69832ab5326c6db614af6fb74b530aeae1c9b80e/paddleocr.py#L88-L91

The model and the dictionary need to be consistent, so when you use ch_PP-OCRv4_rec_server_infer, do not change the lang parameter.
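One way to catch this kind of mismatch before running inference is a small guard in your own code. The sketch below is not part of PaddleOCR: `lang_from_model_dir` and `check_lang` are hypothetical helpers, and they assume the recognition model directory follows the naming pattern seen in this thread, where the prefix before the first underscore ('ch', 'en') names the dictionary the model was trained with.

```python
from pathlib import Path


def lang_from_model_dir(model_dir):
    """Guess the language code from a directory name such as
    'ch_PP-OCRv4_rec_server_infer' (assumed naming convention)."""
    return Path(model_dir).name.split("_", 1)[0]


def check_lang(rec_model_dir, lang):
    """Raise early if lang does not match the model's name prefix."""
    expected = lang_from_model_dir(rec_model_dir)
    if expected != lang:
        raise ValueError(
            f"rec model {rec_model_dir!r} looks like a {expected!r} model, "
            f"but lang={lang!r} was requested"
        )
    return lang


# The configuration suggested in this thread passes the check.
print(check_lang("ch_PP-OCRv4_rec_server_infer", "ch"))

# The configuration from the original report is rejected before inference.
try:
    check_lang("ch_PP-OCRv4_rec_server_infer", "en")
except ValueError as exc:
    print("mismatch:", exc)
```

The value returned by `check_lang` could then be passed as `lang` when constructing `PaddleOCR`. The check is only a heuristic: a model directory renamed by the user would defeat it.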

ryx2 commented 6 months ago

Ah, I see now: Chinese + English != {Chinese, English}. Got it.