PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

BUG: When OCR inference runs on a multi-page PDF with the page_num parameter set, only the first page is recognized #10259

Closed minboo closed 1 year ago

minboo commented 1 year ago

Please provide the following information to quickly locate the problem

import os
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# One shared OCR engine, created once when the module is loaded.
ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                use_mp=True,
                total_process_num=4,
                use_gpu=True,
                page_num=999,
                cls_model_dir="/workspace/OCR/models/PP-OCRv3/ch_ppocr_mobile_v2.0_cls_infer",
                det_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_det_infer",
                rec_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_rec_infer")

def process_predict(path: str):
    result = ocr.ocr(path, cls=True)
    return result

@app.post("/test")
async def ocr_rec(file: UploadFile = File(...)):
    # Save the upload under a unique name, then run OCR on the saved file.
    upload_folder = "input/upload/"
    os.makedirs(upload_folder, exist_ok=True)
    new_filename = str(uuid.uuid4()) + os.path.splitext(file.filename)[-1]
    file_path = os.path.join(upload_folder, new_filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = process_predict(file_path)

    return {"results": result}


Steps to reproduce: first recognize a single-page PDF, then recognize a multi-page PDF. Only the first page of the multi-page PDF is recognized.

livingbody commented 1 year ago

Found the problem. A PR is in progress, one moment. PR link: https://github.com/PaddlePaddle/PaddleOCR/pull/10290

livingbody commented 1 year ago

Fix and test project: PaddlePaddle AI Studio (飞桨AI Studio), an AI learning and training community https://aistudio.baidu.com/aistudio/projectdetail/6474682?contributionType=1

dizhenx commented 1 year ago

Please provide the following information to quickly locate the problem

  • System Environment: the problem occurs on both Windows and Linux
  • Version: both paddleocr and paddlepaddle are the latest versions

Partial code:
app = FastAPI()

ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                           use_mp=True,
                           total_process_num=4,
                           use_gpu=True,
                           page_num=999,
                           cls_model_dir="/workspace/OCR/models/PP-OCRv3/ch_ppocr_mobile_v2.0_cls_infer",
                           det_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_det_infer",
                           rec_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_rec_infer")

def process_predict(path: str):
    result= ocr.ocr(path, cls=True)
    return result

@app.post("/test")
async def ocr_rec(file: UploadFile = File(...)):

    upload_folder = "input/upload/"
    os.makedirs(upload_folder, exist_ok=True)
    new_filename = str(uuid.uuid4()) + os.path.splitext(file.filename)[-1]
    file_path = os.path.join(upload_folder, new_filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = process_predict(file_path)

    return {"results": result}

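Steps to reproduce: first recognize a single-page PDF, then recognize a multi-page PDF. Only the first page of the multi-page PDF is recognized.

This is probably caused by an incorrect PyMuPDF version. Try switching to version 1.18.14.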

minboo commented 1 year ago

The PyMuPDF version is definitely 1.18.14, because with any other version recognizing a PDF fails with AttributeError: 'Document' object has no attribute 'pageCount'. I have records of all of this.
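
The error quoted above is what one would expect when code reads `pageCount` from a document object that only offers the snake_case name `page_count`. Below is a minimal sketch of a tolerant lookup, assuming the document object exposes the page count under one of those two attribute names. The helper name `get_page_count` is hypothetical and is not part of PaddleOCR or PyMuPDF.

```python
def get_page_count(doc):
    """Return the page count of a PDF document object.

    Assumption: the object exposes the count as `page_count` (newer
    naming) or `pageCount` (older naming). Any object with one of these
    attributes works; nothing here imports or depends on PyMuPDF.
    """
    for name in ("page_count", "pageCount"):
        if hasattr(doc, name):
            return getattr(doc, name)
    raise AttributeError("document object exposes neither page_count nor pageCount")
```

A lookup like this lets the calling code keep working across library versions instead of pinning one release, at the cost of hiding which naming style is actually in use.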

shiyutang commented 1 year ago

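page_num is fixed when a PaddleOCR instance is initialized. After that, every call to ocr.ocr uses the page_num that was determined by the first PDF passed in. Could you re-initialize a PaddleOCR instance for each call?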
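
The behavior described above, where a limit settled by the first PDF carries over to later calls, can be illustrated with a toy model. This is not PaddleOCR's code: the class names `StatefulReader` and `PerCallReader` and the list-of-pages input are hypothetical, and the sketch only assumes that a page limit stored on a long-lived instance is overwritten during a call.

```python
class StatefulReader:
    """Toy model of the reported symptom (hypothetical, not PaddleOCR)."""

    def __init__(self, page_num=0):
        self.page_num = page_num  # 0 or an oversized value means "all pages"

    def read(self, pages):
        # Problem: the clamped limit is written back to the instance,
        # so the first document's page count sticks for later calls.
        if self.page_num == 0 or self.page_num > len(pages):
            self.page_num = len(pages)
        return pages[:self.page_num]


class PerCallReader:
    """Same interface, but the limit is recomputed for every document."""

    def __init__(self, page_num=0):
        self.page_num = page_num

    def read(self, pages):
        limit = self.page_num  # local copy; the instance is never mutated
        if limit == 0 or limit > len(pages):
            limit = len(pages)
        return pages[:limit]
```

With `page_num=999`, reading a one-page document and then a three-page document returns one page both times from `StatefulReader`, while `PerCallReader` returns all three pages of the second document.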

minboo commented 1 year ago

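Re-initializing an instance on every call is very time-consuming. Creating the instance takes longer than the recognition itself, so how is that usable?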

minboo commented 1 year ago

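If every call to ocr.ocr uses a page_num determined by the first PDF passed in, then what is the point of the page_num parameter at initialization? I suggest changing this behavior.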

shiyutang commented 1 year ago

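I suggest trying the PR. I just checked and it does solve the problem, and it has already been merged.

Quoting livingbody: "Found the problem. A PR is in progress, one moment. PR link: #10290"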

shiyutang commented 1 year ago

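The answers above have fully addressed the question. If you have new questions, feel free to open an issue at any time or continue replying under this one. We have also launched an issue-tackling campaign for the PaddlePaddle toolkits, and interested developers are welcome to join: https://github.com/PaddlePaddle/PaddleOCR/issues/10223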

ColorfulDick commented 6 months ago

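I reproduced this problem too. After initializing PaddleOCR and feeding it PDF files several times, sometimes only a limited number of pages are recognized.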

clSpider commented 4 months ago

I ran into this problem too. When multi-page PDFs are recognized one after another, only the first page is recognized.
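
To check whether a given installation still shows the behavior reported in this thread, a small harness along these lines could be used. It is only a sketch: `recognize` is a hypothetical callable standing in for a long-lived engine's recognition method, and the code assumes that the callable returns one result entry per PDF page, which should be verified for the version in use.

```python
from typing import Callable, List, Sequence


def pages_recognized(recognize: Callable[[str], Sequence],
                     paths: Sequence[str]) -> List[int]:
    """Run one long-lived recognizer over the paths in order and
    return how many result entries (assumed: pages) each call produced."""
    return [len(recognize(path) or []) for path in paths]


def find_truncated(counts: Sequence[int], expected: Sequence[int]) -> List[int]:
    """Return the indexes of documents that produced fewer entries than expected."""
    return [i for i, (got, want) in enumerate(zip(counts, expected)) if got < want]
```

Passing a single-page file first and a multi-page file second, in that order, mirrors the reproduction steps from the original report. A non-empty result from `find_truncated` would indicate that the problem is still present.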