BUG: When running OCR inference on a multi-page PDF with the page_num parameter set, only the first page is recognized

PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

https://paddlepaddle.github.io/PaddleOCR/

Apache License 2.0

BUG: When running OCR inference on a multi-page PDF with the page_num parameter set, only the first page is recognized #10259

Closed minboo closed 1 year ago

minboo commented 1 year ago

Please provide the following information to quickly locate the problem

System Environment: the problem occurs on both Windows and Linux
Version: paddleocr and paddlepaddle are both the latest versions

Partial code:

import os
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# A single OCR instance is created at startup and reused by every request.
ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                use_mp=True,
                total_process_num=4,
                use_gpu=True,
                page_num=999,
                cls_model_dir="/workspace/OCR/models/PP-OCRv3/ch_ppocr_mobile_v2.0_cls_infer",
                det_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_det_infer",
                rec_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_rec_infer")

def process_predict(path: str):
    # Run detection, angle classification and recognition on the file.
    result = ocr.ocr(path, cls=True)
    return result

@app.post("/test")
async def ocr_rec(file: UploadFile = File(...)):
    # Save the upload under a unique name, keeping the original extension.
    upload_folder = "input/upload/"
    os.makedirs(upload_folder, exist_ok=True)
    new_filename = str(uuid.uuid4()) + os.path.splitext(file.filename)[-1]
    file_path = os.path.join(upload_folder, new_filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = process_predict(file_path)

    return {"results": result}



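Steps to reproduce: first recognize a single-page PDF, then recognize a multi-page PDF. For the multi-page PDF, only the first page is recognized.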

livingbody commented 1 year ago

Found the problem; a PR is in progress, one moment. PR link: https://github.com/PaddlePaddle/PaddleOCR/pull/10290

livingbody commented 1 year ago

The fix and a test are available at: PaddlePaddle AI Studio (AI learning and training community) https://aistudio.baidu.com/aistudio/projectdetail/6474682?contributionType=1

dizhenx commented 1 year ago

This is probably caused by an incorrect PyMuPDF version. Try switching to version 1.18.14.

minboo commented 1 year ago

The PyMuPDF version is definitely 1.18.14, because with any other version PDF recognition fails with AttributeError: 'Document' object has no attribute 'pageCount'. I have all of this on record.

shiyutang commented 1 year ago

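page_num is fixed when a PaddleOCR instance is initialized. On every later call to ocr.ocr, page_num keeps the value that was determined by the first PDF passed in. Could you re-initialize a PaddleOCR instance for each call?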
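
The behavior described here can be modeled with a toy class. This is a simplified sketch of the pattern (a per-call limit written back onto the shared instance), not PaddleOCR's actual source; the class names `BuggyEngine` and `FixedEngine` are hypothetical, and a list of strings stands in for the pages of a PDF.

```python
class BuggyEngine:
    """Toy model: the clamped page limit is stored back on the instance."""

    def __init__(self, page_num=0):
        self.page_num = page_num  # 0 is treated as "all pages" in this model

    def ocr(self, pages):
        # The limit is clamped to this document's length and then KEPT,
        # so the next, longer document inherits the smaller limit.
        if self.page_num > len(pages) or self.page_num == 0:
            self.page_num = len(pages)
        return pages[: self.page_num]


class FixedEngine(BuggyEngine):
    """Same logic, but the clamped limit lives in a local variable."""

    def ocr(self, pages):
        limit = self.page_num
        if limit > len(pages) or limit == 0:
            limit = len(pages)
        return pages[:limit]


single = ["p1"]
multi = ["p1", "p2", "p3"]

buggy = BuggyEngine(page_num=999)
buggy.ocr(single)
print(len(buggy.ocr(multi)))  # 1: only the first page survives

fixed = FixedEngine(page_num=999)
fixed.ocr(single)
print(len(fixed.ocr(multi)))  # 3: every page is returned
```

In this model the order of calls matters: a multi-page document processed first would come back complete, which matches the reproduction steps of "single-page first, then multi-page".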

minboo commented 1 year ago

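Re-initializing an instance on every call is very time-consuming. Creating the instance takes longer than the recognition itself, so how is that usable?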

minboo commented 1 year ago

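If page_num on every ocr.ocr call is determined by the first PDF passed in, then what is the point of the page_num parameter at initialization? I suggest changing this behavior.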

shiyutang commented 1 year ago

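I suggest trying the PR. I just checked and it does resolve the problem, and it has already been merged.

Quoted from livingbody: "Found the problem; a PR is in progress, one moment. PR link: #10290"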

shiyutang commented 1 year ago

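The replies above have fully answered the question. If you have new questions, feel free to open an issue at any time or continue replying under this one. We have also launched an issue-tackling campaign for the PaddlePaddle toolkits, and interested developers are welcome to join: https://github.com/PaddlePaddle/PaddleOCR/issues/10223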

ColorfulDick commented 6 months ago

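I have reproduced this problem as well: after initializing PaddleOCR and feeding in a PDF file several times, sometimes only a limited number of pages are recognized.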

clSpider commented 4 months ago

I ran into this problem too: when multi-page PDFs are recognized consecutively, only the first page is recognized.