PaddlePaddle / PaddleHub

Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
https://www.paddlepaddle.org.cn/hub
Apache License 2.0
12.67k stars · 2.07k forks

OCR over many images runs out of memory #1837

Open tanjh opened 2 years ago

tanjh commented 2 years ago

Welcome, and thank you for reporting a PaddleHub issue and for your contribution to PaddleHub! When filing your issue, please also provide the following information:

2) System environment: please describe the OS type (e.g. Linux/Windows/MacOS) and the Python version:

```
ocr-paddle]# uname -a
Linux ecs-iot-prod-01 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
ocr-paddle]# /usr/local/bin/python3.6 --version
Python 3.6.5
```

After startup the process uses only about 2 GB, then climbs to 3 GB, and after roughly 30 minutes it suddenly jumps to 7 GB, at which point it is OOM-killed by the system.
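One low-overhead way to pin down when the jump happens is to log the process's resident set size around each batch. The sketch below is not from the original report; it uses only the stdlib `resource` module (Linux/macOS), and `peak_rss_mb` is an illustrative name:

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    # and tracks the *peak* resident set size of this process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Call this before and after each OCR batch and log the value;
# a sudden step between two batches localizes the allocation.
print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

Correlating these log lines with timestamps would show whether the 3 GB → 7 GB jump coincides with a particular image or batch.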

Attached (the relevant code):

```python
import os
import threading
import time
import traceback

import cv2

if __name__ == '__main__':
    # rootpath / cachepath are defined elsewhere in the script
    MAX_WORKER_NUM = 2
    ts = []
    for x in range(MAX_WORKER_NUM):
        t = threading.Thread(target=read_image_from_local, args=(rootpath, cachepath))
        t.start()
        ts.append(t)
```

```python
PROCESSED_CACHE = {}   # originally declared as PROCESSING_CACHE but used as PROCESSED_CACHE; unified here
PROCESS_BATCH_NUM = 5
processed_data = []

def read_image_from_local(path, cachepath):
    cacheing = []
    for f in walk_directory(path):
        plateNo = f[f.rindex(os.path.sep)+1:f.rindex(".")]
        if plateNo in PROCESSED_CACHE:
            continue
        PROCESSED_CACHE[plateNo] = f

        item = {}
        item["plate_number"] = plateNo
        item["local_vehicle_license_pic"] = f
        # itemlist.append(item)
        # if len(itemlist) < PROCESS_BATCH_NUM:
        #     continue

        try:
            ll = ocr_paddle_hub(item)
            del item
            processed_data.extend(ll)
            for l in ll:
                cacheing.append({"plateNo": l["plateNo"], "type": l["type"]})
            del ll
        except Exception:
            traceback.print_exc()  # was format_exc(), whose return value was discarded

        print("cacheing size: ", len(cacheing))
        if len(cacheing) < PROCESS_BATCH_NUM:
            continue
        save_processed_cache(cachepath, cacheing)
        cacheing = []
        # itemlist = []
        time.sleep(0)
```

```python
def walk_directory(path):
    if not os.path.isdir(path):
        yield path
        return  # plain file: nothing more to walk

    for subp in os.listdir(path):
        abssubp = os.path.join(os.path.abspath(path), subp)
        if os.path.isdir(abssubp):
            for s in walk_directory(abssubp):
                yield s
        else:
            yield abssubp
```

```python
def ocr_paddle_hub(item):
    np_images = [cv2.imread(item["local_vehicle_license_pic"])]
    results = OCRService.recognize_text(
        images=np_images,               # image data, ndarray.shape [H, W, C], BGR order
        use_gpu=False,                  # whether to use the GPU; if so, set CUDA_VISIBLE_DEVICES first
        output_dir='ocr_result_final',  # where result images are saved (default: ocr_result)
        visualization=False,            # whether to save the recognition result as an image file
        box_thresh=0.5,                 # confidence threshold for detected text boxes
        text_thresh=0.5)                # confidence threshold for recognized Chinese text
    del np_images
    vlist = []
    for result in results:
        data = result['data']
        veichle = {"type": None}
        veichle["plateNo"] = item["plate_number"]
        veichle["local_vehicle_license_pic"] = item["local_vehicle_license_pic"]
        veichle["save_path"] = result['save_path']
        for infomation in data:
            print('text: ', infomation['text'],
                  '\nconfidence: ', infomation['confidence'],
                  '\ntext_box_position: ', infomation['text_box_position'])
            if infomation['text'].rstrip().endswith(u"车"):
                veichle["type"] = infomation['text']
                break
        if veichle["type"] is not None:
            log.debug("OCR#: " + veichle["plateNo"] + ", " + veichle["type"] + ", " + veichle["save_path"])
            vlist.append(veichle)
    return vlist
```

tanjh commented 2 years ago

[server monitoring screenshot] The screenshot above is from server monitoring. Memory hovers around 70% until about 08:00 on 4/18, when it suddenly spikes to 90%. Note: the program was started at 19:50 on 4/17 and runs OCR over 73,805 local jpg files (about 500 KB each). See the issue description for the program logic.

rainyfly commented 2 years ago

How is processed_data handled afterwards?

tanjh commented 2 years ago

processed_data is never used in the end; it just keeps accumulating data. It does contribute to memory growth, but it shouldn't cause a sudden spike.
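Since `processed_data` grows without bound, one way to confirm (or rule out) which allocation site dominates the growth is the stdlib `tracemalloc` module. The sketch below is not from the issue; `leaky` is a stand-in for `processed_data`:

```python
import tracemalloc

tracemalloc.start()

leaky = []  # stand-in for processed_data: appended to on every batch, never cleared

snap1 = tracemalloc.take_snapshot()
for i in range(10000):
    leaky.append({"plateNo": str(i), "type": "car"})
snap2 = tracemalloc.take_snapshot()

# Rank allocation sites by how much they grew between the two snapshots;
# the append loop above should appear at or near the top.
stats = snap2.compare_to(snap1, "lineno")
for stat in stats[:3]:
    print(stat)
```

Running two snapshots around a few OCR batches in the real script would show whether the sudden jump comes from `processed_data` itself or from allocations inside the OCR call.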

rainyfly commented 2 years ago

"After startup the process uses only about 2 GB, then climbs to 3 GB, and after roughly 30 minutes it suddenly jumps to 7 GB, at which point it is OOM-killed by the system." So right now it gets OOM-killed after about half an hour of running?

tanjh commented 2 years ago

The duration varies. Before 19:50, during the sawtooth pattern, it was OOM-killed roughly every 10 minutes. Last night it ran until about 5 a.m. today before being killed. I later added process monitoring to relaunch the process after it gets killed, plus a local cache mechanism, and with that I just barely managed to finish OCR on all 73,805 images. [screenshot] [screenshot]
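The workaround described here (relaunch the worker after the OOM kill and resume from the local cache) can be sketched as a small watchdog loop. This is an illustration, not the author's actual monitoring setup; `run_until_done`, `max_restarts`, and `backoff_sec` are made-up names:

```python
import subprocess
import sys
import time

def run_until_done(cmd, max_restarts=5, backoff_sec=1.0):
    """Run cmd, relaunching it whenever it dies, until it exits cleanly."""
    restarts = 0
    while True:
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return restarts  # clean exit: the whole batch finished
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError(f"worker kept dying; gave up after {max_restarts} restarts")
        time.sleep(backoff_sec)

# A worker that exits cleanly needs no restarts:
restarts = run_until_done([sys.executable, "-c", "import sys; sys.exit(0)"])
print("restarts:", restarts)
```

Combined with a persisted `PROCESSED_CACHE` file, a relaunched worker skips images that were already recognized, so progress survives each OOM kill.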

tanjh commented 2 years ago

Can someone help look into this?

rainyfly commented 2 years ago

Could you send us the source file you're running, so we can reproduce the problem and debug it?

tanjh commented 2 years ago

I've already pasted all the code above.

rainyfly commented 2 years ago

OK, I'll give it a try.