PaddlePaddle / PaddleHub

Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
https://www.paddlepaddle.org.cn/hub
Apache License 2.0
12.67k stars · 2.07k forks

OCR over many images runs out of memory #1837

Open tanjh opened 2 years ago

tanjh commented 2 years ago

Welcome, and thank you for reporting a PaddleHub issue and for your contribution to PaddleHub! When filing your issue, please also provide the following information:

2) System environment: please describe the OS type (e.g. Linux/Windows/MacOS) and the Python version:

```
ocr-paddle]# uname -a
Linux ecs-iot-prod-01 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
ocr-paddle]# /usr/local/bin/python3.6 --version
Python 3.6.5
```

After startup the process uses only about 2 GB, then climbs to 3 GB, and after roughly 30 minutes it suddenly jumps to 7 GB, at which point it is OOM-killed by the system.
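One low-overhead way to pin down when the jump happens is to log the process's resident set size around each batch. The sketch below is not from the original report; it uses only the stdlib `resource` module (Linux/macOS), and `peak_rss_mb` is an illustrative name:

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    # and tracks the *peak* resident set size of this process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Call this before and after each OCR batch and log the value;
# a sudden step between two batches localizes the allocation.
print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

Correlating these log lines with timestamps would show whether the 3 GB → 7 GB jump coincides with a particular image or batch.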

Attached (the relevant code):

```python
import os
import threading
import time
import traceback

import cv2

if __name__ == '__main__':
    # rootpath / cachepath are defined elsewhere in the script
    MAX_WORKER_NUM = 2
    ts = []
    for x in range(MAX_WORKER_NUM):
        t = threading.Thread(target=read_image_from_local, args=(rootpath, cachepath))
        t.start()
        ts.append(t)
```

```python
PROCESSED_CACHE = {}   # originally declared as PROCESSING_CACHE but used as PROCESSED_CACHE; unified here
PROCESS_BATCH_NUM = 5
processed_data = []

def read_image_from_local(path, cachepath):
    cacheing = []
    for f in walk_directory(path):
        plateNo = f[f.rindex(os.path.sep)+1:f.rindex(".")]
        if plateNo in PROCESSED_CACHE:
            continue
        PROCESSED_CACHE[plateNo] = f

        item = {}
        item["plate_number"] = plateNo
        item["local_vehicle_license_pic"] = f
        # itemlist.append(item)
        # if len(itemlist) < PROCESS_BATCH_NUM:
        #     continue

        try:
            ll = ocr_paddle_hub(item)
            del item
            processed_data.extend(ll)
            for l in ll:
                cacheing.append({"plateNo": l["plateNo"], "type": l["type"]})
            del ll
        except Exception:
            traceback.print_exc()  # was format_exc(), whose return value was discarded

        print("cacheing size: ", len(cacheing))
        if len(cacheing) < PROCESS_BATCH_NUM:
            continue
        save_processed_cache(cachepath, cacheing)
        cacheing = []
        # itemlist = []
        time.sleep(0)
```

```python
def walk_directory(path):
    if not os.path.isdir(path):
        yield path
        return  # plain file: nothing more to walk

    for subp in os.listdir(path):
        abssubp = os.path.join(os.path.abspath(path), subp)
        if os.path.isdir(abssubp):
            for s in walk_directory(abssubp):
                yield s
        else:
            yield abssubp
```

```python
def ocr_paddle_hub(item):
    np_images = [cv2.imread(item["local_vehicle_license_pic"])]
    results = OCRService.recognize_text(
        images=np_images,               # image data, ndarray.shape [H, W, C], BGR order
        use_gpu=False,                  # whether to use the GPU; if so, set CUDA_VISIBLE_DEVICES first
        output_dir='ocr_result_final',  # where result images are saved (default: ocr_result)
        visualization=False,            # whether to save the recognition result as an image file
        box_thresh=0.5,                 # confidence threshold for detected text boxes
        text_thresh=0.5)                # confidence threshold for recognized Chinese text
    del np_images
    vlist = []
    for result in results:
        data = result['data']
        veichle = {"type": None}
        veichle["plateNo"] = item["plate_number"]
        veichle["local_vehicle_license_pic"] = item["local_vehicle_license_pic"]
        veichle["save_path"] = result['save_path']
        for infomation in data:
            print('text: ', infomation['text'],
                  '\nconfidence: ', infomation['confidence'],
                  '\ntext_box_position: ', infomation['text_box_position'])
            if infomation['text'].rstrip().endswith(u"车"):
                veichle["type"] = infomation['text']
                break
        if veichle["type"] is not None:
            log.debug("OCR#: " + veichle["plateNo"] + ", " + veichle["type"] + ", " + veichle["save_path"])
            vlist.append(veichle)
    return vlist
```

tanjh commented 2 years ago

[server monitoring screenshot] The screenshot above is from server monitoring. Memory hovers around 70% until about 08:00 on 4/18, when it suddenly spikes to 90%. Note: the program was started at 19:50 on 4/17 and runs OCR over 73,805 local jpg files (about 500 KB each). See the issue description for the program logic.

rainyfly commented 2 years ago

How is processed_data handled afterwards?

tanjh commented 2 years ago

processed_data is never used in the end; it just keeps accumulating data. It does contribute to memory growth, but it shouldn't cause a sudden spike.
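Since `processed_data` grows without bound, one way to confirm (or rule out) which allocation site dominates the growth is the stdlib `tracemalloc` module. The sketch below is not from the issue; `leaky` is a stand-in for `processed_data`:

```python
import tracemalloc

tracemalloc.start()

leaky = []  # stand-in for processed_data: appended to on every batch, never cleared

snap1 = tracemalloc.take_snapshot()
for i in range(10000):
    leaky.append({"plateNo": str(i), "type": "car"})
snap2 = tracemalloc.take_snapshot()

# Rank allocation sites by how much they grew between the two snapshots;
# the append loop above should appear at or near the top.
stats = snap2.compare_to(snap1, "lineno")
for stat in stats[:3]:
    print(stat)
```

Running two snapshots around a few OCR batches in the real script would show whether the sudden jump comes from `processed_data` itself or from allocations inside the OCR call.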

rainyfly commented 2 years ago

"After startup the process uses only about 2 GB, then climbs to 3 GB, and after roughly 30 minutes it suddenly jumps to 7 GB, at which point it is OOM-killed by the system." So right now it gets OOM-killed after about half an hour of running?

tanjh commented 2 years ago

The duration varies. Before 19:50, during the sawtooth pattern, it was OOM-killed roughly every 10 minutes. Last night it ran until about 5 a.m. today before being killed. I later added process monitoring to relaunch the process after it gets killed, plus a local cache mechanism, and with that I just barely managed to finish OCR on all 73,805 images. [screenshot] [screenshot]
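The workaround described here (relaunch the worker after the OOM kill and resume from the local cache) can be sketched as a small watchdog loop. This is an illustration, not the author's actual monitoring setup; `run_until_done`, `max_restarts`, and `backoff_sec` are made-up names:

```python
import subprocess
import sys
import time

def run_until_done(cmd, max_restarts=5, backoff_sec=1.0):
    """Run cmd, relaunching it whenever it dies, until it exits cleanly."""
    restarts = 0
    while True:
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return restarts  # clean exit: the whole batch finished
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError(f"worker kept dying; gave up after {max_restarts} restarts")
        time.sleep(backoff_sec)

# A worker that exits cleanly needs no restarts:
restarts = run_until_done([sys.executable, "-c", "import sys; sys.exit(0)"])
print("restarts:", restarts)
```

Combined with a persisted `PROCESSED_CACHE` file, a relaunched worker skips images that were already recognized, so progress survives each OOM kill.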

tanjh commented 2 years ago

Can someone help look into this?

rainyfly commented 2 years ago

Could you send us the source file you're running, so we can reproduce the problem and debug it?

tanjh commented 2 years ago

I've already pasted all the code above.

rainyfly commented 2 years ago

OK, I'll give it a try.