PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle (飞桨): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

How to run inference (infer) over a large dataset? #9823

Closed. yttbgf closed this issue 6 years ago.

yttbgf commented 6 years ago

Is there a method like the ones used for train/test that takes a generator function and lets you specify a batch size? Directly passing input=the full dataset makes memory usage spike sharply; it seems an extra copy of the data is made, so memory blows up.

shenchong721 commented 6 years ago

The reader itself is an iterator, so it does not load the whole dataset into memory; at inference time you can read one batch at a time:

    import gzip
    import paddle.v2 as paddle

    # Load the trained parameters and create the sample-level reader (an iterator).
    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path, "r"))
    test_reader = data_reader(data_dir)()

    test_batch = []
    label_batch = []
    for idx, item in enumerate(test_reader):
        test_batch.append(item)
        label_batch.append(item[-2])
        # Run inference as soon as a full batch has been collected.
        if len(test_batch) == batch_size:
            prediction = paddle.infer(prob_layer, parameters=parameters, input=test_batch, feeding=feeding)
            for i in range(len(prediction)):
                print('\t'.join([str(label_batch[i][0]), str(prediction[i][0])]))
            test_batch = []
            label_batch = []

    # Flush the last, possibly smaller, batch.
    if len(test_batch):
        prediction = paddle.infer(prob_layer, parameters=parameters, input=test_batch, feeding=feeding)
        for i in range(len(prediction)):
            print('\t'.join([str(label_batch[i][0]), str(prediction[i][0])]))
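
As a side note, the same pattern can be written a bit more compactly with `paddle.batch`, which wraps a sample-level reader into one that yields lists of `batch_size` samples while still reading lazily. A minimal sketch, assuming the old paddle.v2 API and the same `data_reader`, `prob_layer`, `parameters`, `batch_size` and `feeding` objects as in the snippet above (this variant is not taken from the thread):

    import gzip
    import paddle.v2 as paddle

    # paddle.batch turns a per-sample reader into a per-batch reader;
    # samples are still produced lazily, so memory stays bounded.
    batch_reader = paddle.batch(data_reader(data_dir), batch_size=batch_size)

    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path, "r"))
    for test_batch in batch_reader():
        prediction = paddle.infer(prob_layer, parameters=parameters,
                                  input=test_batch, feeding=feeding)
        for label, prob in zip([item[-2] for item in test_batch], prediction):
            print('\t'.join([str(label[0]), str(prob[0])]))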

kuke commented 6 years ago

The common practice is to shuffle the ids of the samples, rather than reading the whole dataset into memory and then shuffling it there.
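
A minimal sketch of that idea (not from the thread; `sample_ids` and `load_sample` are hypothetical placeholders for, say, a list of file names or line offsets and a function that reads one example from disk):

    import random

    def shuffled_reader(sample_ids, load_sample):
        def reader():
            # Only the lightweight ids are held and shuffled in memory.
            ids = list(sample_ids)
            random.shuffle(ids)
            for sid in ids:
                # Each sample is loaded from disk only when it is needed.
                yield load_sample(sid)
        return reader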

yeyupiaoling commented 6 years ago

@yttbgf This approach is used in the scene text recognition example: https://github.com/PaddlePaddle/models/blob/6fa8a94bdae3bd99c9b806581af8928c808a1fdf/scene_text_recognition/infer.py#L73-L82

    test_batch = []
    labels = []
    for i, (image, label) in enumerate(
            data_generator.infer_reader(infer_file_list)()):
        test_batch.append([image])
        labels.append(label)
        if len(test_batch) == batch_size:
            infer_batch(inferer, test_batch, labels, reversed_char_dict)
            test_batch = []
            labels = []
    # Infer the remaining samples that did not fill a complete batch.
    if test_batch:
        infer_batch(inferer, test_batch, labels, reversed_char_dict)