PaddlePaddle / PaddleHub

Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
https://www.paddlepaddle.org.cn/hub
Apache License 2.0

When the validation/test dataset is small, a batch_size that is too large causes IndexError: list index out of range #1020

Open youpanpan opened 4 years ago

youpanpan commented 4 years ago

Version info: PaddleHub 1.8.2, PaddlePaddle 1.8.5, Python 3.7.2
System environment: Windows 10

Problem description: When fine-tuning with PaddleHub on a custom dataset with very little data (training data: 30 samples; validation/test data: 37 samples), the run configuration set via hub.RunConfig is:

self.config = hub.RunConfig(
    use_cuda=False,
    num_epoch=10,
    batch_size=10,
    eval_interval=10,
    strategy=self.strategy)

Executing task.finetune_and_eval() then raises:

File "C:\software\work\Python3.7\lib\site-packages\paddlehub\finetune\task\classifier_task.py", line 117, in _calculate_metrics
    run_time_used = time.time() - run_states[0].run_time_begin
IndexError: list index out of range

Changing batch_size to 6 or lower makes it run normally.
Reference AI Studio project: binary classification with PaddleHub之《青春有你2》 (Youth With You 2).
Full code below:

#coding:utf-8

import os
import json
import paddlehub as hub
from PIL import Image
import matplotlib.pyplot as plt

from paddlehub.dataset.base_cv_dataset import BaseCVDataset

# Load the custom dataset
class StarDataset(BaseCVDataset):
    def __init__(self):

        # Location of the dataset on disk
        base_path = "Python小白逆袭大神/青春有你2选手图片分类/dataset"

        super(StarDataset, self).__init__(
            base_path=base_path,
            train_list_file='train_list.txt',
            validate_list_file='validate_list.txt',
            test_list_file='test_list.txt',
            label_list_file='label_list.txt',
            label_list=None
        )

class StarFinetuneTask(object):
    def __init__(self):
        self.dataset = StarDataset()

        # Load the pre-trained model
        self.module = hub.Module(name='mobilenet_v2_imagenet')

        # Configure the data reader (preprocessing)
        self.reader = hub.reader.ImageClassificationReader(
            image_width=self.module.get_expected_image_width(),
            image_height=self.module.get_expected_image_height(),
            images_mean=self.module.get_pretrained_images_mean(),
            images_std=self.module.get_pretrained_images_std(),
            dataset=self.dataset)

        # Choose the fine-tuning strategy
        self.strategy = hub.DefaultFinetuneStrategy()

        # Set the runtime configuration
        self.config = hub.RunConfig(
            use_cuda=False,
            num_epoch=10,
            batch_size=10,
            eval_interval=10,
            strategy=self.strategy)

        # Build the training task
        input_dict, output_dict, program = self.module.context(trainable=True)
        img = input_dict["image"]
        feature_map = output_dict["feature_map"]
        feed_list = [img.name]

        self.task = hub.ImageClassifierTask(
            data_reader=self.reader,
            feature=feature_map,
            feed_list=feed_list,
            num_classes=self.dataset.num_labels,
            config=self.config)

    def finetune_and_eval(self):
        # Start fine-tuning
        self.task.finetune_and_eval()

task = StarFinetuneTask()
task.finetune_and_eval()
haoyuying commented 4 years ago

The dataset is too small to be distributed across devices. We suggest running on GPU, or setting CPU_NUM to 1.
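For reference, a minimal sketch of the suggested workaround, assuming the StarFinetuneTask class from the report above. Setting the CPU_NUM environment variable to 1 before the task is built keeps PaddlePaddle from splitting the tiny evaluation set across multiple CPU devices (presumably the evaluation pass produced no run states for one device, which matches the empty run_states in the traceback); lowering batch_size, which the reporter confirmed works at 6 or below, is the alternative.

#coding:utf-8
import os

# Assumption: run on a single CPU device so evaluation batches are not
# split across devices; this mirrors the maintainer's "set CPU_NUM to 1" advice.
os.environ["CPU_NUM"] = "1"

task = StarFinetuneTask()
task.finetune_and_eval()

If CPU_NUM is left unchanged, passing a smaller batch_size (e.g. batch_size=6) to hub.RunConfig in StarFinetuneTask is the other workaround the reporter verified.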