PaddlePaddle / PaddleHub

Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
https://www.paddlepaddle.org.cn/hub
Apache License 2.0

Error when fine-tuning on a custom dataset with PaddleHub: ZeroDivisionError: float division by zero #904

Closed wcxiaowang closed 3 years ago

wcxiaowang commented 4 years ago

The code is as follows:

import argparse
import ast

import paddle.fluid as fluid
import paddlehub as hub
from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for fine-tuning, input should be True or False")
parser.add_argument("--checkpoint_dir", type=str, default="./aimoli", help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=96, help="Number of words of the longest seqence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
args = parser.parse_args()  # collect every "add_argument" option into the args namespace

jieba_paddle = hub.Module(name='jieba_paddle')

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Dataset location
        self.dataset_dir = r'D:\xampp\htdocs\python\log\data'  # "/data/semantic/"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train_test.tsv",
            dev_file="dev_test.tsv",
            test_file="test_test.tsv",
            # Prediction data (no text label needed) can go in predict.tsv
            predict_file="test_test.tsv",
            train_file_with_header=True,
            dev_file_with_header=True,
            test_file_with_header=True,
            # predict_file_with_header=True,
            # Set of labels in the dataset
            label_list=["0", "1"])

def cut(text):
    res = jieba_paddle.cut(text, use_paddle=False)
    return res

if __name__ == '__main__':

    # Load PaddleHub Senta pretrained model
    module = hub.Module(name="senta_bilstm")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)

    # Tokenizer tokenizes the text data and encodes the data as model needed.
    # If you use transformer modules (ernie, bert, roberta and so on), tokenizer should be hub.BertTokenizer.
    # Otherwise, tokenizer should be hub.CustomTokenizer.
    # If you choose CustomTokenizer, you can also change the chinese word segmentation tool, for example jieba.
    tokenizer = hub.CustomTokenizer(
        vocab_file=module.get_vocab_path(),  # returns the vocab file of the pretrained model
        tokenize_chinese_chars=True,         # whether to split Chinese text
        cut_function=cut,  # jieba.cut as cut function
    )

    # Prepare the custom fine-tuning dataset
    dataset = DemoDataset()
    print(dataset)
    reader = hub.reader.LACClassifyReader(
        dataset=dataset,
        vocab_path=module.get_vocab_path())

    # Construct transfer learning network
    # Use sentence-level output: the Senta sentence feature, usable as the sentence representation
    sent_feature = outputs["sentence_feature"]

    # Select the fine-tune strategy
    strategy = hub.AdamWeightDecayStrategy(
        learning_rate=1e-5,
        weight_decay=0.01,
        warmup_proportion=0.1,
        lr_scheduler="linear_decay",
    )

    # Setup RunConfig for PaddleHub Fine-tune API
    config = hub.RunConfig(
        use_cuda=args.use_gpu,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=strategy)

    # Define a classification fine-tune task by PaddleHub's API
    # Build the network and create a classification transfer task for fine-tuning:
    # given the input feature, labels and the number of classes, TextClassifierTask sets up a transfer task for text classification
    cls_task = hub.TextClassifierTask(
        dataset=dataset,
        feature=sent_feature,
        num_classes=dataset.num_labels,
        config=config)
    print('start')
    cls_task.finetune_and_eval()

The dataset does have data ("Dataset: DemoDataset with 17 train examples, 5 dev examples and 2 test examples"), and the format also looks fine.

wcxiaowang commented 4 years ago

But it keeps raising the error (attached screenshot): ZeroDivisionError: float division by zero.

wcxiaowang commented 4 years ago

However, if I switch the dataset to the following, everything works fine:

dataset = hub.dataset.ChnSentiCorp(
    tokenizer=tokenizer, max_seq_len=args.max_seq_len)

Could someone explain why? My PaddleHub version is 1.8.2.

Steffy-zxf commented 4 years ago

Hi! The Fine-tune API was upgraded after PaddleHub 1.8.0, so we recommend the second approach for transfer-learning fine-tuning:

dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=args.max_seq_len)
wcxiaowang commented 4 years ago

Then how do I use a dataset that we curated by hand as the data source for fine-tuning? In our actual scenario we manually curate a batch of data to calibrate the model. Does that mean I can only install PaddleHub 1.7?

Steffy-zxf commented 4 years ago

If you are using PaddleHub 1.8, please fine-tune following the 1.8-style demos: https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.8/demo

If you are using PaddleHub 1.7, please fine-tune following the 1.7-style demos: https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.7/demo

The way to define a custom dataset is the same in both versions; the code is as follows:

from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Dataset location
        self.dataset_dir = "path/to/dataset"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="test.tsv",
            # Prediction data (no label needed) can go in predict.tsv
            predict_file="predict.tsv",
            train_file_with_header=True,
            dev_file_with_header=True,
            test_file_with_header=True,
            predict_file_with_header=True,
            # Set of labels in the dataset
            label_list=["0", "1"])

dataset = DemoDataset()

Reference documentation: https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/tutorial/how_to_load_data.md
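For context, a minimal illustrative sketch of what such a tab-separated train.tsv / dev.tsv / test.tsv could contain, assuming a header row followed by a label column and a text column (the header names and example rows below are made up, not from the thread); the label values must match the entries in label_list:

label	text_a
1	这款爽肤水保湿效果很好,会回购
0	用了之后头皮发痒,效果一般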

wcxiaowang commented 4 years ago

But on version 1.8, when I run it with the custom-dataset approach you gave above, I get this error (attached screenshot). What is the cause? I already pasted my code in my first post.

Steffy-zxf commented 4 years ago

PaddleHub 1.8 added the tokenizer usage, so the dataset needs to be constructed with a tokenizer, as in the approach you found to work:

dataset = hub.dataset.ChnSentiCorp(
    tokenizer=tokenizer, max_seq_len=args.max_seq_len)

wcxiaowang commented 4 years ago

I don't quite understand. Didn't you provide the approach in the screenshot above? Why is it dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=args.max_seq_len) again? Could you provide a complete demo of the custom-dataset way?

Steffy-zxf commented 4 years ago

With PaddleHub 1.8, load a custom dataset with the following code:

import codecs
import csv

from paddlehub.dataset import InputExample
from paddlehub.dataset.base_nlp_dataset import TextClassificationDataset

class DemoDataset(TextClassificationDataset):
    """
    Demo Dataset
    """
    def __init__(self, tokenizer=None, max_seq_len=None):
        base_path = "path/to/dataset"
        super(DemoDataset, self).__init__(
            base_path=base_path,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="test.tsv",
            label_file=None,
            label_list=["0", "1"],
            tokenizer=tokenizer,
            max_seq_len=max_seq_len)

    def _read_file(self, input_file, phase=None):
        """ 从数据文件中读入数据"""
        with codecs.open(input_file, "r", encoding="UTF-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=None)
            examples = []
            seq_id = 0
            header = next(reader)  # skip header
            for line in reader:
                example = InputExample(
                    guid=seq_id, label=line[0], text_a=line[1])
                seq_id += 1
                examples.append(example)

            return examples

Fine-tune demo:

import argparse
import ast
import paddlehub as hub

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
args = parser.parse_args()
# yapf: enable.

jieba_paddle = hub.Module(name='jieba_paddle')

def cut(text):
    res = jieba_paddle.cut(text, use_paddle=False)
    return res

if __name__ == '__main__':

    # Load Paddlehub Senta pretrained model
    module = hub.Module(name="senta_bilstm")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)

    # Tokenizer tokenizes the text data and encodes the data as model needed.
    # If you use transformer modules (ernie, bert, roberta and so on), tokenizer should be hub.BertTokenizer.
    # Otherwise, tokenizer should be hub.CustomTokenizer.
    # If you choose CustomTokenizer, you can also change the chinese word segmentation tool, for example jieba.
    tokenizer = hub.CustomTokenizer(
        vocab_file=module.get_vocab_path(),
        tokenize_chinese_chars=True,
        cut_function=cut,  # jieba.cut as cut function
    )

    dataset = DemoDataset(
        tokenizer=tokenizer, max_seq_len=args.max_seq_len)

    # Construct transfer learning network
    # Use sentence-level output.
    sent_feature = outputs["sentence_feature"]

    # Select fine-tune strategy
    strategy = hub.DefaultStrategy(
        optimizer_name="adam", learning_rate=args.learning_rate)

    # Setup RunConfig for PaddleHub Fine-tune API
    config = hub.RunConfig(
        use_cuda=False,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=strategy)

    # Define a classification fine-tune task by PaddleHub's API
    cls_task = hub.TextClassifierTask(
        dataset=dataset,
        feature=sent_feature,
        num_classes=dataset.num_labels,
        config=config,
        metrics_choices=["acc"])

    # Fine-tune and evaluate by PaddleHub's API
    # will finish training, evaluation, testing, save model automatically
    cls_task.finetune_and_eval()
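For convenience, one possible way to run the demo above from the command line; the script filename and checkpoint directory are made up for illustration, while the flags come from the argparse options defined in the demo:

python senta_finetune_demo.py --num_epoch 3 --batch_size 32 --max_seq_len 96 --checkpoint_dir ./ckpt_senta_demo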

Note:

wcxiaowang commented 4 years ago

Thank you very much, it runs now. The training set has about 1,000 rows. After training finished I picked a few examples from it to predict, but the predictions do not match the labels in the training set. What could be the reason? The code is as follows:

cls_task.finetune_and_eval()

data = ["用了起痘",
        "整体评价:我脸颊总是长痘,前段时间看小红书入手的这款爽肤水,我是每天晚上洗过脸之后湿敷的,之前是买的250ml的,我差不多用了两个多月一点儿,感觉脸颊最近没有再长痘痘了,而且以前脸上老是发痒,最近也很少出现这种情况了,所以这次618我就果断入手的大瓶的,这次还送了好几瓶小瓶的。 保湿控油情况:很好 吸收效果:很好 我的肤质:混油皮 ",
        "广告打的非常好,抱着试一试买来用下,果然对头痒一点效果没有",
        "朋友用了推荐给我的,但是洗发水因人而异吧,我用了后头发是不油了,也很蓬松,但是有了头皮屑,隔天就开始痒。"
        ]
for text in data:
    print(text)
encoded_data = [
    tokenizer.encode(text=text, max_seq_len=args.max_seq_len)
    for text in data
]
label_list = dataset.get_labels()
print(cls_task.predict(data=encoded_data, label_list=label_list))

The data above are a few examples picked from train.tsv, but the predicted results do not match the labels in train.tsv.

Steffy-zxf commented 4 years ago

When predicting, is the saved checkpoint actually loaded? The checkpoint directory specified for prediction must be the same one used for training.
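A minimal sketch (my own, not from the thread) of how a prediction run could reuse the training checkpoint; dataset, tokenizer, sent_feature, args and data are assumed to be set up exactly as in the fine-tune demo above, and ./ckpt_senta_demo is a hypothetical directory that must equal the checkpoint_dir used during training:

# Rebuild the task with the same checkpoint_dir that was used for fine-tuning,
# so the trained parameters are restored before predicting.
config = hub.RunConfig(
    use_cuda=False,
    checkpoint_dir="./ckpt_senta_demo",  # hypothetical; must match the training checkpoint_dir
    strategy=hub.DefaultStrategy(optimizer_name="adam", learning_rate=5e-5))

cls_task = hub.TextClassifierTask(
    dataset=dataset,              # same dataset and tokenizer as training
    feature=sent_feature,
    num_classes=dataset.num_labels,
    config=config,
    metrics_choices=["acc"])

# Encode raw texts with the same tokenizer and max_seq_len used for training.
encoded_data = [
    tokenizer.encode(text=t, max_seq_len=args.max_seq_len) for t in data
]
print(cls_task.predict(data=encoded_data, label_list=dataset.get_labels()))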

haoyuying commented 3 years ago

Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.