Closed: wcxiaowang closed this issue 3 years ago.
But it keeps reporting an error.
If I switch the dataset to the form below, everything works fine:

dataset = hub.dataset.ChnSentiCorp(
    tokenizer=tokenizer, max_seq_len=args.max_seq_len)

Could someone please explain why? My PaddleHub version is 1.8.2.
Hi! The Finetune API was upgraded after PaddleHub 1.8.0, so we recommend using the second approach for transfer-learning fine-tuning:
dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=args.max_seq_len)
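For context, a minimal sketch of where that tokenizer typically comes from in the 1.8-style flow; the ernie_tiny module name and max_seq_len value here are only illustrative assumptions, not from this thread:

import paddlehub as hub

# Illustrative 1.8-style flow (module name and max_seq_len are assumptions).
module = hub.Module(name="ernie_tiny")
inputs, outputs, program = module.context(trainable=True, max_seq_len=128)

# Transformer modules (ernie, bert, roberta, ...) use hub.BertTokenizer;
# the dataset then encodes its samples with this tokenizer.
tokenizer = hub.BertTokenizer(vocab_file=module.get_vocab_path())
dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=128)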
Then how can I use a dataset that I curated manually as the data source for fine-tuning? In our real use case we manually prepare a batch of data to calibrate the model. Does that mean I have to install PaddleHub 1.7?
If you use PaddleHub 1.8, please fine-tune following the 1.8 demos: https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.8/demo
If you use PaddleHub 1.7, please fine-tune following the 1.7 demos: https://github.com/PaddlePaddle/PaddleHub/tree/release/v1.7/demo
Defining a custom dataset works the same way in both versions; the code is as follows:
from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Directory where the dataset files are stored
        self.dataset_dir = "path/to/dataset"
        super(DemoDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="test.tsv",
            # If you also have prediction data (no label column), put it in predict.tsv
            predict_file="predict.tsv",
            train_file_with_header=True,
            dev_file_with_header=True,
            test_file_with_header=True,
            predict_file_with_header=True,
            # The set of labels in the dataset
            label_list=["0", "1"])

dataset = DemoDataset()
Reference documentation: https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/tutorial/how_to_load_data.md
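For reference, both snippets expect tab-separated .tsv files; with *_file_with_header=True the first row is a header, and each following row carries the label in the first column and the text in the second. The column names and rows below are only a made-up illustration of that layout:

label	text_a
1	这款爽肤水用了一段时间,脸上不再长痘了
0	广告打得很好,实际用下来没什么效果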
But when I run the custom-dataset code above on 1.8, it errors out. What is the reason? I already posted my code in the first message.
PaddleHub 1.8 added the tokenizer usage, so the tokenizer needs to be used.
If I switch the dataset to the form below, everything works fine: dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=args.max_seq_len)
I don't quite understand. That isn't what you provided above, so why is it dataset = hub.dataset.ChnSentiCorp(tokenizer=tokenizer, max_seq_len=args.max_seq_len) again? Could you provide a complete demo of the custom-dataset approach?
With PaddleHub 1.8, load a custom dataset with code like the following:
import codecs
import csv

from paddlehub.dataset import InputExample
from paddlehub.dataset.base_nlp_dataset import TextClassificationDataset

class DemoDataset(TextClassificationDataset):
    """
    Demo Dataset
    """
    def __init__(self, tokenizer=None, max_seq_len=None):
        base_path = "path/to/dataset"
        super(DemoDataset, self).__init__(
            base_path=base_path,
            train_file="train.tsv",
            dev_file="dev.tsv",
            test_file="test.tsv",
            label_file=None,
            label_list=["0", "1"],
            tokenizer=tokenizer,
            max_seq_len=max_seq_len)

    def _read_file(self, input_file, phase=None):
        """Read examples from the data file."""
        with codecs.open(input_file, "r", encoding="UTF-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=None)
            examples = []
            seq_id = 0
            header = next(reader)  # skip header
            for line in reader:
                example = InputExample(
                    guid=seq_id, label=line[0], text_a=line[1])
                seq_id += 1
                examples.append(example)
            return examples
Fine-tune demo:
import argparse
import ast

import paddlehub as hub

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epochs for fine-tuning.")
parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.")
parser.add_argument("--checkpoint_dir", type=str, default=None, help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest sequence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
args = parser.parse_args()
# yapf: enable

jieba_paddle = hub.Module(name='jieba_paddle')

def cut(text):
    res = jieba_paddle.cut(text, use_paddle=False)
    return res

if __name__ == '__main__':
    # Load the PaddleHub Senta pretrained model
    module = hub.Module(name="senta_bilstm")
    inputs, outputs, program = module.context(
        trainable=True, max_seq_len=args.max_seq_len)

    # The tokenizer tokenizes the text data and encodes it in the form the model needs.
    # If you use transformer modules (ernie, bert, roberta and so on), the tokenizer should be hub.BertTokenizer.
    # Otherwise, the tokenizer should be hub.CustomTokenizer.
    # If you choose CustomTokenizer, you can also change the Chinese word segmentation tool, for example jieba.
    tokenizer = hub.CustomTokenizer(
        vocab_file=module.get_vocab_path(),
        tokenize_chinese_chars=True,
        cut_function=cut,  # jieba.cut as the cut function
    )

    dataset = DemoDataset(
        tokenizer=tokenizer, max_seq_len=args.max_seq_len)

    # Construct the transfer learning network.
    # Use the sentence-level output.
    sent_feature = outputs["sentence_feature"]

    # Select the fine-tune strategy
    strategy = hub.DefaultStrategy(
        optimizer_name="adam", learning_rate=args.learning_rate)

    # Set up RunConfig for the PaddleHub Fine-tune API
    config = hub.RunConfig(
        use_cuda=False,
        num_epoch=args.num_epoch,
        batch_size=args.batch_size,
        checkpoint_dir=args.checkpoint_dir,
        strategy=strategy)

    # Define a classification fine-tune task with PaddleHub's API
    cls_task = hub.TextClassifierTask(
        dataset=dataset,
        feature=sent_feature,
        num_classes=dataset.num_labels,
        config=config,
        metrics_choices=["acc"])

    # Fine-tune and evaluate with PaddleHub's API.
    # This runs training, evaluation and testing, and saves the model automatically.
    cls_task.finetune_and_eval()
Note:
Thank you very much, it runs now. The training set has about 1,000 rows. After training finished, I picked a few samples from it for prediction, but the results don't match the labels in the training set. What could be the reason? The code is as follows:

cls_task.finetune_and_eval()
data = ["用了起痘",
"整体评价:我脸颊总是长痘,前段时间看小红书入手的这款爽肤水,我是每天晚上洗过脸之后湿敷的,之前是买的250ml的,我差不多用了两个多月一点儿,感觉脸颊最近没有再长痘痘了,而且以前脸上老是发痒,最近也很少出现这种情况了,所以这次618我就果断入手的大瓶的,这次还送了好几瓶小瓶的。 保湿控油情况:很好 吸收效果:很好 我的肤质:混油皮 ",
"广告打的非常好,抱着试一试买来用下,果然对头痒一点效果没有",
"朋友用了推荐给我的,但是洗发水因人而异吧,我用了后头发是不油了,也很蓬松,但是有了头皮屑,隔天就开始痒。"
]
for text in data:
    print(text)

encoded_data = [
    tokenizer.encode(text=text, max_seq_len=args.max_seq_len)
    for text in data
]
label_list = dataset.get_labels()
print(cls_task.predict(data=encoded_data, label_list=label_list))
The data above were picked from train.tsv, but the predicted results don't match the labels in train.
At prediction time, is the saved checkpoint actually being loaded? Make sure the checkpoint directory specified for prediction is the same one used for training.
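One way to ensure that is to build the prediction task with exactly the same checkpoint_dir that the fine-tune run wrote to. A minimal sketch, reusing the names (tokenizer, dataset, sent_feature, args, data) defined in the fine-tune demo and prediction snippet above; this is an illustration of the idea, not the only way to do it:

# Minimal sketch of running prediction in a separate run, assuming the
# DemoDataset / tokenizer / sent_feature definitions from the demo above.
# The essential point: checkpoint_dir must be the same directory the
# fine-tune run saved its checkpoints to; otherwise the fine-tuned
# parameters are not loaded and predictions can differ from the labels.
config = hub.RunConfig(
    use_cuda=False,
    batch_size=args.batch_size,
    checkpoint_dir=args.checkpoint_dir,  # same value as during training
    strategy=hub.DefaultStrategy())

cls_task = hub.TextClassifierTask(
    dataset=dataset,
    feature=sent_feature,
    num_classes=dataset.num_labels,
    config=config,
    metrics_choices=["acc"])

encoded_data = [
    tokenizer.encode(text=text, max_seq_len=args.max_seq_len)
    for text in data
]
print(cls_task.predict(data=encoded_data, label_list=dataset.get_labels()))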
Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
The code is as follows:

import argparse
import ast

import paddle.fluid as fluid
import paddlehub as hub
from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--num_epoch", type=int, default=3, help="Number of epochs for fine-tuning.")
parser.add_argument("--use_gpu", type=ast.literal_eval, default=False, help="Whether use GPU for fine-tuning, input should be True or False")
parser.add_argument("--checkpoint_dir", type=str, default="./aimoli", help="Directory to model checkpoint")
parser.add_argument("--max_seq_len", type=int, default=96, help="Number of words of the longest sequence.")
parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.")
args = parser.parse_args()  # collect everything set via "add_argument" into the args namespace

jieba_paddle = hub.Module(name='jieba_paddle')

class DemoDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Directory where the dataset is stored

def cut(text):
    res = jieba_paddle.cut(text, use_paddle=False)
    return res

if __name__ == '__main__':
    # Load the PaddleHub senta pretrained model
The Dataset is populated: Dataset: DemoDataset with 17 train examples, 5 dev examples and 2 test examples. The format also looks normal.