BrikerMan / Kashgari

Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.
http://kashgari.readthedocs.io/
Apache License 2.0

[Question] Problems encountered when using StackedEmbedding #428

Closed qq2499017550 closed 3 years ago

qq2499017550 commented 3 years ago

You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

Check List

Thanks for considering opening an issue. Before you submit your issue, please confirm that these boxes are checked.

You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

Environment

[Paste requirements.txt file here]

Question

[A clear and concise description of what you want to know.]

```python
from kashgari.corpus import ChineseDailyNerCorpus
from read_data2 import _read_data
import kashgari
from kashgari.embeddings import BERTEmbedding, BareEmbedding, StackedEmbedding, NumericFeaturesEmbedding
from kashgari.tasks.labeling import BiLSTM_CRF_Model, BiLSTM_Model, CNN_LSTM_Model
import time
import numpy
import glob
from kashgari import callbacks
import tensorflow as tf

# Collect the OCR text files and load tokens, labels, and the four
# per-token coordinate feature sequences.
train = glob.glob('data_location/train/*/*/*.txt')
test = glob.glob('data_location/test/*/*/*.txt')
dev = glob.glob('data_location/dev/*/*/*.txt')
train_data, train_label, train_path, trainX1, trainY1, trainX2, trainY2 = _read_data(train)
dev_data, dev_label, dev_path, devX1, devY1, devX2, devY2 = _read_data(dev)
test_data, test_label, test_path, testX1, testY1, testX2, testY2 = _read_data(test)

# One numeric embedding per coordinate feature (top-left x/y, bottom-right x/y).
x1_emb = NumericFeaturesEmbedding(feature_count=16, feature_name='x1emb', sequence_length=128)
y1_emb = NumericFeaturesEmbedding(feature_count=16, feature_name='y1emb', sequence_length=128)
x2_emb = NumericFeaturesEmbedding(feature_count=16, feature_name='x2emb', sequence_length=128)
y2_emb = NumericFeaturesEmbedding(feature_count=16, feature_name='y2emb', sequence_length=128)

# Stack the frozen BERT text embedding with the four coordinate embeddings.
text_emb = BERTEmbedding('chinese_L-12_H-768_A-12',
                         task=kashgari.LABELING,
                         sequence_length=128,
                         trainable=False)
stack_embedding = StackedEmbedding([text_emb, x1_emb, y1_emb, x2_emb, y2_emb])

# Inputs are tuples of parallel sequences: (tokens, x1, y1, x2, y2).
train_data1 = (train_data, trainX1, trainY1, trainX2, trainY2)
dev_data1 = (dev_data, devX1, devY1, devX2, devY2)
test_data1 = (test_data, testX1, testY1, testX2, testY2)

model = BiLSTM_Model(embedding=stack_embedding)

model.fit(train_data1, train_label, dev_data1, dev_label, batch_size=64, epochs=10)
model.save('position_model')
model.evaluate(test_data1, test_label)
```
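
For reference, this is the per-sample format I am assuming `_read_data` has to return, following the numeric-features tutorial in the Kashgari v1.x docs; the tokens, coordinate values, and labels below are made up for illustration:

```python
# One OCR line from a ticket; every feature sequence must be
# token-aligned and the same length as the token list.
tokens  = ['张', '三', '北', '京']    # text_emb input
x1_feat = [3, 3, 7, 7]                # x1_emb input (top-left x, bucketed)
y1_feat = [1, 1, 1, 1]                # y1_emb input (top-left y, bucketed)
x2_feat = [4, 4, 8, 8]                # x2_emb input (bottom-right x, bucketed)
y2_feat = [2, 2, 2, 2]                # y2_emb input (bottom-right y, bucketed)
labels  = ['B-NAME', 'I-NAME', 'B-STATION', 'I-STATION']

# The whole corpus is then a tuple of five parallel lists-of-lists,
# matching (train_data, trainX1, trainY1, trainX2, trainY2) above.
```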

Hello! I am working on an NER task for train tickets, where the entities include departure station, ticket number, passenger name, and so on. With BERT + BiLSTM the results were quite good, around 99% average recall overall.
I then wanted to incorporate position information, using the top-left and bottom-right coordinates of each token from the OCR output. After that, performance became very poor: recall for the name field dropped below 1%, and the other fields dropped sharply as well. What is more, while training on 4,000 samples, the training loss had already fallen so low after only half an epoch that it was being printed in scientific notation, and training accuracy was 100%.
Yet the evaluate results are genuinely bad, and the actual predict results are also very poor.
I would like to know what is going on here. My understanding is that this should not be overfitting, because f1, recall, and precision on the training set are also very low. Finally, thank you for developing this framework; it really is convenient and clean!
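
One thing I am not sure about: if `NumericFeaturesEmbedding(feature_count=16, ...)` treats each feature value as an index into a 16-entry embedding table (the docs tutorial uses `feature_count=2` for a binary flag), then raw pixel coordinates would need to be quantized into the range 0–15 first. A hypothetical bucketing helper for that (the function name and the page-size parameter are made up here, not from my actual code):

```python
def bucket(coord, page_extent, n_bins=16):
    """Map a raw OCR pixel coordinate onto an integer bucket in
    [0, n_bins), so it is a valid index for
    NumericFeaturesEmbedding(feature_count=n_bins).
    `page_extent` is the page width (for x) or height (for y)."""
    return min(int(coord / page_extent * n_bins), n_bins - 1)

# e.g. on a 600-pixel-wide ticket image, x coordinate 412 -> bucket 10
assert bucket(412, 600) == 10
```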

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

GinkgoX commented 3 years ago

Hello, may I ask what the input and output format for a multi-label task looks like?