Tomorrow I will try the official Keras examples with both keras and tf.keras; maybe we will find out why...
Similar to issue https://github.com/BrikerMan/Kashgari/issues/55. Need some help guys, @alexwwang @HaoyuHu
I fixed this bug by using model.fit rather than model.fit_generator. But the issue still remains when using fit_generator. Is it this commit https://github.com/BrikerMan/Kashgari/commit/761e8f7a87e222bfd2d4827b9407a4cde50f527c ...
https://github.com/BrikerMan/Kashgari/blob/761e8f7a87e222bfd2d4827b9407a4cde50f527c/kashgari/tasks/base_model.py#L124
What's the batch size and epoch value for the fit test?
512, demo is here https://colab.research.google.com/drive/17KLJtPPOKBudy59wgIUeT1qqjjghAXPV
In the fit test, each epoch has 41 batches (20864/512), while in the fit_generator test, each epoch contains 64 batches by default, right?
In the fit_generator test, each epoch contains 326 batches (20864/64), right?
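For reference, a quick sketch of the batch arithmetic behind these numbers (20864 is the sample count from the runs above):

import math

n_samples = 20864
print(math.ceil(n_samples / 512))  # 41 batches per epoch with fit(batch_size=512)
print(math.ceil(n_samples / 64))   # 326 batches per epoch with batch_size=64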
Sorry, this reply should be in #104; I moved my comment to #104.
The tf.keras version performs very poorly with the same config; here is my code.
# imports needed to run this snippet (tf.keras version)
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BLSTMModel

train_x, train_y = ChineseDailyNerCorpus.load_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.load_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid', shuffle=False)
train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)
train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]
# tf.keras
embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12',
task=kashgari.LABELING,
sequence_length=100,
layer_nums=1)
# keras
embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12', 100)
model = BLSTMModel(embedding)
model.fit(train_x,
train_y,
valid_x,
valid_y,
batch_size=64,
epochs=10)
model.evaluate(test_x, test_y, batch_size=512)
0.2.4 result
precision recall f1-score support
LOC 0.7268 0.7487 0.7376 199
PER 0.9338 0.9276 0.9307 152
ORG 0.6316 0.7273 0.6761 132
micro avg 0.7598 0.7992 0.7790 483
macro avg 0.7659 0.7992 0.7816 483
tf.keras result
precision recall f1-score support
ORG 0.0065 0.0076 0.0070 132
LOC 0.0485 0.0503 0.0494 199
PER 0.0526 0.0526 0.0526 152
micro avg 0.0371 0.0393 0.0382 483
macro avg 0.0383 0.0393 0.0388 483
@alexwwang
When testing on the full data without the BERT embedding for 10 epochs, here is the result.
# tf.keras
precision recall f1-score support
PER 0.7842 0.8144 0.7990 1794
ORG 0.5817 0.6850 0.6291 2146
LOC 0.7487 0.7780 0.7631 3428
micro avg 0.7040 0.7598 0.7308 7368
macro avg 0.7087 0.7598 0.7328 7368
# keras
precision recall f1-score support
ORG 0.5975 0.6109 0.6041 2146
LOC 0.7287 0.7695 0.7485 3427
PER 0.7375 0.8449 0.7875 1792
micro avg 0.6943 0.7416 0.7172 7365
macro avg 0.6926 0.7416 0.7159 7365
I noticed that you set layer_nums=1 in the tf.keras version. What if you set it to 4? Is it also worse?
@alexwwang I have tried layer_nums=4; it is worse too.
precision recall f1-score support
ORG 0.0145 0.0152 0.0148 132
PER 0.0596 0.0592 0.0594 152
LOC 0.0696 0.0804 0.0746 199
micro avg 0.0520 0.0559 0.0539 483
macro avg 0.0514 0.0559 0.0535 483
It seems a detailed check is needed.
Yes, I need help here. I have checked several times and still got nothing. @alexwwang
I am working on it.
Would you mind doing a test as follows? Build a model with only the BERT embedding and a softmax layer, nothing else added, and run the NER task in these two environments.
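For illustration, a minimal sketch of such a probe model in plain tf.keras, assuming precomputed BERT features; the shapes and tag count are placeholders, not Kashgari's actual API:

import tensorflow as tf

seq_len, hidden_size, num_tags = 100, 768, 8            # placeholder shapes
inputs = tf.keras.Input(shape=(seq_len, hidden_size))   # precomputed BERT embeddings
outputs = tf.keras.layers.Dense(num_tags, activation='softmax')(inputs)  # softmax only
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

The idea is that if even this probe diverges between the two environments, the embedding is the suspect; if it matches, the problem sits downstream.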
Maybe it's a bug in classification_report. If I use sklearn.metrics.classification_report instead, the prediction looks fine.
import logging
import random  # needed for random.sample below
logging.basicConfig(level=logging.DEBUG)
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BLSTMModel
from sklearn.metrics import classification_report
train_x, train_y = ChineseDailyNerCorpus.load_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.load_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid', shuffle=False)
train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)
train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]
embedding = BERTEmbedding('/home/hahahu/projects/models/bert-base-chinese', task=kashgari.LABELING, sequence_length=100, layer_nums=4)
model = BLSTMModel(embedding)
model.fit(train_x, train_y, valid_x, valid_y, batch_size=64, epochs=10)
# model.evaluate(test_x, test_y, batch_size=512, debug_info=True)
y_pred = model.predict(test_x, batch_size=512)
y_true = [seq[:model.embedding.sequence_length] for seq in test_y]
for index in random.sample(list(range(len(test_x))), 5):
logging.debug('------ sample {} ------'.format(index))
logging.debug('x : {}'.format(test_x[index]))
logging.debug('y_true : {}'.format(y_true[index]))
logging.debug('y_pred : {}'.format(y_pred[index]))
print(classification_report(y_true, y_pred, digits=4))
precision recall f1-score support
B-LOC 0.8898 0.9417 0.9150 120
B-ORG 0.8889 0.8511 0.8696 94
B-PER 0.9902 0.9619 0.9758 105
I-LOC 0.8618 0.9298 0.8945 114
I-ORG 0.9011 0.8723 0.8865 94
I-PER 0.9406 0.9500 0.9453 100
O 1.0000 1.0000 1.0000 463
avg / total 0.9489 0.9541 0.9512 1090
import logging
import random  # needed for random.sample below
logging.basicConfig(level=logging.DEBUG)
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChinaPeoplesDailyNerCorpus as ChineseDailyNerCorpus
from kashgari.tasks.seq_labeling import BLSTMModel
from sklearn.metrics import classification_report
train_x, train_y = ChineseDailyNerCorpus.get_sequence_tagging_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.get_sequence_tagging_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.get_sequence_tagging_data('valid', shuffle=False)
train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)
train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]
embedding = BERTEmbedding('/home/hahahu/projects/models/bert-base-chinese', 100)
model = BLSTMModel(embedding)
model.fit(train_x, train_y, valid_x, valid_y, batch_size=64, epochs=10)
# model.evaluate(test_x, test_y, batch_size=512, debug_info=True)
y_pred = model.predict(test_x, batch_size=512)
y_true = [seq[:model.embedding.sequence_length] for seq in test_y]
for index in random.sample(list(range(len(test_x))), 5):
logging.debug('------ sample {} ------'.format(index))
logging.debug('x : {}'.format(test_x[index]))
logging.debug('y_true : {}'.format(y_true[index]))
logging.debug('y_pred : {}'.format(y_pred[index]))
print(classification_report(y_true, y_pred, digits=4))
precision recall f1-score support
B-LOC 0.8926 0.9000 0.8963 120
B-ORG 0.8864 0.8298 0.8571 94
B-PER 0.9902 0.9619 0.9758 105
I-LOC 0.8607 0.9211 0.8898 114
I-ORG 0.9412 0.8511 0.8939 94
I-PER 0.9500 0.9500 0.9500 100
O 1.0000 1.0000 1.0000 463
avg / total 0.9532 0.9450 0.9487 1090
BTW: the current version of sklearn doesn't support the legacy multi-label data representation. We can either use version 0.16.1 or convert y_pred and y_true with sklearn.preprocessing.MultiLabelBinarizer().fit_transform(y).
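A minimal sketch of that conversion on a recent sklearn (illustrative; note that it still compares unordered label sets, which matters below):

from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
mlb = MultiLabelBinarizer().fit(y_true + y_pred)   # learn the full label set
print(classification_report(mlb.transform(y_true), mlb.transform(y_pred),
                            target_names=list(mlb.classes_)))
# Each sequence is treated as an unordered set of labels, so this
# reproduces the misleading all-1.00 report discussed below.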
@HaoyuHu Please help me try this, I have failed to run it with 0.16.1.
from sklearn.metrics import classification_report
y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
classification_report(y_true, y_pred)
What's wrong with this code fragment in sklearn 0.16.1?
Here is the output of that snippet:
precision recall f1-score support
A 1.00 1.00 1.00 1
B 1.00 1.00 1.00 1
O 1.00 1.00 1.00 1
avg / total 1.00 1.00 1.00 3
Some debug details for the tf.keras version:
DEBUG:root:------ sample 65 ------
DEBUG:root:x : ['一', '些', '企', '业', '原', '本', '不', '生', '产', '干', '红', ',', '所', '以', '既', '无', '稳', '定', '的', '资', '源', ',', '又', '无', '可', '靠', '的', '技', '术', ',', '更', '无', '足', '够', '的', '资', '本', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 433 ------
DEBUG:root:x : ['至', '于', '女', '双', ',', '葛', '菲', '/', '顾', '俊', '近', '几', '年', '一', '直', '是', '打', '遍', '天', '下', '无', '敌', '手', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 440 ------
DEBUG:root:x : ['这', '次', '运', '动', '会', '共', '设', '有', '6', '0', '个', '比', '赛', '项', '目', ',', '其', '中', '包', '括', '消', '防', '类', '体', '育', '项', '目', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 301 ------
DEBUG:root:x : ['回', '顾', '过', '去', '的', '艰', '难', '历', '程', ',', '井', '陉', '人', '深', '深', '感', '到', ':', '开', '发', '特', '色', '农', '业', ',', '实', '施', '名', '牌', '战', '略', ',', '是', '一', '项', '系', '统', '工', '程', ',', '需', '要', '政', '府', '引', '导', '、', '部', '门', '协', '作', '和', '农', '民', '群', '众', '的', '广', '泛', '参', '与', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 450 ------
DEBUG:root:x : ['由', '于', '上', '届', '世', '锦', '赛', '战', '绩', '不', '佳', ',', '俄', '罗', '斯', '队', '主', '教', '练', '戈', '麦', '尔', '斯', '基', '将', '每', '一', '个', '对', '手', '都', '视', '为', '劲', '敌', ',', '他', '特', '别', '提', '到', '明', '天', '首', '场', '对', '中', '国', '队', '的', '比', '赛', '会', '很', '艰', '难', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
y_pred and y_true are almost the same.
That classification_report output is wrong... For a sequence labeling task, this sample's precision and recall should both be zero.
So it's unordered, like a multi-label prediction output, without considering the order of each result.
Back to the original problem: is the NER task a multi-label task, or do we compute the most probable label for each character?
In the labeling task, we should compute the most probable label for each input token; labeling order matters.
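For comparison, a position-aware check can be done by flattening the sequences to per-token labels; a minimal sketch, assuming equal-length sequences:

from sklearn.metrics import classification_report

y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
flat_true = [tag for seq in y_true for tag in seq]  # compare token by token
flat_pred = [tag for seq in y_pred for tag in seq]
print(classification_report(flat_true, flat_pred))  # every position differs: all zeros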
I guess it is caused by the data structure passed to classification_report. I'll check the documentation later to figure it out.
I'm sorry about using sklearn.metrics.classification_report here. The random sample output is almost right, but the result of classification_report is weird.
That's cool, man. We need to keep trying new ways to pinpoint the bug.
Yes, but the data structure passed to classification_report seems to imply a multi-label prediction compared against the original label set.
This is good news! So there is a high probability that some bug exists in the accuracy evaluation.
I have tried building a BERT-BLSTM model from scratch; it works fine with both tf 1.13.1 and the tf 2.0 beta.
I am confused about seqeval.classification_report now. What do you mean by "just fine"? Evaluating with seqeval.classification_report?
from seqeval.metrics import classification_report
print(classification_report(y_true, y_pred))
precision recall f1-score support
LOC 0.86 0.88 0.87 3431
ORG 0.73 0.84 0.79 2148
PER 0.92 0.93 0.92 1798
micro avg 0.84 0.88 0.86 7377
macro avg 0.84 0.88 0.86 7377
Is there any difference between the sample code above and the BERT-BLSTM model you built from scratch? Does this mean it is not a problem with classification_report? :+1:
Hi guys, I pinpointed the issue. In the BERT embedding, I add <BOS> and <EOS> tokens to the sequence. When reversing indices back to labels, I remove the <BOS> and <EOS> tokens; during this process, for samples longer than sequence_length, len(y_true[x]) != len(y_pred[x]), which causes a very poor result from classification_report.
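A toy illustration of the mismatch, with assumed lengths:

sequence_length = 100
y_true_x = ['O'] * sequence_length        # label sequence truncated to sequence_length
y_pred_x = ['O'] * (sequence_length - 2)  # model output with <BOS>/<EOS> stripped
print(len(y_true_x) == len(y_pred_x))     # False, so the report degrades badly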
Possible fix; I have tried this in a notebook.
y_pred = model.predict(test_x)
y_true = [seq[:len(y_pred[index])] for index, seq in enumerate(test_y)]
print(classification_report(y_true, y_pred))
The result looks good:
precision recall f1-score support
PER 0.92 0.93 0.92 152
LOC 0.73 0.76 0.75 199
ORG 0.64 0.74 0.69 132
micro avg 0.76 0.81 0.79 483
macro avg 0.77 0.81 0.79 483
Good news guys, fixed. After 10 epochs with batch size 64, here is the result.
embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12',
task=kashgari.LABELING,
sequence_length=100,
layer_nums=4)
model = BLSTMModel(embedding)
model.fit(train_x,
train_y,
valid_x,
valid_y,
batch_size=64,
epochs=10)
model.evaluate(test_x, test_y, batch_size=512)
precision recall f1-score support
LOC 0.9265 0.9370 0.9317 3431
ORG 0.8364 0.8808 0.8580 2147
PER 0.9644 0.9644 0.9644 1797
micro avg 0.9084 0.9273 0.9177 7375
macro avg 0.9095 0.9273 0.9182 7375
Guys, finally fixed the fit_generator issue. Now we can make fit_with_generator the default method for memory saving.
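For anyone curious, the idea behind generator-based training is roughly the following; a minimal sketch with toy arrays, not Kashgari's actual implementation:

import math
import numpy as np

def batch_generator(x, y, batch_size=64):
    # Yield batches forever, as Keras' fit_generator expects; only one
    # batch is materialized in memory at a time.
    while True:
        for i in range(0, len(x), batch_size):
            yield x[i:i + batch_size], y[i:i + batch_size]

x = np.zeros((20864, 100))      # toy stand-ins for preprocessed inputs and labels
y = np.zeros((20864, 100))
steps = math.ceil(len(x) / 64)  # 326, matching the batch count discussed above
# model.fit_generator(batch_generator(x, y), steps_per_epoch=steps, epochs=10)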
Cheers!
Good to know!
Check List
Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.
Environment
Issue Description
I have tried the 0.2.1 version and the tf.keras version on the Chinese NER task, and found that the tf.keras version performs very badly. With 0.2.1 the validation loss decreases during training, but with tf.keras only the training loss decreases.
0.2.1 performance
tf.keras performance
Reproduce
Here are the Colab notebooks for reproducing this issue:
tf.keras-colab
0.2.1 colab