BrikerMan / Kashgari

Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification, including Word2Vec, BERT, and GPT2 language embeddings.
http://kashgari.readthedocs.io/
Apache License 2.0

[BUG] [tf.keras] BLSTM NER overfitting while 0.2.1 works just fine #96

Closed · BrikerMan closed this issue 5 years ago

BrikerMan commented 5 years ago

Check List

Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

Environment

Issue Description

I have tried the 0.2.1 version and the tf.keras version on the Chinese NER task and found that the tf.keras version performs very badly. With 0.2.1 the validation loss decreases during training, but with tf.keras only the training loss decreases.

0.2.1 performance

Epoch 1/200
41/41 [==============================] - 159s 4s/step - loss: 0.2313 - acc: 0.9385 - val_loss: 0.0699 - val_acc: 0.9772
Epoch 2/200
41/41 [==============================] - 277s 7s/step - loss: 0.0563 - acc: 0.9823 - val_loss: 0.0356 - val_acc: 0.9892
Epoch 3/200
41/41 [==============================] - 309s 8s/step - loss: 0.0361 - acc: 0.9887 - val_loss: 0.0243 - val_acc: 0.9928
Epoch 4/200
41/41 [==============================] - 242s 6s/step - loss: 0.0297 - acc: 0.9905 - val_loss: 0.0228 - val_acc: 0.9927
Epoch 5/200
41/41 [==============================] - 328s 8s/step - loss: 0.0252 - acc: 0.9920 - val_loss: 0.0196 - val_acc: 0.9938
Epoch 6/200
 4/41 [=>............................] - ETA: 4:37 - loss: 0.0234 - acc: 0.9926

tf.keras performance

Epoch 1/200
Epoch 1/200
5/5 [==============================] - 5s 1s/step - loss: 2.3491 - acc: 0.9712
42/42 [==============================] - 115s 3s/step - loss: 2.9824 - acc: 0.9171 - val_loss: 2.3491 - val_acc: 0.9712
Epoch 2/200
5/5 [==============================] - 4s 768ms/step - loss: 2.9726 - acc: 0.9822
42/42 [==============================] - 107s 3s/step - loss: 0.1563 - acc: 0.9952 - val_loss: 2.9726 - val_acc: 0.9822
Epoch 3/200
5/5 [==============================] - 4s 773ms/step - loss: 3.0985 - acc: 0.9833
42/42 [==============================] - 107s 3s/step - loss: 0.0482 - acc: 0.9994 - val_loss: 3.0985 - val_acc: 0.9833
Epoch 4/200
5/5 [==============================] - 4s 771ms/step - loss: 3.2479 - acc: 0.9833
42/42 [==============================] - 107s 3s/step - loss: 0.0247 - acc: 0.9997 - val_loss: 3.2479 - val_acc: 0.9833
Epoch 5/200
5/5 [==============================] - 4s 766ms/step - loss: 3.3612 - acc: 0.9839
42/42 [==============================] - 107s 3s/step - loss: 0.0156 - acc: 0.9998 - val_loss: 3.3612 - val_acc: 0.9839

Reproduce

Here is the Colab notebook for reproducing this issue.

BrikerMan commented 5 years ago

Tomorrow I will try the official Keras examples with both keras and tf.keras; maybe we will find out why...

BrikerMan commented 5 years ago

Similar to issue https://github.com/BrikerMan/Kashgari/issues/55. Need some help guys, @alexwwang @HaoyuHu

BrikerMan commented 5 years ago

I fixed this bug by using model.fit rather than model.fit_generator, but the issue still remains when using fit_generator. Here is the commit: https://github.com/BrikerMan/Kashgari/commit/761e8f7a87e222bfd2d4827b9407a4cde50f527c ... https://github.com/BrikerMan/Kashgari/blob/761e8f7a87e222bfd2d4827b9407a4cde50f527c/kashgari/tasks/base_model.py#L124
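For context, a minimal self-contained sketch of the workaround, with toy data standing in for Kashgari's real tensors (all shapes and names below are assumptions, not the library's internals):

import numpy as np
import tensorflow as tf

# Toy stand-ins for the padded NER inputs and one-hot labels.
x = np.random.random((256, 100, 16)).astype('float32')
y = tf.keras.utils.to_categorical(
    np.random.randint(0, 4, size=(256, 100)), num_classes=4)

model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(8, return_sequences=True), input_shape=(100, 16)),
    tf.keras.layers.Dense(4, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

# Workaround: train on in-memory arrays via model.fit instead of
# streaming batches through model.fit_generator.
model.fit(x, y, batch_size=64, epochs=1, validation_split=0.1)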

alexwwang commented 5 years ago

I fixed this bug by using model.fit rather than using model.fit_generator. But the issue still remains when using fit_generator, is this commit 761e8f7 ...

https://github.com/BrikerMan/Kashgari/blob/761e8f7a87e222bfd2d4827b9407a4cde50f527c/kashgari/tasks/base_model.py#L124

What are the batch size and epoch values for the fit test?

BrikerMan commented 5 years ago

I fixed this bug by using model.fit rather than using model.fit_generator. But the issue still remains when using fit_generator, is this commit 761e8f7 ... https://github.com/BrikerMan/Kashgari/blob/761e8f7a87e222bfd2d4827b9407a4cde50f527c/kashgari/tasks/base_model.py#L124

What are the batch size and epoch values for the fit test?

512, demo is here https://colab.research.google.com/drive/17KLJtPPOKBudy59wgIUeT1qqjjghAXPV

alexwwang commented 5 years ago

In the fit test, each epoch has 41 batches (20864/512), while in the fit_generator test each epoch contains 64 batches by default, right?

alexwwang commented 5 years ago

In the fit_generator test, each epoch contains 326 batches (20864/64), right?
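A quick sanity check on that arithmetic (the sample count of 20864 is taken from the numbers above):

import math

n_samples = 20864

print(math.ceil(n_samples / 512))  # 41 batches per epoch with model.fit
print(math.ceil(n_samples / 64))   # 326 batches per epoch with fit_generator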

BrikerMan commented 5 years ago

Sorry, this reply should be in #104; I have moved my comment there.

BrikerMan commented 5 years ago

The tf.keras version performs very poorly with the same config; here is my code.

import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BLSTMModel

train_x, train_y = ChineseDailyNerCorpus.load_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.load_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid', shuffle=False)

train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)

train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]

# tf.keras version
embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12',
                          task=kashgari.LABELING,
                          sequence_length=100,
                          layer_nums=1)
# keras (0.2.1) version, the equivalent call in the old API:
# embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12', 100)

model = BLSTMModel(embedding)
model.fit(train_x,
          train_y,
          valid_x,
          valid_y,
          batch_size=64,
          epochs=10)

model.evaluate(test_x, test_y, batch_size=512)

0.2.4 result

           precision    recall  f1-score   support

      LOC     0.7268    0.7487    0.7376       199
      PER     0.9338    0.9276    0.9307       152
      ORG     0.6316    0.7273    0.6761       132

micro avg     0.7598    0.7992    0.7790       483
macro avg     0.7659    0.7992    0.7816       483

tf.keras result

           precision    recall  f1-score   support

      ORG     0.0065    0.0076    0.0070       132
      LOC     0.0485    0.0503    0.0494       199
      PER     0.0526    0.0526    0.0526       152

micro avg     0.0371    0.0393    0.0382       483
macro avg     0.0383    0.0393    0.0388       483

@alexwwang

BrikerMan commented 5 years ago

When testing the full data without the BERT embedding for 10 epochs, here are the results.

# tf.keras

           precision    recall  f1-score   support

      PER     0.7842    0.8144    0.7990      1794
      ORG     0.5817    0.6850    0.6291      2146
      LOC     0.7487    0.7780    0.7631      3428

micro avg     0.7040    0.7598    0.7308      7368
macro avg     0.7087    0.7598    0.7328      7368

# keras

           precision    recall  f1-score   support

      ORG     0.5975    0.6109    0.6041      2146
      LOC     0.7287    0.7695    0.7485      3427
      PER     0.7375    0.8449    0.7875      1792

micro avg     0.6943    0.7416    0.7172      7365
macro avg     0.6926    0.7416    0.7159      7365

alexwwang commented 5 years ago

I noticed that you set layer_nums=1 in the tf.keras version. What if you set it to 4? Is it also worse?

BrikerMan commented 5 years ago

@alexwwang I have tried layer_nums=4; it is worse too.

           precision    recall  f1-score   support

      ORG     0.0145    0.0152    0.0148       132
      PER     0.0596    0.0592    0.0594       152
      LOC     0.0696    0.0804    0.0746       199

micro avg     0.0520    0.0559    0.0539       483
macro avg     0.0514    0.0559    0.0535       483

alexwwang commented 5 years ago

It seems a detailed check is needed.

BrikerMan commented 5 years ago

It seems a detailed check is needed.

Yes, I need help here. I have checked several times and still got nothing. @alexwwang

alexwwang commented 5 years ago

I am working on it.

alexwwang commented 5 years ago

Would you mind doing a test as follows? Build a model with only the BERT embedding and a softmax layer, nothing else added, and run the NER task in the two environments.
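A minimal sketch of such a probe in plain tf.keras, with an Input tensor standing in for the BERT sequence output (the 100x768 shape and the class count are assumptions):

import tensorflow as tf

seq_len, embed_dim, num_classes = 100, 768, 8  # assumed sizes

# Placeholder for the BERT token embeddings; only a softmax head on top.
inputs = tf.keras.Input(shape=(seq_len, embed_dim))
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(inputs)

probe = tf.keras.Model(inputs, outputs)
probe.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
probe.summary()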

haoyuhu commented 5 years ago

Maybe it's a bug in classification_report. If I use sklearn.metrics.classification_report instead, the predictions look fine.

tf.keras-version

Code

import logging
import random
logging.basicConfig(level=logging.DEBUG)
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BLSTMModel
from sklearn.metrics import classification_report

train_x, train_y = ChineseDailyNerCorpus.load_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.load_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid', shuffle=False)

train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)

train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]

embedding = BERTEmbedding('/home/hahahu/projects/models/bert-base-chinese', task=kashgari.LABELING, sequence_length=100, layer_nums=4)
model = BLSTMModel(embedding)

model.fit(train_x, train_y, valid_x, valid_y, batch_size=64, epochs=10)

# model.evaluate(test_x, test_y, batch_size=512, debug_info=True)
y_pred = model.predict(test_x, batch_size=512)
y_true = [seq[:model.embedding.sequence_length] for seq in test_y]

for index in random.sample(list(range(len(test_x))), 5):
    logging.debug('------ sample {} ------'.format(index))
    logging.debug('x      : {}'.format(test_x[index]))
    logging.debug('y_true : {}'.format(y_true[index]))
    logging.debug('y_pred : {}'.format(y_pred[index]))

print(classification_report(y_true, y_pred, digits=4))

Output

             precision    recall  f1-score   support

      B-LOC     0.8898    0.9417    0.9150       120
      B-ORG     0.8889    0.8511    0.8696        94
      B-PER     0.9902    0.9619    0.9758       105
      I-LOC     0.8618    0.9298    0.8945       114
      I-ORG     0.9011    0.8723    0.8865        94
      I-PER     0.9406    0.9500    0.9453       100
          O     1.0000    1.0000    1.0000       463

avg / total     0.9489    0.9541    0.9512      1090

0.2.4

Code

import logging
import random
logging.basicConfig(level=logging.DEBUG)
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.corpus import ChinaPeoplesDailyNerCorpus as ChineseDailyNerCorpus
from kashgari.tasks.seq_labeling import BLSTMModel
from sklearn.metrics import classification_report

train_x, train_y = ChineseDailyNerCorpus.get_sequence_tagging_data('train', shuffle=False)
test_x, test_y = ChineseDailyNerCorpus.get_sequence_tagging_data('test', shuffle=False)
valid_x, valid_y = ChineseDailyNerCorpus.get_sequence_tagging_data('valid', shuffle=False)

train_count = int(len(train_y)*0.1)
test_count = int(len(test_y)*0.1)
valid_count = int(len(valid_x)*0.1)

train_x, train_y = train_x[:train_count], train_y[:train_count]
test_x, test_y = test_x[:test_count], test_y[:test_count]
valid_x, valid_y = valid_x[:valid_count], valid_y[:valid_count]

embedding = BERTEmbedding('/home/hahahu/projects/models/bert-base-chinese', 100)
model = BLSTMModel(embedding)

model.fit(train_x, train_y, valid_x, valid_y, batch_size=64, epochs=10)

# model.evaluate(test_x, test_y, batch_size=512, debug_info=True)
y_pred = model.predict(test_x, batch_size=512)
y_true = [seq[:model.embedding.sequence_length] for seq in test_y]

for index in random.sample(list(range(len(test_x))), 5):
    logging.debug('------ sample {} ------'.format(index))
    logging.debug('x      : {}'.format(test_x[index]))
    logging.debug('y_true : {}'.format(y_true[index]))
    logging.debug('y_pred : {}'.format(y_pred[index]))

print(classification_report(y_true, y_pred, digits=4))

Output

             precision    recall  f1-score   support

      B-LOC     0.8926    0.9000    0.8963       120
      B-ORG     0.8864    0.8298    0.8571        94
      B-PER     0.9902    0.9619    0.9758       105
      I-LOC     0.8607    0.9211    0.8898       114
      I-ORG     0.9412    0.8511    0.8939        94
      I-PER     0.9500    0.9500    0.9500       100
          O     1.0000    1.0000    1.0000       463

avg / total     0.9532    0.9450    0.9487      1090

BTW: the current version of sklearn doesn't support the legacy multi-label data representation. We can use version 0.16.1, or convert y_pred and y_true with sklearn.preprocessing.MultiLabelBinarizer().fit_transform(y).
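A sketch of that conversion (the label sequences here are illustrative only). Note that the binarized form treats each sequence as an unordered set of labels, which matters for the discussion below:

from sklearn.preprocessing import MultiLabelBinarizer

y_true = [['O', 'B-PER', 'I-PER'], ['O', 'B-LOC']]  # illustrative tag sequences
y_pred = [['O', 'B-PER', 'O'], ['B-LOC', 'O']]

mlb = MultiLabelBinarizer().fit(y_true + y_pred)  # learn the full label set
y_true_bin = mlb.transform(y_true)  # (n_samples, n_labels) 0/1 indicator matrix
y_pred_bin = mlb.transform(y_pred)  # order within each sequence is lost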

BrikerMan commented 5 years ago

@HaoyuHu Please help me try this; I have failed to run it with 0.16.1.

from sklearn.metrics import classification_report
y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
classification_report(y_true, y_pred)

alexwwang commented 5 years ago

What's wrong with this code fragment in sklearn 0.16.1?

haoyuhu commented 5 years ago

@HaoyuHu Please help me try this, I have failed to run it with 0.16.1.

from sklearn.metrics import classification_report
y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
classification_report(y_true, y_pred)
             precision    recall  f1-score   support

          A       1.00      1.00      1.00         1
          B       1.00      1.00      1.00         1
          O       1.00      1.00      1.00         1

avg / total       1.00      1.00      1.00         3

haoyuhu commented 5 years ago

Some debug details for the tf.keras version:

DEBUG:root:------ sample 65 ------
DEBUG:root:x      : ['一', '些', '企', '业', '原', '本', '不', '生', '产', '干', '红', ',', '所', '以', '既', '无', '稳', '定', '的', '资', '源', ',', '又', '无', '可', '靠', '的', '技', '术', ',', '更', '无', '足', '够', '的', '资', '本', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 433 ------
DEBUG:root:x      : ['至', '于', '女', '双', ',', '葛', '菲', '/', '顾', '俊', '近', '几', '年', '一', '直', '是', '打', '遍', '天', '下', '无', '敌', '手', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 440 ------
DEBUG:root:x      : ['这', '次', '运', '动', '会', '共', '设', '有', '6', '0', '个', '比', '赛', '项', '目', ',', '其', '中', '包', '括', '消', '防', '类', '体', '育', '项', '目', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 301 ------
DEBUG:root:x      : ['回', '顾', '过', '去', '的', '艰', '难', '历', '程', ',', '井', '陉', '人', '深', '深', '感', '到', ':', '开', '发', '特', '色', '农', '业', ',', '实', '施', '名', '牌', '战', '略', ',', '是', '一', '项', '系', '统', '工', '程', ',', '需', '要', '政', '府', '引', '导', '、', '部', '门', '协', '作', '和', '农', '民', '群', '众', '的', '广', '泛', '参', '与', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:------ sample 450 ------
DEBUG:root:x      : ['由', '于', '上', '届', '世', '锦', '赛', '战', '绩', '不', '佳', ',', '俄', '罗', '斯', '队', '主', '教', '练', '戈', '麦', '尔', '斯', '基', '将', '每', '一', '个', '对', '手', '都', '视', '为', '劲', '敌', ',', '他', '特', '别', '提', '到', '明', '天', '首', '场', '对', '中', '国', '队', '的', '比', '赛', '会', '很', '艰', '难', '。']
DEBUG:root:y_true : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
DEBUG:root:y_pred : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
haoyuhu commented 5 years ago

y_pred and y_true are almost the same.

BrikerMan commented 5 years ago

@HaoyuHu Please help me try this, I have failed to run it with 0.16.1.

from sklearn.metrics import classification_report
y_true = [['O', 'A', 'B']]
y_pred = [['A', 'B', 'O']]
classification_report(y_true, y_pred)
             precision    recall  f1-score   support

          A       1.00      1.00      1.00         1
          B       1.00      1.00      1.00         1
          O       1.00      1.00      1.00         1

avg / total       1.00      1.00      1.00         3

This is wrong... For a sequence labeling task, this sample's precision and recall should both be zero.
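A position-wise check, which is what a sequence labeling metric should do, makes the point (minimal sketch):

y_true = ['O', 'A', 'B']
y_pred = ['A', 'B', 'O']

# No token matches at its own position, so token-level accuracy is 0.0,
# yet the set-based classification_report above scores everything 1.00.
matches = sum(t == p for t, p in zip(y_true, y_pred))
print(matches / len(y_true))  # 0.0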

alexwwang commented 5 years ago

So it treats the output as an unordered multi-label prediction, without considering the position of each result.

Back to the original problem: is the NER task a multi-label task, or one that computes the most probable label for each character?

BrikerMan commented 5 years ago

So it treats the output as an unordered multi-label prediction, without considering the position of each result. Back to the original problem: is the NER task a multi-label task, or one that computes the most probable label for each character?

In the labeling task, we should compute the most probable label for each input token; label order matters.

alexwwang commented 5 years ago

I guess it is caused by the data structure passed to classification_report. I'll check the docs later to figure it out.

haoyuhu commented 5 years ago

I'm sorry about using sklearn.metrics.classification_report here. The random sample output is almost right, but the result of classification_report is weird.

BrikerMan commented 5 years ago

I'm sorry about using sklearn.metrics.classification_report here. The random sample output is almost right, but the result of classification_report is weird.

That's cool, man. We need to keep trying new ways to pinpoint the bug.

alexwwang commented 5 years ago

Yes, but the data structure passed to classification_report seems to imply a multi-label prediction compared against the original as an unordered set.

alexwwang commented 5 years ago

This is good news! So there is a high probability that the bug lies in the accuracy evaluation.

BrikerMan commented 5 years ago

I have tried building a BERT-BLSTM model from scratch; it works just fine with tf 1.13.1 and tf 2.0 beta.

alexwwang commented 5 years ago

I am confused by seqeval.classification_report now. What do you mean by 'just fine'? Evaluating with seqeval.classification_report?

BrikerMan commented 5 years ago

from seqeval.metrics import classification_report
print(classification_report(y_true, y_pred))

           precision    recall  f1-score   support

      LOC       0.86      0.88      0.87      3431
      ORG       0.73      0.84      0.79      2148
      PER       0.92      0.93      0.92      1798

micro avg       0.84      0.88      0.86      7377
macro avg       0.84      0.88      0.86      7377

haoyuhu commented 5 years ago

I have tried building a BERT-BLSTM model from scratch; it works just fine with tf 1.13.1 and tf 2.0 beta.

Is there any difference between the sample code above and the BERT-BLSTM model you built from scratch? Does this mean that it is not a problem with classification_report? :+1:

BrikerMan commented 5 years ago

Hi guys, I pinpointed the issue. In the BERT embedding, I add <BOS> and <EOS> tokens to the sequence. When reversing indices back to labels, I remove the <BOS> and <EOS> tokens; for samples longer than sequence_length, this makes len(y_true[x]) != len(y_pred[x]), and that length mismatch produces a very poor result from classification_report.
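To illustrate the mismatch with toy numbers (the exact arithmetic here is an assumption based on the description above, and the 120-token sample length is invented for the example):

sequence_length = 100  # model input length, including <BOS> and <EOS>
sample_length = 120    # a gold sequence longer than sequence_length

len_y_pred = sequence_length - 2                  # <BOS>/<EOS> stripped -> 98
len_y_true = min(sample_length, sequence_length)  # gold truncated -> 100

print(len_y_pred, len_y_true)  # 98 100: misaligned pairs wreck the report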

Possible fix; I have tried this in the notebook.

y_pred = model.predict(test_x)
# Truncate each gold sequence to its predicted length so that
# len(y_true[i]) == len(y_pred[i]) for every sample.
y_true = [seq[:len(y_pred[index])] for index, seq in enumerate(test_y)]

print(classification_report(y_true, y_pred))

The results look good:

           precision    recall  f1-score   support

      PER       0.92      0.93      0.92       152
      LOC       0.73      0.76      0.75       199
      ORG       0.64      0.74      0.69       132

micro avg       0.76      0.81      0.79       483
macro avg       0.77      0.81      0.79       483

BrikerMan commented 5 years ago

Good news guys, it's fixed. After 10 epochs with batch size 64, here is the result.

embedding = BERTEmbedding('/input0/BERT/chinese_L-12_H-768_A-12',
                          task=kashgari.LABELING,
                          sequence_length=100,
                          layer_nums=4)
model = BLSTMModel(embedding)
model.fit(train_x,
          train_y,
          valid_x,
          valid_y,
          batch_size=64,
          epochs=10)
model.evaluate(test_x, test_y, batch_size=512)

           precision    recall  f1-score   support

      LOC     0.9265    0.9370    0.9317      3431
      ORG     0.8364    0.8808    0.8580      2147
      PER     0.9644    0.9644    0.9644      1797

micro avg     0.9084    0.9273    0.9177      7375
macro avg     0.9095    0.9273    0.9182      7375

BrikerMan commented 5 years ago

Guys, the fit_generator issue is finally fixed. Now we can make fit_with_generator the default method to save memory.

haoyuhu commented 5 years ago

Cheers!

alexwwang commented 5 years ago

Good to know!
