[Question] How to do k-fold evaluation with Kashgari

Jefffish09 commented 4 years ago

Check List

[Y] I have searched in existing issues but did not find the same one.
[Y] I have read the documents

Environment

OS: Windows 10
Python Version: 3.7.3
kashgari 2.0.0a1

Question

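How can I run k-fold cross-validation for an NER task? I tried several times to fit the demo code below into a k-fold loop, but none of my attempts worked. I followed the online article "Evaluate the Performance Of Deep Learning Models in Keras", but I cannot work out how to adapt it for k-fold cross-validation. Could you please take a look?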

Demo code:

from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BiLSTM_Model

train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
test_x, test_y = ChineseDailyNerCorpus.load_data('test')

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)

model.evaluate(test_x, test_y)

model.save('saved_ner_model')

BrikerMan commented 4 years ago

You need to implement it yourself. Remember to check the report object returned by evaluate to pick out the metric you need.

Example:

from sklearn.model_selection import StratifiedKFold
import numpy as np
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Combine all data for k-folding

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')

X = train_x + valid_x + test_x
Y = train_y + valid_y + test_y

# define 10-fold cross validation test harness
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
scores = []

for train_indices, test_indices in k_fold.split(X, Y):
    # Select the samples for this fold by index
    train_x = [X[i] for i in train_indices]
    train_y = [Y[i] for i in train_indices]
    test_x = [X[i] for i in test_indices]
    test_y = [Y[i] for i in test_indices]

    # Build a fresh model for every fold so no weights carry over
    model = BiLSTM_Model()
    model.fit(train_x, train_y, epochs=10)

    report = model.evaluate(test_x, test_y)
    # extract your target metric from report, for example f1
    scores.append(report['f1-score'])

print(f"{np.mean(scores):.2f}  (+/- {np.std(scores):.2f})")
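
Note for the NER case in the question: the example above uses a classification corpus, where each sample has a single label. In sequence labeling each sample's label is a list of tags, so StratifiedKFold is unlikely to work directly and a plain (unstratified) k-fold split is the usual substitute. The following is a minimal sketch of that idea using only the standard library; sklearn.model_selection.KFold should give an equivalent split. The helper names k_fold_indices, cross_validate, and train_and_score are hypothetical, and the Kashgari callback in the trailing comment assumes the labeling report exposes an 'f1-score' key in the same way as the classification example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, n_splits: int,
                   seed: int = 7) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the indices once and return n_splits (train, test) index pairs."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_splits-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(X: Sequence, Y: Sequence,
                   train_and_score: Callable[[list, list, list, list], float],
                   n_splits: int = 5, seed: int = 7) -> List[float]:
    """Call train_and_score once per fold and collect the returned scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits, seed):
        train_x = [X[i] for i in train_idx]
        train_y = [Y[i] for i in train_idx]
        test_x = [X[i] for i in test_idx]
        test_y = [Y[i] for i in test_idx]
        scores.append(train_and_score(train_x, train_y, test_x, test_y))
    return scores


# Hypothetical callback for the NER demo code (assumes Kashgari is installed
# and that the labeling report contains an 'f1-score' entry):
#
# from kashgari.tasks.labeling import BiLSTM_Model
#
# def train_and_score(train_x, train_y, test_x, test_y):
#     model = BiLSTM_Model()  # fresh model for every fold
#     model.fit(train_x, train_y, epochs=10)
#     return model.evaluate(test_x, test_y)['f1-score']
```

With X and Y built by concatenating the train, valid, and test splits of ChineseDailyNerCorpus as in the example above, cross_validate(X, Y, train_and_score, n_splits=10) would return one score per fold, which can then be averaged as before.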

Jefffish09 commented 4 years ago

Perfect! Thanks!

BrikerMan / Kashgari