ZhuiyiTechnology / simbert

A BERT for retrieval and generation
Apache License 2.0

Performance drops after fine-tuning on the LCQMC dataset #19

Open elihuan1990 opened 3 years ago

elihuan1990 commented 3 years ago

I fine-tuned SimBERT on the LCQMC dataset, and the Spearman score on the test set dropped by about one point. What is the right way to fine-tune SimBERT?

bojone commented 2 years ago

You can fine-tune it the Sentence-BERT way.
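In the Sentence-BERT setup, the two sentences are encoded independently, their pooled vectors are compared by cosine similarity, and that similarity is regressed toward the pair's label. A minimal numpy sketch of the objective (the embeddings below are toy vectors, not real SimBERT outputs):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two sentence embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sbert_regression_loss(u, v, label):
    # Sentence-BERT style regression objective: push cos(u, v)
    # toward the 0/1 LCQMC paraphrase label.
    return (cosine_similarity(u, v) - label) ** 2

u = np.array([0.8, 0.1, 0.6])  # embedding of sentence A (toy values)
v = np.array([0.8, 0.1, 0.6])  # embedding of sentence B (toy values)
print(sbert_regression_loss(u, v, 1.0))  # identical embeddings, label 1 -> loss 0.0
```

In practice `u` and `v` would come from the same SimBERT encoder applied to both LCQMC sentences, and the loss would be minimized over the training pairs.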

WenTingTseng commented 1 year ago

After training with simbert.py and saving best_model.weights, how do I load best_model.weights and test it?

```python
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from keras.models import Model
import numpy as np

maxlen = 32  # must match the maxlen used during training

config_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './latest_model.ckpt'
dict_path = '/home/rca/research/simbert/root/kg/bert/chinese_simbert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

bert = build_transformer_model(
    config_path,
    checkpoint_path,
    with_pool='linear',
    application='unilm',
    return_keras_model=False,
)
model = Model(inputs=bert.model.inputs, outputs=bert.model.outputs)
model.load_weights(checkpoint_path, by_name=True)  # by_name=True is required when loading weights

test_sentence = "微信和支付宝哪个好?"

def gen_similar_sentences(text, n=10, k=10):
    similar_sentences = gen_synonyms(text, n, k)  # gen_synonyms still needs to be defined
    return similar_sentences

token_ids, segment_ids = tokenizer.encode(test_sentence, max_length=maxlen)

output_ids = model.predict([np.array([token_ids]), np.array([segment_ids])])
output_ids = output_ids[0].argmax(axis=1)

generated_sentence = tokenizer.decode(output_ids)

print(f"Original sentence: {test_sentence}")
print(f"Generated sentence: {generated_sentence}")
print("Similar sentences:")
similar_sentences = gen_similar_sentences(test_sentence)
for idx, sentence in enumerate(similar_sentences):
    print(f"{idx + 1}. {sentence}")
```

Is this how it should be written?
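For reference, the usual pattern for restoring weights saved with `model.save_weights(...)` is to rebuild the same graph from the pretrained checkpoint and then call a plain `load_weights` on the saved file. A minimal sketch (the paths and the `best_model.weights` filename are assumptions from simbert.py; imports are deferred into the function so the sketch stays self-contained):

```python
def load_simbert(config_path, checkpoint_path, weights_path, dict_path):
    """Rebuild the SimBERT graph from the pretrained checkpoint, then
    overwrite it with the fine-tuned weights saved by simbert.py."""
    from bert4keras.models import build_transformer_model
    from bert4keras.tokenizers import Tokenizer

    tokenizer = Tokenizer(dict_path, do_lower_case=True)
    bert = build_transformer_model(
        config_path,
        checkpoint_path,  # original pretrained checkpoint
        with_pool='linear',
        application='unilm',
        return_keras_model=False,
    )
    model = bert.model
    # Weights written by model.save_weights('best_model.weights') are restored
    # with a plain load_weights; by_name is not needed when the graph matches.
    model.load_weights(weights_path)
    return tokenizer, model
```

A usage call would look like `tokenizer, model = load_simbert(config_path, checkpoint_path, 'best_model.weights', dict_path)`, after which the model can be used for encoding or generation.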

HelenGuohx commented 9 months ago

My approach is simply `from simbert import gen_synonyms`; that way the model is loaded with the new weights.