131250208 / TPlinker-joint-extraction

Relation extraction performs very poorly on a custom dataset #61

Closed: Wonderson-wpp closed this issue 2 years ago

Wonderson-wpp commented 2 years ago
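    Hello, I applied your tplinker and TPLinker_plus (mainly the latter) to my lab's biomedical literature dataset. The entity classification metrics are all high, but every relation classification metric is only around 0.1.
    The dataset is not small (4,000 samples), and training ran for about 40 epochs. Since my batch_size is fairly large, I assume it has converged. Below are my TPLinker_plus training parameters. Do you have any suggestions?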
        "shaking_type": "cln_plus",
        "inner_enc_type": "lstm",
        "match_pattern": "whole_text",

        "data_home": "../data4bert",
        "bert_path": "../../pretrained_models/bert-base-cased",
        "hyper_parameters": {
            "lr": 5e-5,
        },

        "batch_size": 32,
        "epochs": 200,
        "seed": 2333,
        "log_interval": 10,
        "max_seq_len": 128,
        "sliding_len": 20,
        "scheduler": "CAWR", # Step

Here is one sample from our dataset:

{"text": "MicroRNA-873 acts as a tumor suppressor in esophageal cancer by inhibiting differentiated embryonic chondrocyte expressed gene 2.", "id": "train_0", "relation_list": [{"subject": "MicroRNA-873", "object": "esophageal cancer", "subj_char_span": [0, 12], "obj_char_span": [43, 60], "predicate": "/Gene/Cancer/tumor_suppressor", "subj_tok_span": [0, 5], "obj_tok_span": [12, 17]}], "entity_list": [{"text": "MicroRNA-873", "type": "Gene", "char_span": [0, 12], "tok_span": [0, 5]}, {"text": "esophageal cancer", "type": "Cancer", "char_span": [43, 60], "tok_span": [12, 17]}]}
131250208 commented 2 years ago

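It is hard for me to give advice about training results on a custom dataset, since I have not tried it myself. With a larger batch size, convergence is slower and more epochs are needed, so check whether the metrics are still rising after a few more epochs. Apart from the hyperparameters, the preprocessing may also be at fault. You can also analyze the error cases to find the specific cause.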
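
As one way to start on the preprocessing check mentioned above, here is a minimal sketch that verifies the character spans in a sample against its text. It assumes the sample format shown in the question (`text`, `entity_list`, `relation_list`, `char_span`, `subj_char_span`, `obj_char_span`). The function name `find_span_mismatches` is hypothetical and is not part of TPLinker. It does not check the `tok_span` fields, which would need the same tokenizer that was used for preprocessing.

```python
import json


def find_span_mismatches(sample):
    """Return a list of messages for spans that do not match the text.

    Hypothetical helper, not part of TPLinker. Assumes char spans are
    [start, end) offsets into sample["text"], as in the sample above.
    """
    text = sample["text"]
    problems = []

    # Every entity's char_span should slice out exactly the entity text.
    for ent in sample.get("entity_list", []):
        start, end = ent["char_span"]
        if text[start:end] != ent["text"]:
            problems.append(
                f"entity {ent['text']!r}: span {start}-{end} gives {text[start:end]!r}"
            )

    # Same check for the subject and object of every relation.
    for rel in sample.get("relation_list", []):
        for role, key in (("subject", "subj_char_span"), ("object", "obj_char_span")):
            start, end = rel[key]
            if text[start:end] != rel[role]:
                problems.append(
                    f"{role} {rel[role]!r}: span {start}-{end} gives {text[start:end]!r}"
                )
    return problems


# The sample from the question, with the tok_span fields left out.
sample = {
    "text": "MicroRNA-873 acts as a tumor suppressor in esophageal cancer by "
            "inhibiting differentiated embryonic chondrocyte expressed gene 2.",
    "id": "train_0",
    "relation_list": [{
        "subject": "MicroRNA-873", "object": "esophageal cancer",
        "subj_char_span": [0, 12], "obj_char_span": [43, 60],
        "predicate": "/Gene/Cancer/tumor_suppressor",
    }],
    "entity_list": [
        {"text": "MicroRNA-873", "type": "Gene", "char_span": [0, 12]},
        {"text": "esophageal cancer", "type": "Cancer", "char_span": [43, 60]},
    ],
}

print(find_span_mismatches(sample))  # prints []
```

Running this over the whole training file and counting the samples with a non-empty result would show whether span errors are widespread or limited to a few records.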

Wonderson-wpp commented 2 years ago

Thank you for your reply.