131250208 / TPlinker-joint-extraction

Problems using Chinese samples with BuildData #38

Closed jarork closed 3 years ago

jarork commented 3 years ago

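I converted my Chinese samples into the format the model requires, following the NYT_star data format, but the model cannot label any entities or relations, and ent2id.json and rel2id.json both contain only an empty {}. My data looks like this: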

[
  {
    "text": "急诊胸部CT:临床提示:胸闷头痛3天扫描层厚:5mm影像所示:两下肺少许渗出,两侧胸腔微量积液。无 明显气管、支气管异物;无 明显食管异物;无 气胸、液气胸征象;无 明显纵隔气肿、占位;无 明显心脏、大血管形态改变,无 明显心包积液。(所示肋骨)无 明显肋骨错位性骨折。",
    "triple_list": [
      [ "微量", "修饰", "积液" ],
      [ "两侧", "修饰", "胸腔" ],
      [ "胸腔", "修饰", "积液" ],
      [ "少许", "修饰", "两下肺" ],
      [ "两下肺", "修饰", "渗出" ]
    ]
  },
  ......
]
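A quick sanity check on data in this shape can rule out one class of problems before running BuildData. The sketch below is not part of TPlinker: `check_samples` and `collect_relations` are hypothetical helper names, and the demo text is a shortened excerpt of the sample above. It only verifies that every sample has `text` and `triple_list`, that every triple has three items, and that each subject and object occurs verbatim in the text. It also lists the relation labels one would expect to end up in rel2id.json.

```python
def check_samples(samples):
    """Return a list of human-readable problems found in casrel-style samples."""
    problems = []
    for i, sample in enumerate(samples):
        text = sample.get("text")
        triples = sample.get("triple_list")
        if not isinstance(text, str) or not isinstance(triples, list):
            problems.append(f"sample {i}: missing 'text' or 'triple_list'")
            continue
        for triple in triples:
            if len(triple) != 3:
                problems.append(f"sample {i}: triple {triple!r} does not have 3 items")
                continue
            subj, _rel, obj = triple
            # Subject and object must be literal substrings of the text,
            # otherwise no span can be located for them.
            for role, span in (("subject", subj), ("object", obj)):
                if span not in text:
                    problems.append(f"sample {i}: {role} {span!r} not found in text")
    return problems


def collect_relations(samples):
    """Return the sorted set of relation labels used across all well-formed triples."""
    return sorted(
        {t[1] for s in samples for t in s.get("triple_list", []) if len(t) == 3}
    )


if __name__ == "__main__":
    # Shortened excerpt of the posted sample, for illustration only.
    demo = [
        {
            "text": "两下肺少许渗出,两侧胸腔微量积液。",
            "triple_list": [
                ["微量", "修饰", "积液"],
                ["两侧", "修饰", "胸腔"],
                ["胸腔", "修饰", "积液"],
                ["少许", "修饰", "两下肺"],
                ["两下肺", "修饰", "渗出"],
            ],
        }
    ]
    print(check_samples(demo))
    print(collect_relations(demo))
```

If this check reports nothing and the relation list is non-empty, the raw data itself is probably well-formed, and the cause of the empty mapping files is more likely in the preprocessing configuration.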

My settings in the BuildData config file build_data_config.yaml are:

exp_name: deepwise # nyt_star, nyt, webnlg_star, webnlg, ace05_lu
data_in_dir: ../datasets/ori_data
ori_data_format: casrel # casrel (webnlg_star, nyt_star), etl_span (webnlg), raw_nyt (nyt), tplinker (see readme)

encoder: BERT
bert_path: ../../pretrained_models/chinese-bert-wwm-ext-hit-pytorch-huggingface
data_out_dir: ../datasets/train_data/debugging

add_char_span: true
ignore_subword: true
separate_char_by_white: false
check_tok_span: true
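The symptom described in this issue (empty ent2id.json and rel2id.json) can be confirmed programmatically after a BuildData run. The following is a minimal sketch, not TPlinker code: `mapping_sizes` is a hypothetical helper, and the directory passed to it is an assumption, so point it at whichever folder the run actually wrote its output to.

```python
import json
from pathlib import Path


def mapping_sizes(out_dir):
    """Map each *2id.json file name in out_dir to its number of entries."""
    sizes = {}
    for path in sorted(Path(out_dir).glob("*2id.json")):
        with open(path, encoding="utf-8") as f:
            sizes[path.name] = len(json.load(f))
    return sizes


if __name__ == "__main__":
    import sys

    # Usage: python check_mappings.py <output directory of the BuildData run>
    for name, size in mapping_sizes(sys.argv[1]).items():
        status = "EMPTY" if size == 0 else "ok"
        print(f"{name}: {size} entries ({status})")
```

A size of 0 for either file reproduces the problem reported here. Non-zero sizes suggest the triples were picked up during preprocessing.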

The Chinese BERT model I am using is the wwm model from HIT (Harbin Institute of Technology).

Download link: https://huggingface.co/hfl/chinese-bert-wwm-ext

131250208 commented 3 years ago

@jarork image

jarork commented 3 years ago

@jarork image

Thank you very much. After making the change, the problem is solved.

nlper01 commented 7 months ago

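Is this a public dataset? It looks like CMeIE. Could you share a processed copy of the data, or your data-processing code? My format is the same as yours, but the conversion keeps failing.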