131250208 / TPlinker-joint-extraction

What is the difference between the input data format of TPLinker Plus and that of TPLinker? #67

Closed AI-Mart closed 2 years ago

AI-Mart commented 2 years ago

Great work by the author. I have a question about Baidu's Chinese dataset: what is the difference between the input data format of TPLinker Plus and that of TPLinker? Is it enough to replace {"DEFAULT": 0} in TPLinker's ent2id.json with the actual entity types? The TPLinker training data is shown below. If I replace "type": "DEFAULT" in it with each entity's actual type, does that give the TPLinker Plus input format?

[{"text": "《邪少兵王》是冰火未央写的网络小说连载于旗峰天下", "id": "train_0", "relation_list": [{"subject": "邪少兵王", "object": "冰火未央", "subj_char_span": [1, 5], "obj_char_span": [7, 11], "predicate": "作者", "subj_tok_span": [1, 5], "obj_tok_span": [7, 11]}], "entity_list": [{"text": "邪少兵王", "type": "DEFAULT", "char_span": [1, 5], "tok_span": [1, 5]}, {"text": "冰火未央", "type": "DEFAULT", "char_span": [7, 11], "tok_span": [7, 11]}]}, {"text": "GV-971由中国海洋大学、中国科学院上海药物研究所(下称“上海药物所”)和上海绿谷制药有限公司(下称“绿谷制药”)联合研发,不同于传统靶向抗体药物,GV-971是从海藻中提取的海洋寡糖类分子", "id": "train_1", "relation_list": [{"subject": "中国科学院上海药物研究所", "object": "上海药物所", "subj_char_span": [14, 26], "obj_char_span": [30, 35], "predicate": "简称", "subj_tok_span": [12, 24], "obj_tok_span": [28, 33]}], "entity_list": [{"text": "中国科学院上海药物研究所", "type": "DEFAULT", "char_span": [14, 26], "tok_span": [12, 24]}, {"text": "上海药物所", "type": "DEFAULT", "char_span": [30, 35], "tok_span": [28, 33]}]}]

131250208 commented 2 years ago

15

AI-Mart commented 2 years ago

Thanks for the reply. Does the data format below work for both TPLinker and TPLinker Plus, or only for TPLinker Plus? For TPLinker, do I need to change the type in "entity_list": [{"text": "兰陵王", "type": "影视作品", to DEFAULT?

{"id": "valid_6062", "text": "2013年,林依晨拍摄古装剧《兰陵王》获得极高好评,在此之后,林依晨似乎是“转战”电影行业,至今为止只拍摄了几部电影", "relation_list": [{"subject": "兰陵王", "object": "林依晨", "subj_char_span": [15, 18], "obj_char_span": [6, 9], "predicate": "主演", "subj_tok_span": [12, 15], "obj_tok_span": [3, 6]}, {"subject": "兰陵王", "object": "林依晨", "subj_char_span": [15, 18], "obj_char_span": [31, 34], "predicate": "主演", "subj_tok_span": [12, 15], "obj_tok_span": [28, 31]}], "entity_list": [{"text": "兰陵王", "type": "影视作品", "char_span": [15, 18], "tok_span": [12, 15]}, {"text": "林依晨", "type": "人物", "char_span": [6, 9], "tok_span": [3, 6]}, {"text": "林依晨", "type": "人物", "char_span": [31, 34], "tok_span": [28, 31]}]}

131250208 commented 2 years ago

The current TPLinker code does not support entity types. You can add the entity-type labels yourself.
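
For anyone preparing the same data for both setups, here is a minimal sketch of the two type-handling steps discussed in this thread. It assumes the sample layout shown above. The helper names `collapse_entity_types` and `build_ent2id` are hypothetical and not part of the TPLinker repository, and whether the result matches what each training script expects should be checked against the repo's own preprocessing code.

```python
import copy


def collapse_entity_types(samples, default_type="DEFAULT"):
    """Return a deep copy of samples with every entity type set to default_type.

    Hypothetical helper: produces the untyped variant seen in the first example.
    """
    collapsed = copy.deepcopy(samples)
    for sample in collapsed:
        for entity in sample.get("entity_list", []):
            entity["type"] = default_type
    return collapsed


def build_ent2id(samples):
    """Map each distinct entity type to an integer id, in order of first appearance.

    Hypothetical helper: the {"DEFAULT": 0} layout is taken from the question above.
    """
    ent2id = {}
    for sample in samples:
        for entity in sample.get("entity_list", []):
            ent2id.setdefault(entity["type"], len(ent2id))
    return ent2id


# A trimmed version of the typed sample from this thread.
samples = [{
    "id": "valid_6062",
    "entity_list": [
        {"text": "兰陵王", "type": "影视作品", "char_span": [15, 18], "tok_span": [12, 15]},
        {"text": "林依晨", "type": "人物", "char_span": [6, 9], "tok_span": [3, 6]},
    ],
}]

typed_ent2id = build_ent2id(samples)              # one id per real entity type
default_samples = collapse_entity_types(samples)  # every type becomes "DEFAULT"
default_ent2id = build_ent2id(default_samples)    # {"DEFAULT": 0}
```

The input list is left untouched, so the typed and untyped variants can be written to separate files from the same source.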