Onion12138 / CasRelPyTorch

Reimplement CasRel model in PyTorch. A PyTorch reproduction of the CasRel model from Jilin University, trained and tested on the Baidu relation extraction dataset.
180 stars 26 forks

Can this reach the accuracy reported in the CasRel paper when switched to an English dataset? #1

Closed iamxiongwei163 closed 3 years ago

iamxiongwei163 commented 3 years ago

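Hello, I have a question: after switching the dataset to WebNLG, my recall is much lower than in the paper. Is this because you changed the encoding and decoding scheme?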

Onion12138 commented 3 years ago

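Of course, because Chinese is split by character while English is split into subwords. I did not handle any subword-splitting logic, so both precision and recall on English datasets will be affected. But shouldn't precision be affected more?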

iamxiongwei163 commented 3 years ago

In my run, recall is only 73, versus 90 in the paper. Precision, on the other hand, is about the same as in the paper.

iamxiongwei163 commented 3 years ago

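For English datasets you need to add '[unused]' between the word pieces so that the model does not confuse word boundaries. After adding it, the accuracy matches the paper.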
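The comment above does not show the exact change, so the following is only a minimal sketch of one plausible reading: append a reserved marker token after the word pieces of each whitespace-separated word. The marker name `[unused1]`, the function `tokenize_with_boundaries`, and the stand-in tokenizer `toy_wordpiece` are assumptions made for illustration and are not taken from this repository.

```python
from typing import Callable, List

# Assumption: the thread only says '[unused]'. BERT vocabularies reserve
# tokens such as [unused1], so that name is used here as the marker.
BOUNDARY = "[unused1]"


def tokenize_with_boundaries(
    text: str,
    tokenize_word: Callable[[str], List[str]],
    boundary: str = BOUNDARY,
) -> List[str]:
    """Tokenize each whitespace-separated word, then append a boundary marker."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(tokenize_word(word))  # word pieces of this word
        tokens.append(boundary)             # marks where the word ends
    return tokens


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Stand-in for a real WordPiece tokenizer: fixed-size chunks with '##'."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


if __name__ == "__main__":
    print(tokenize_with_boundaries("FC Köln has members", toy_wordpiece))
```

In practice `tokenize_word` would be the word-piece tokenizer of whichever BERT vocabulary is in use. Note that the markers shift every token index, so the subject and object spans in the training data have to be computed from the same token sequence.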

2019hong commented 3 years ago

@iamxiongwei163 Hello! Sorry to bother you! I have recently been learning about relation extraction and am trying to run the WebNLG dataset on the author's model, but my F1 score is only around 0.08. Do you know where the problem might be? If possible, could you give me some guidance and suggestions? Sorry for the trouble; I have had no leads at all. Could it be a problem with my dataset? I also tried adjusting the parameters, but that did not seem to help.

image

One record from the dataset:

{"text": "Peter Stöger is manager of 1 . FC Köln which has 50000 members and participated in the 2014 season .", "id": "train_0", "spo_list": [{"subject": "1 . FC Köln", "object": "Peter Stöger", "subj_char_span": [27, 38], "obj_char_span": [0, 12], "predicate": "manager", "subj_tok_span": [7, 13], "obj_tok_span": [0, 4]}, {"subject": "1 . FC Köln", "object": "2014", "subj_char_span": [27, 38], "obj_char_span": [87, 91], "predicate": "season", "subj_tok_span": [7, 13], "obj_tok_span": [22, 23]}, {"subject": "1 . FC Köln", "object": "50000", "subj_char_span": [27, 38], "obj_char_span": [49, 54], "predicate": "numberOfMembers", "subj_tok_span": [7, 13], "obj_tok_span": [15, 17]}], "entity_list": [{"text": "1 . FC Köln", "type": "DEFAULT", "char_span": [27, 38], "tok_span": [7, 13]}, {"text": "Peter Stöger", "type": "DEFAULT", "char_span": [0, 12], "tok_span": [0, 4]}, {"text": "1 . FC Köln", "type": "DEFAULT", "char_span": [27, 38], "tok_span": [7, 13]}, {"text": "2014", "type": "DEFAULT", "char_span": [87, 91], "tok_span": [22, 23]}, {"text": "1 . FC Köln", "type": "DEFAULT", "char_span": [27, 38], "tok_span": [7, 13]}, {"text": "50000", "type": "DEFAULT", "char_span": [49, 54], "tok_span": [15, 17]}]}

Onion12138 commented 3 years ago

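English is tokenized differently, so you need to rewrite the decoding logic yourself. See whether approaching it from that angle solves the problem.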
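What such decoding logic might look like is not spelled out in the thread. As a rough sketch only, assuming BERT-style word pieces with a `##` continuation prefix, an optional word-boundary marker named `[unused1]`, and predicted spans given as inclusive token indices, the surface text of an entity could be rebuilt as follows. The function name `decode_span` and the marker name are hypothetical.

```python
from typing import List

# Assumption: a reserved token used as a word-boundary marker, if any.
BOUNDARY = "[unused1]"


def decode_span(tokens: List[str], start: int, end: int,
                boundary: str = BOUNDARY) -> str:
    """Rebuild text from tokens[start..end] (end inclusive).

    '##' pieces are glued to the previous piece, boundary markers are
    dropped, and the resulting words are joined with single spaces.
    """
    words: List[str] = []
    current = ""
    for tok in tokens[start:end + 1]:
        if tok == boundary:
            if current:                 # a boundary closes the current word
                words.append(current)
                current = ""
        elif tok.startswith("##"):
            current += tok[2:]          # continuation piece: strip the prefix
        else:
            if current:                 # a new word starts without a marker
                words.append(current)
            current = tok
    if current:
        words.append(current)
    return " ".join(words)


if __name__ == "__main__":
    toks = ["Peter", "[unused1]", "St", "##ög", "##er", "[unused1]", "is", "[unused1]"]
    print(decode_span(toks, 0, 4))
```

Joining characters directly, as is natural for character-level Chinese input, would leave `##` fragments and markers in the predicted entity strings, and any such string would fail an exact match against the gold triples.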

iamxiongwei163 commented 3 years ago

That dataset is not the one provided with the original CasRel paper, is it? Try searching for the dataset released with the CasRel model.

2019hong commented 3 years ago

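@iamxiongwei163 Thank you!!! I tried the WebNLG dev.json, train.json, and test.json files from the CasRel source code, as they are before the build_data step, and got an F1 score of 76%! However, that result was obtained after I tried adding '[unused]' as you described. Is this roughly the right way to add '[unused]'? Sorry to bother you!

image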

dr-imp commented 2 years ago

Have you worked out what changes are needed to make it work on English data? Any pointers would be appreciated!