Experiments on English datasets such as WikiEvents

xxllp commented 2 years ago

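I'm planning to run experiments on an English dataset. Has the author obtained results on WikiEvents? I ask because the pretrained model names in the scripts are all Chinese ones.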

Spico197 commented 2 years ago

WikiEvents support was added later, but I have not tuned the hyperparameters. Just switch the pretrained model to an English one.

https://github.com/Spico197/DocEE/blob/main/scripts/run_ptpcg_wikievents_wTgg.sh

xxllp commented 2 years ago

I switched to an English model, but data loading raised an error:

   inlcude_complementary_ents=self.include_complementary_ents_flag,
  File "/data/xxl/DocEE/dee/helper/dee.py", line 143, in __init__
    annguid, mspan, str(sent_mrange), sent_text
ValueError: GUID: scenario_en_kairos_14 span range is not correct, span=Prayuth Chan - ocha, range=(11, 15), sent=['[UNK]', 's', '[UNK]', 'o', 'f', '[UNK]', 'e', 'a', 'r', 'l', 'y', '[UNK]', '[UNK]', 'u', 'e', 's', 'd', 'a', 'y', '[UNK]', 't', 'h', 'e', 'r', 'e', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', '[UNK]', 'c', 'l', 'a', 'i', 'm', '[UNK]', 'o', 'f', '[UNK]', 'r', 'e', 's', 'p', 'o', 'n', 's', 'i', 'b', 'i', 'l', 'i', 't', 'y', '[UNK]', '.', '[UNK]', '[UNK]', 'r', 'a', 'y', 'u', 't', 'h', '[UNK]', '[UNK]', 'h', 'a', 'n', '[UNK]', '-', '[UNK]', 'o', 'c', 'h', 'a', '[UNK]', ',', '[UNK]', 't', 'h', 'e', '[UNK]', 'h', 'e', 'a', 'd', '[UNK]', 'o', 'f', '[UNK]', '[UNK]', 'h', 'a', 'i', 'l', 'a', 'n', 'd', '[UNK]', '’', '[UNK]', 's', '[UNK]', 'm', 'i', 'l', 'i', 't', 'a', 'r', 'y', '[UNK]', 'g', 'o', 'v', 'e', 'r', 'n', 'm', 'e', 'n', 't', '[UNK]', ',', '[UNK]', 's', 'a', 'i', 'd', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 't', 'h', 'e', '[UNK]', 'a', 'u', 't', 'h', 'o', 'r', 'i', 't', 'i', 'e', 's', '[UNK]', 'w', 'e', 'r', 'e', '[UNK]', 's', 'e', 'a', 'r', 'c', 'h', 'i', 'n', 'g', '[UNK]', 'f', 'o', 'r', '[UNK]', 'a', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 's', 'e', 'e', 'n', '[UNK]', 'o', 'n', '[UNK]', 'c', 'l', 'o', 's', 'e', 'd', '[UNK]', '-', '[UNK]', 'c', 'i', 'r', 'c', 'u', 'i', 't', '[UNK]', 'f', 'o', 'o', 't', 'a', 'g', 'e', '[UNK]', 'b', 'u', 't', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 'i', 't', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', 't', '[UNK]', 'c', 'l', 'e', 'a', 'r', '[UNK]', 'w', 'h', 'o', '[UNK]', 't', 'h', 'e', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 'w', 'a', 's', '[UNK]', ',', '[UNK]', 'n', 'e', 'w', 's', '[UNK]', 'a', 'g', 'e', 'n', 'c', 'i', 'e', 's', '[UNK]', 'r', 'e', 'p', 'o', 'r', 't', 'e', 'd', '[UNK]', '.']

xxllp commented 2 years ago

It looks like the words are being split into individual letters.

Spico197 commented 2 years ago

When run_mode is wikievents_w_tgg, doc_lang is en, and tokenization splits on whitespace by default. Is your dee package version 0.3.2?

https://github.com/Spico197/DocEE/blob/d6b585e29e5908b891e765066b96ff7642587e5a/dee/utils.py#L144-L157
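
To illustrate why the span check fails, here is a minimal sketch (not the repository's code) that compares character-level and whitespace tokenization on the start of the sentence from the error above. The helper names are hypothetical, and the sentence text is an assumption: it is reconstructed from the token dump, where spaces and uppercase letters appear as `[UNK]`.

```python
# Hypothetical helpers for illustration; these are not functions from dee.
def char_tokenize(text):
    """Character-level split, as is typical for Chinese text."""
    return list(text)


def space_tokenize(text):
    """Whitespace split, the behaviour expected when doc_lang is en."""
    return text.split(" ")


# Sentence prefix reconstructed from the error message (an assumption).
sentence = (
    "As of early Tuesday there was no claim of responsibility . "
    "Prayuth Chan - ocha , the head of Thailand ’ s military government"
)
span = "Prayuth Chan - ocha"
start, end = 11, 15  # the annotated range from the error message

word_tokens = space_tokenize(sentence)
char_tokens = char_tokenize(sentence)

# With whitespace tokens, the annotated range recovers the span.
word_slice = " ".join(word_tokens[start:end])
# With character tokens, the same range covers only four characters.
char_slice = "".join(char_tokens[start:end])

print(repr(word_slice))
print(repr(char_slice))
```

Under these assumptions the range (11, 15) only matches the annotated span when the sentence is split on whitespace, which is consistent with the ValueError being raised for character-level tokens.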

xxllp commented 2 years ago

I downloaded the code from GitHub, so the version is correct.

xxllp commented 2 years ago

Could there be a problem somewhere in the WikiEvents data processing script?

Spico197 commented 2 years ago

It ran fine in my offline tests. If convenient, please give me a bit more information, or try debugging it locally.

xxllp commented 2 years ago

The error message is above.
For this English sentence, tokenizer.dee_tokenize returns the following. Is this normal? ['[UNK]', 's', '[UNK]', 'o', 'f', '[UNK]', 'e', 'a', 'r', 'l', 'y', '[UNK]', '[UNK]', 'u', 'e', 's', 'd', 'a', 'y', '[UNK]', 't', 'h', 'e', 'r', 'e', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', '[UNK]', 'c', 'l', 'a', 'i', 'm', '[UNK]', 'o', 'f', '[UNK]', 'r', 'e', 's', 'p', 'o', 'n', 's', 'i', 'b', 'i', 'l', 'i', 't', 'y', '[UNK]', '.', '[UNK]', '[UNK]', 'r', 'a', 'y', 'u', 't', 'h', '[UNK]', '[UNK]', 'h', 'a', 'n', '[UNK]', '-', '[UNK]', 'o', 'c', 'h', 'a', '[UNK]', ',', '[UNK]', 't', 'h', 'e', '[UNK]', 'h', 'e', 'a', 'd', '[UNK]', 'o', 'f', '[UNK]', '[UNK]', 'h', 'a', 'i', 'l', 'a', 'n', 'd', '[UNK]', '’', '[UNK]', 's', '[UNK]', 'm', 'i', 'l', 'i', 't', 'a', 'r', 'y', '[UNK]', 'g', 'o', 'v', 'e', 'r', 'n', 'm', 'e', 'n', 't', '[UNK]', ',', '[UNK]', 's', 'a', 'i', 'd', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 't', 'h', 'e', '[UNK]', 'a', 'u', 't', 'h', 'o', 'r', 'i', 't', 'i', 'e', 's', '[UNK]', 'w', 'e', 'r', 'e', '[UNK]', 's', 'e', 'a', 'r', 'c', 'h', 'i', 'n', 'g', '[UNK]', 'f', 'o', 'r', '[UNK]', 'a', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 's', 'e', 'e', 'n', '[UNK]', 'o', 'n', '[UNK]', 'c', 'l', 'o', 's', 'e', 'd', '[UNK]', '-', '[UNK]', 'c', 'i', 'r', 'c', 'u', 'i', 't', '[UNK]', 'f', 'o', 'o', 't', 'a', 'g', 'e', '[UNK]', 'b', 'u', 't', '[UNK]', 't', 'h', 'a', 't', '[UNK]', 'i', 't', '[UNK]', 'w', 'a', 's', '[UNK]', 'n', 'o', 't', '[UNK]', 'c', 'l', 'e', 'a', 'r', '[UNK]', 'w', 'h', 'o', '[UNK]', 't', 'h', 'e', '[UNK]', 'p', 'e', 'r', 's', 'o', 'n', '[UNK]', 'w', 'a', 's', '[UNK]', ',', '[UNK]', 'n', 'e', 'w', 's', '[UNK]', 'a', 'g', 'e', 'n', 'c', 'i', 'e', 's', '[UNK]', 'r', 'e', 'p', 'o', 'r', 't', 'e', 'd', '[UNK]', '.']

Spico197 commented 2 years ago

No, that is not normal. It should split on whitespace.

xxllp commented 2 years ago

Right, it looks like the problem is in the splitting step.

xxllp commented 2 years ago

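I've just fixed that, but in the subsequent training the prediction metrics are all 0 after several epochs. Is something misaligned for the English data?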

Spico197 commented 2 years ago

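I tried again locally. There should be no data splitting problem, and training should work out of the box. Could you tell me what changes you made?
All zeros is actually fairly normal on WikiEvents because the dataset is so small, so I recommend using a pretrained model. With randomly initialized embeddings, as used for DuEE-fin and ChFinAnn, performance will be very poor.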

xxllp commented 2 years ago

In BertTokenizerForDocEE I set self.dee_tokenize = self.dee_space_tokenize unconditionally, skipping the language check. It seems the check was still detecting the language as Chinese.
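
A minimal sketch of the kind of language dispatch being bypassed here, under stated assumptions: `ToyDocTokenizer`, its `doc_lang` default of `"zh"`, and `dee_char_tokenize` are hypothetical stand-ins, not the actual BertTokenizerForDocEE implementation. It only shows how a language setting that never reaches the constructor would leave character-level splitting in place.

```python
class ToyDocTokenizer:
    """Hypothetical stand-in for a tokenizer that switches on document language."""

    def __init__(self, doc_lang="zh"):
        # Assumed default: Chinese, i.e. character-level tokenization.
        self.doc_lang = doc_lang
        if doc_lang == "en":
            self.dee_tokenize = self.dee_space_tokenize
        else:
            self.dee_tokenize = self.dee_char_tokenize

    def dee_space_tokenize(self, text):
        """Split on single spaces (word-level tokens)."""
        return text.split(" ")

    def dee_char_tokenize(self, text):
        """Split into individual characters."""
        return list(text)


# If the language setting is not passed through, the default applies.
default_tok = ToyDocTokenizer()
english_tok = ToyDocTokenizer(doc_lang="en")

print(default_tok.dee_tokenize("no claim"))
print(english_tok.dee_tokenize("no claim"))
```

If the real class behaves similarly, checking which value of doc_lang actually reaches the tokenizer at construction time may be a less invasive fix than hard-coding the whitespace tokenizer.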

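Which pretrained model do you mean? Is it BERT that gets loaded at initialization, or a model you trained yourself? I haven't seen one of those in the repo.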

Spico197 commented 2 years ago

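That is very strange. I re-cloned the repo and regenerated the data, and did not hit this problem orz
The script has a use_bert flag that can be set to True, which switches to the BERT+CRF encoding scheme. It may run out of memory (OOM), though, so adjust the batch size and related parameters accordingly.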

xxllp commented 2 years ago

Got it. Some change in my local code probably caused this.

xxllp commented 2 years ago

I don't think the problem is there. It is more likely that the output of self.dee_space_tokenize contains many UNK tokens, including inside the entities.

xxllp commented 2 years ago

With BERT there are no results either.

Spico197 commented 2 years ago

Are you using a cased or an uncased model? If there are many UNKs, you could lowercase all strings and use the uncased model, or just try the cased model.
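
The effect of lowercasing can be checked with a quick UNK-rate count. The sketch below uses a toy lowercase vocabulary as a stand-in for an uncased model's vocab file and assumes each whitespace token is looked up as a whole; `unk_rate` is a hypothetical helper, not part of dee.

```python
def unk_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


# Toy lowercase vocabulary standing in for an uncased model's vocab file.
toy_uncased_vocab = {
    "as", "of", "early", "tuesday", "there", "was",
    "no", "claim", "responsibility", ".",
}

tokens = "As of early Tuesday there was no claim of responsibility .".split(" ")

# Looking up the raw tokens misses every capitalized word.
raw_rate = unk_rate(tokens, toy_uncased_vocab)
# Lowercasing first lets those words match the uncased vocabulary.
lowered_rate = unk_rate([tok.lower() for tok in tokens], toy_uncased_vocab)

print(f"raw: {raw_rate:.2f}, lowercased: {lowered_rate:.2f}")
```

Running the same count over the real data and the real vocabulary would show whether capitalization explains most of the UNKs or whether rare words are the larger cause.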

xxllp commented 2 years ago

I'm using the uncased model.

xxllp commented 2 years ago

I tried it and it looks the same: all 0 for several epochs. What F1 did you end up with locally?

Spico197 commented 2 years ago

I only ran it in debug mode to check that training works, so I have no training results yet.

xxllp commented 2 years ago

In that case, I think something definitely still needs to be changed for this English dataset.

xxllp commented 2 years ago

I switched to another dataset and now get results, but they are not very high. What would be a good way to get rid of the UNK tokens?
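
One common way to reduce UNKs for rare words is WordPiece-style sub-word splitting, with a map from each word to its first piece so that word-level span ranges can still be located. The sketch below is a simplified greedy longest-match splitter over a toy vocabulary. All names are hypothetical, and whether DocEE's span handling can consume such an alignment is an assumption this thread does not confirm.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_with_alignment(words, vocab):
    """Return sub-word pieces plus, for each word, the index of its first piece."""
    pieces, first_piece_index = [], []
    for word in words:
        first_piece_index.append(len(pieces))
        pieces.extend(wordpiece(word.lower(), vocab))
    return pieces, first_piece_index


# Toy vocabulary; a real model ships its own vocab file.
toy_vocab = {"pray", "##uth", "chan", "-", "och", "##a", ","}
words = "Prayuth Chan - ocha ,".split(" ")
pieces, first_piece_index = tokenize_with_alignment(words, toy_vocab)

print(pieces)
print(first_piece_index)
```

With this kind of alignment, a word-level range such as (0, 4) can be mapped to piece indices through first_piece_index instead of being compared against the piece list directly.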

Spico197 / DocEE

Experiments on English datasets such as WikiEvents #51