Question about convert_gec_data_to_parsing_data_chinese.py

sunbo1999 commented 1 year ago

import copy

class Tree_Transformer:
    """Transform the syntax tree according to the m2 edits; the golden syntax tree must be provided in advance.

    Arguments:
        Errorifier {Class} -- parent class, only adds noise
    """

    def __init__(self, sentence_conllx: list, m2: list):
        """Constructor.

        Arguments:
            sentence_conllx {list} -- sentence in CoNLL-X format
        """
        self.original_sentence = " ".join([t[1] for t in sentence_conllx])

        assert self.original_sentence == m2[0][2:], print(self.original_sentence, m2[0][2:])
        self.edits = m2[1:]
        self.original_conllx = copy.deepcopy(sentence_conllx)
        self.sentence = self.original_sentence
        self.conllx = sentence_conllx
        self.tokenized = None
        self.tokenize()
        self.parse_conllx()

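Here, sentence_conllx is a sentence in CoNLL-X format, i.e. the output of stanza dependency parsing. It looks like this (columns: index, token, head, relation):

1   他    2   nsubj
2   是    0   root
3   个    8   clf
4   很    6   advmod
5   要    8   rcmod
6   强    8   rcmod
7   的    5   cpm
8   人    2   attr
9   ，    2   punct
10  想    2   conj
11  不到  10  dep
12  被    16  pass
13  这    15  det
14  个    13  clf
15  问题  16  nsubj
16  难住  10  dep
17  了    16  asp
18  。    2   punct

So original_sentence should be the whole sentence after word segmentation, joined with spaces. The m2 file, however, contains character-level edit operations, in this format:

S 他是个很要强的人，想不到被这个问题难住了。
T0-A0 他是个很要强的人，想不到被这个问题把他难住了。
A 17 17|||M|||把他|||REQUIRED|||-NONE-|||0

Therefore original_sentence can never be equal to m2[0][2:]. Is there a problem in the code, or did I make a mistake somewhere in my data processing?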
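To make the mismatch concrete, here is a minimal sketch that applies the M2 edit above to two tokenizations of the same sentence. The helper `apply_m2_edit` is hypothetical (it is not part of SynGEC) and assumes the edit layout `A start end|||type|||correction|||...`; the word-level list copies the stanza tokens shown above.

```python
def apply_m2_edit(tokens, edit_line):
    """Apply one M2 edit line to a token list and return the new list.

    Hypothetical helper for illustration; assumes the edit has the form
    'A start end|||type|||correction|||...'.
    """
    span, _edit_type, correction = edit_line[2:].split("|||")[:3]
    start, end = (int(i) for i in span.split())
    # Insert the correction one character at a time (an empty correction deletes the span).
    new_tokens = [c for c in correction if not c.isspace()]
    return tokens[:start] + new_tokens + tokens[end:]


source = "他是个很要强的人，想不到被这个问题难住了。"
target = "他是个很要强的人，想不到被这个问题把他难住了。"
edit = "A 17 17|||M|||把他|||REQUIRED|||-NONE-|||0"

char_tokens = list(source)
word_tokens = "他 是 个 很 要 强 的 人 ， 想 不到 被 这 个 问题 难住 了 。".split()

char_result = "".join(apply_m2_edit(char_tokens, edit))
word_result = "".join(apply_m2_edit(word_tokens, edit))

print(char_result == target)  # offsets read as character indices
print(word_result == target)  # offsets read as word indices
```

Only the character-level reading reproduces the corrected sentence: offset 17 points at 难 when counting characters, but at the final 。 when counting the 18 word-level tokens.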

HillZhang1999 commented 1 year ago

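Sorry, our description of the pipeline may be somewhat misleading. The target-side syntax tree here should come from a character-level Chinese off-the-shelf parser trained on a clean treebank. You first need to convert the trees in a clean treebank (e.g. CTB-7) to the character level and then train an off-the-shelf parser on the result. I will release our trained character-level off-the-shelf parser shortly.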

HillZhang1999 commented 1 year ago

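You can also train a character-level off-the-shelf parser yourself (for example with stanza). Just convert the original CTB corpus to the character level with https://github.com/HillZhang1999/SynGEC/blob/main/utils/convert_chinese_treebank_from_word_to_char.py and then train on the new character-level CTB with the same hyperparameters. Note, however, that when predicting with stanza, the tokenization (character) granularity must match that of the m2 file: split the sentence into characters first (with segment_bert), then parse. Otherwise you will still run into the mismatch you described.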
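The granularity requirement can be checked before any tree is built. This is a minimal sketch with two assumptions: the text is plain Chinese, so one character is one token (the real segment_bert step may treat Latin words or digits differently), and the `S` line of the m2 file stores the tokens separated by spaces, which is what the assert in `Tree_Transformer.__init__` compares against. Both helper names are hypothetical.

```python
def to_char_tokens(sentence):
    """Split a sentence into single-character tokens, ignoring whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def forms_match_m2(forms, m2_source_line):
    """Mirror the check in Tree_Transformer.__init__: joined forms vs. the 'S ' line."""
    return " ".join(forms) == m2_source_line[2:]


sentence = "他是个很要强的人，想不到被这个问题难住了。"
chars = to_char_tokens(sentence)
m2_source_line = "S " + " ".join(chars)  # assumed character-level m2 source line

parser_input = " ".join(chars)  # character-spaced string, as in the predict calls in this thread
word_forms = "他 是 个 很 要 强 的 人 ， 想 不到 被 这 个 问题 难住 了 。".split()

print(forms_match_m2(chars, m2_source_line))       # character-level forms
print(forms_match_m2(word_forms, m2_source_line))  # word-level forms
```

The character-level forms pass the check and the word-level forms fail it, which is the failure the assert reports.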

sunbo1999 commented 1 year ago

Hello, will the trained character-level off-the-shelf parser be released?

HillZhang1999 commented 1 year ago

It will be uploaded by tomorrow evening.

sunbo1999 commented 1 year ago

OK, thanks for the reply.

HillZhang1999 commented 1 year ago

Hello, it has been uploaded; please see https://github.com/HillZhang1999/SynGEC#gopar. Download biaffine-dep-electra-zh-char and run prediction with supar (for the full procedure, see https://github.com/HillZhang1999/SynGEC/blob/main/bash/chinese_exp/pipeline_gopar.sh).

sunbo1999 commented 1 year ago

Hello, I am running into a problem when loading biaffine-dep-electra-zh-char:

from supar import Parser
# path = './emnlp2022_syngec_biaffine-dep-electra-zh-gopar'  # earlier model, loads fine
path = './emnlp2022_syngec_biaffine-dep-electra-zh-char'
dep = Parser.load(path)
res = dep.predict("语法纠错", verbose=False, lang='zh', prob=True)

The error is: AttributeError: Can't get attribute 'AttachmentMetricWithGED' on <module 'supar.utils.metric'

Loading your previously released biaffine-dep-electra-zh-gopar in the same way works fine. Is something wrong with the uploaded model, or with the way I am loading it?

sunbo1999 commented 1 year ago

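Also, I noticed that if the Chinese input is a space-separated sentence, the dependency parse comes out at the character level. Is this equivalent to the character-level off-the-shelf parser you mentioned earlier?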

dep = Parser.load("biaffine-dep-electra-zh")
res = dep.predict("他 是 个 很 要 强 的 人 ， 想 不 到 被 这 个 问 题 难 住 了 。", verbose=False, lang='zh', prob=True)

1   他  2   nsubj
2   是  0   root
3   个  8   clf
4   很  6   advmod
5   要  8   rcmod
6   强  8   rcmod
7   的  5   cpm
8   人  2   attr
9   ，  2   punct
10  想  2   conj
11  不  10  dep
12  到  10  dep
13  被  18  pass
14  这  17  det
15  个  14  clf
16  问  17  nn
17  题  18  nsubj
18  难  10  dep
19  住  18  rcomp
20  了  18  asp
21  。  2   punct

HillZhang1999 commented 1 year ago

This parser was trained at the word level, so forcing character-level input on it may give incorrect results.

HillZhang1999 commented 1 year ago

Hello, I just reproduced this, and it is indeed a bug. The fix is to add an empty class named AttachmentMetricWithGED to metric.py. For example, if the error reads AttributeError: Can't get attribute 'AttachmentMetricWithGED' on <module 'supar.utils.metric' from '/home/pai/envs/test/lib/python3.8/site-packages/supar/utils/metric.py'>, then the path to metric.py is '/home/pai/envs/test/lib/python3.8/site-packages/supar/utils/metric.py'. Insert the following code into that file:

class AttachmentMetricWithGED(Metric):
    def __init__(self, eps=1e-12):
        super().__init__()
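
Why an empty class is enough: "Can't get attribute ... on <module ...>" is the message Python's pickle raises when a saved object refers to a class that no longer exists under the recorded module path. Restoring the object only needs that name to resolve again. The sketch below reproduces this with plain `pickle` and a throwaway module named `fake_metric`. It does not use supar or the real checkpoint, and the idea that the checkpoint stores such a metric object is only inferred from the error message.

```python
import pickle
import sys
import types

# Throwaway module standing in for supar.utils.metric (hypothetical, for illustration only).
fake_metric = types.ModuleType("fake_metric")
sys.modules["fake_metric"] = fake_metric


def define_empty_class(name):
    """Create an empty class and register it as fake_metric.<name>."""
    cls = type(name, (), {})
    cls.__module__ = "fake_metric"
    setattr(fake_metric, name, cls)
    return cls


# 1. "Training side": the class exists, so an instance can be saved.
payload = pickle.dumps(define_empty_class("AttachmentMetricWithGED")())

# 2. "User side": the installed module lacks the class, so loading fails.
delattr(fake_metric, "AttachmentMetricWithGED")
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = str(exc)
print(load_error)

# 3. Adding an empty class under the same name lets the load succeed.
define_empty_class("AttachmentMetricWithGED")
restored = pickle.loads(payload)
print(type(restored).__name__)
```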

sunbo1999 commented 1 year ago

Hello, after loading the biaffine-dep-electra-zh-char model, the output still does not seem to be fully character-level:

path = './emnlp2022_syngec_biaffine-dep-electra-zh-char'
dep = Parser.load(path)
res = dep.predict(["大 型 国 营 企 业 发 生 的 损 失 ， 很 大 程 度 上 是 与 企 业 内 控 制 机 制 缺 失 ， 监 督 管 理 不 严 有 关 。"], verbose=False, lang='zh', prob=True)

1   大    2   app
2   型    6   amod
3   国    4   app
4   营    6   amod
5   企    6   app
6   业    8   nsubj
7   发    8   app
8   生    11  rcmod
9   的    8   cpm
10  损    11  app
11  失，  33  punct
12  很    13  app
13  大程  14  app
14  度    15  lobj
15  上是  33  loc
16  与    33  prep
17  企    18  app
18  业    19  lobj
19  内    23  dep
20  控    21  app
21  制    23  nn
22  机    23  app
23  制    25  nsubj
24  缺    25  app
25  失    16  pccomp
26  ，    25  punct
27  监    28  app
28  督    30  nn
29  管    30  app
30  理    31  nsubj
31  不严  25  conj
32  有    33  app
33  关    0   root
34  。    33  punct

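You noted earlier that when predicting with stanza, the tokenization (character) granularity must match the m2 file, i.e. split into characters first (with segment_bert) and then parse, otherwise the mismatch will still occur. Following that advice, my input is already split into characters (the sentence is space-separated), so why does the result still contain word-level tokens?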
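
One quick way to locate the merged tokens is to scan the parser output for forms longer than one character. This is a minimal sketch that assumes pure Chinese input where every token should be a single character. The helper name is hypothetical, and the flat string is the output pasted above (index, token, head, relation for each token).

```python
def find_multichar_tokens(rows):
    """Return (index, form) for every token whose form is longer than one character."""
    return [(tok_id, form) for tok_id, form, *_ in rows if len(form) > 1]


flat = (
    "1 大 2 app 2 型 6 amod 3 国 4 app 4 营 6 amod 5 企 6 app 6 业 8 nsubj "
    "7 发 8 app 8 生 11 rcmod 9 的 8 cpm 10 损 11 app 11 失， 33 punct "
    "12 很 13 app 13 大程 14 app 14 度 15 lobj 15 上是 33 loc 16 与 33 prep "
    "17 企 18 app 18 业 19 lobj 19 内 23 dep 20 控 21 app 21 制 23 nn "
    "22 机 23 app 23 制 25 nsubj 24 缺 25 app 25 失 16 pccomp 26 ， 25 punct "
    "27 监 28 app 28 督 30 nn 29 管 30 app 30 理 31 nsubj 31 不严 25 conj "
    "32 有 33 app 33 关 0 root 34 。 33 punct"
)

# Regroup the flat string into (index, form, head, relation) rows.
fields = flat.split()
rows = [
    (int(fields[i]), fields[i + 1], int(fields[i + 2]), fields[i + 3])
    for i in range(0, len(fields), 4)
]

merged = find_multichar_tokens(rows)
print(len(rows), sum(len(form) for _, form, *_ in rows))
print(merged)
```

On this output the scan reports tokens 11, 13, 15 and 31 (失，, 大程, 上是, 不严): 34 tokens cover the 38 input characters, so four neighbouring characters were merged somewhere between the input and the printed result.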

HillZhang1999 commented 1 year ago

For how to run the parser, please refer to https://github.com/HillZhang1999/SynGEC/blob/main/src/src_gopar/parse.py. In my test, the behavior you describe does not occur:

1       大      _       _       _       _       2       app     _       _
2       型      _       _       _       _       6       amod    _       _
3       国      _       _       _       _       4       app     _       _
4       营      _       _       _       _       6       amod    _       _
5       企      _       _       _       _       6       app     _       _
6       业      _       _       _       _       8       nsubj   _       _
7       发      _       _       _       _       8       app     _       _
8       生      _       _       _       _       11      rcmod   _       _
9       的      _       _       _       _       8       cpm     _       _
10      损      _       _       _       _       11      app     _       _
11      失      _       _       _       _       37      dep     _       _
12      ，      _       _       _       _       37      punct   _       _
13      很      _       _       _       _       14      advmod  _       _
14      大      _       _       _       _       16      amod    _       _
15      程      _       _       _       _       16      app     _       _
16      度      _       _       _       _       17      lobj    _       _
17      上      _       _       _       _       37      loc     _       _
18      是      _       _       _       _       37      cop     _       _
19      与      _       _       _       _       37      prep    _       _
20      企      _       _       _       _       21      app     _       _
21      业      _       _       _       _       22      lobj    _       _
22      内      _       _       _       _       26      dep     _       _
23      控      _       _       _       _       24      app     _       _
24      制      _       _       _       _       26      nn      _       _
25      机      _       _       _       _       26      app     _       _
26      制      _       _       _       _       28      nsubj   _       _
27      缺      _       _       _       _       28      app     _       _
28      失      _       _       _       _       19      pccomp  _       _
29      ，      _       _       _       _       28      punct   _       _
30      监      _       _       _       _       31      app     _       _
31      督      _       _       _       _       33      nn      _       _
32      管      _       _       _       _       33      app     _       _
33      理      _       _       _       _       35      nsubj   _       _
34      不      _       _       _       _       35      neg     _       _
35      严      _       _       _       _       28      conj    _       _
36      有      _       _       _       _       37      app     _       _
37      关      _       _       _       _       0       root    _       _
38      。      _       _       _       _       37      punct   _       _

HillZhang1999 / SynGEC

Question about convert_gec_data_to_parsing_data_chinese.py #8