Out-of-memory problem when BuildData preprocesses long samples

I would also like to ask a question: my source data already contains char_span, but when I set add_char_span to false in preprocess/build_data_config.yaml, I get this error:

  File "/nlp_data/hyy/tplinker/common/utils.py", line 337, in add_tok_span
    rel["obj_tok_span"] = char_span2tok_span(obj_char_span, char2tok_span)
  File "/nlp_data/hyy/tplinker/common/utils.py", line 327, in char_span2tok_span
    tok_span = [tok_span_list[0][0], tok_span_list[-1][1]]
IndexError: list index out of range

Before running BuildData, my data looks like this:

[
  {
    "id": "test_0",
    "text": "1.血压控制目标......(5000-10000 characters omitted here)......第三医院\n内分泌乖卜1\n",
    "relation_list": [
      {
        "subject": "血压",
        "subj_char_span": [2, 4],
        "object": "糖尿病",
        "obj_char_span": [9, 12],
        "predicate": "Test_Disease"
      },
      (several hundred relations omitted here)
    ],
    "entity_list": [
      {
        "text": "2型糖尿病",
        "type": "Disease",
        "char_span": [411, 416]
      },
      (several hundred entities omitted here)
    ],

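Set a breakpoint yourself and see what is going out of range. If you already have char spans, do not add them automatically: by default, for any two entities that share a relation, every matching char_span of those entities is given that relation, so with a large number of entities and relations this naturally introduces many redundant (or wrong) relations. build data exists only to add char spans and token spans; you can add the token spans yourself, based on your own data and the encoder you use.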
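As a rough illustration of adding token spans yourself, the sketch below assumes the encoder can report each token's character offsets as (start, end) pairs. The helper names build_char2tok and char_span_to_tok_span and the toy whitespace tokenizer are hypothetical stand-ins for illustration, not TPLinker's own code.

```python
import re

def toy_offsets(text):
    """Hypothetical stand-in for an encoder's offset mapping:
    one (start, end) pair per whitespace-delimited token."""
    return [m.span() for m in re.finditer(r"\S+", text)]

def build_char2tok(text, offsets):
    """Map every character index to the index of the token covering it (or None)."""
    char2tok = [None] * len(text)
    for tok_idx, (start, end) in enumerate(offsets):
        for ch in range(start, end):
            char2tok[ch] = tok_idx
    return char2tok

def char_span_to_tok_span(char_span, char2tok):
    """Convert a [start, end) character span into a [start, end) token span."""
    covered = [t for t in char2tok[char_span[0]:char_span[1]] if t is not None]
    if not covered:
        # Fail with a readable message instead of a bare IndexError.
        raise ValueError(f"char span {char_span} covers no token; "
                         "check that the span matches the text")
    return [covered[0], covered[-1] + 1]

text = "aspirin treats mild headache"
char2tok = build_char2tok(text, toy_offsets(text))
tok_span = char_span_to_tok_span([15, 28], char2tok)  # span of "mild headache"
```

Raising a descriptive error when a span covers no token makes a misaligned char_span much easier to trace than an index error deep inside the conversion.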

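Before running BuildData I had already checked the char_span values myself: slicing the text with each char_span gives back the corresponding entity name. However, I noticed that once my text has passed through the clean_data_wo_span function in utils.py, which BuildData calls, its length no longer matches the original. Even though I set separate_char_by_white to false, runs of consecutive whitespace were still removed, and after that the char_span values no longer line up with the text.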
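The effect described above can be reproduced on a toy string. This is a minimal sketch: the sample text, the entities, and the helper find_mismatched_spans are made up for illustration, and the last step only assumes that the failing tok_span_list in the traceback is a slice of a per-character list.

```python
import re

def find_mismatched_spans(text, entities):
    """Return the entities whose char_span slice does not reproduce their text."""
    return [e for e in entities
            if text[e["char_span"][0]:e["char_span"][1]] != e["text"]]

original = "BP  target:\n  diabetes  care"
entities = [
    {"text": "diabetes", "char_span": [14, 22]},
    {"text": "care", "char_span": [24, 28]},
]

# Spans are valid against the original text.
ok_before = find_mismatched_spans(original, entities)

# The same normalization that clean_text applies unconditionally.
cleaned = re.sub(r"\s+", " ", original).strip()

# After cleaning, the text is shorter and both spans point at the wrong place.
bad_after = find_mismatched_spans(cleaned, entities)

# A span that now lies past the end of the text slices to an empty list,
# and indexing element 0 of an empty list raises IndexError.
per_char = list(cleaned)
empty_slice = per_char[24:28]
```

Running a check like find_mismatched_spans on the text both before and after cleaning shows at which step the spans stop matching.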

# Excerpt from utils.py; needs "import re" and "from tqdm import tqdm".
def clean_data_wo_span(self, ori_data, separate=False, data_type="train"):
    '''
    Remove duplicate whitespace,
    and add whitespace around tokens to separate special characters from them.
    '''
    def clean_text(text):
        # Runs unconditionally: collapses every whitespace run into one space.
        text = re.sub(r"\s+", " ", text).strip()
        if separate:
            # Pad each non-alphanumeric character with spaces, then re-collapse.
            text = re.sub(r"([^A-Za-z0-9])", r" \1 ", text)
            text = re.sub(r"\s+", " ", text).strip()
        return text

    for sample in tqdm(ori_data, desc="clean data"):
        sample["text"] = clean_text(sample["text"])
        if data_type == "test":
            # Skip relation cleaning for test data.
            continue
        for rel in sample["relation_list"]:
            rel["subject"] = clean_text(rel["subject"])
            rel["object"] = clean_text(rel["object"])
    return ori_data

Are you sure that, in the clean_text function, text = re.sub("\s+", " ", text).strip() is supposed to sit outside the if separate: block? Thanks.

Apr 17: After commenting out "text = re.sub("\s+", " ", text).strip()", the problem went away.
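For reference, the workaround amounts to something like the standalone sketch below, where whitespace is only touched when separate is set. This is an illustration of the change described above, not a fix taken from the repository, and it assumes the data already carries char spans that must stay aligned with the raw text.

```python
import re

def clean_text(text, separate=False):
    """Leave the text untouched unless separate is set,
    so that existing char spans stay valid."""
    if separate:
        # Pad each non-alphanumeric character with spaces, then collapse runs.
        text = re.sub(r"([^A-Za-z0-9])", r" \1 ", text)
        text = re.sub(r"\s+", " ", text).strip()
    return text

unchanged = clean_text("a  b\n c")             # whitespace preserved
separated = clean_text("a,b", separate=True)   # punctuation padded
```

With separate left at False the function returns the text unchanged, so offsets computed on the raw text remain usable.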

131250208 / TPlinker-joint-extraction

Out-of-memory problem when BuildData preprocesses long samples #40