ShannonAI / mrc-for-flat-nested-ner

Code for ACL 2020 paper `A Unified MRC Framework for Named Entity Recognition`

Error after switching to my own dataset: training fails #108

Open gjy-code opened 2 years ago

gjy-code commented 2 years ago

I get an error when training on my own MRC-format dataset:

    Traceback (most recent call last):
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 430, in <module>
        main()
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 417, in main
        trainer.fit(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
        result = fn(self, *args, **kwargs)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
        self.accelerator_backend.train(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
        self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
        results = self.trainer.run_pretrain_routine(model)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
        self.train()
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
        self.run_training_epoch()
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 479, in run_training_epoch
        enumerate(_with_is_last(train_dataloader)), "get_train_batch"
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py", line 78, in profile_iterable
        value = next(iterator)
      File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 1323, in _with_is_last
        for val in it:
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in __getitem__
        new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
      File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in <listcomp>
        new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
    KeyError: 46

Josson commented 2 years ago

Hi, have you solved this? It doesn't work with my own dataset either.

guantao18 commented 2 years ago

Haha, I have the same problem. Have you solved it? I suspect it's caused by the Chinese text.

Josson commented 2 years ago

@guantao18 It's probably a problem with Chinese mixed with English or other characters.

guantao18 commented 2 years ago

@Josson Yes, the problem comes from BERT's WordPiece tokenizer. For English words and digits, BERT splits by longest match, so if the annotations were not made along those boundaries, the positions before and after tokenization end up misaligned. The fix is either to re-segment the annotated data according to BERT's tokenization rules and re-annotate it, or to prepend # to everything containing letters or digits. With that, training works, though I don't know whether it introduces new problems.
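
To make the misalignment concrete, here is a minimal sketch of the character-offset bookkeeping that datasets/mrc_ner_dataset.py performs (the traceback above fails in exactly this lookup). BertTokenizerFast and the checkpoint name are stand-ins, not necessarily what the repo loads:

    # Sketch: map character offsets to WordPiece token indices, the way
    # mrc_ner_dataset.py builds origin_offset2token_idx_start/end.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

    context = "用GTX1080训练"  # Chinese mixed with an alphanumeric run
    encoding = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)

    origin_offset2token_idx_start = {}
    origin_offset2token_idx_end = {}
    for token_idx, (char_start, char_end) in enumerate(encoding["offset_mapping"]):
        origin_offset2token_idx_start[char_start] = token_idx
        origin_offset2token_idx_end[char_end] = token_idx

    print(encoding.tokens())
    # Something like ['用', 'GT', '##X', '##10', '##80', '训', '练']: the run
    # 'GTX1080' is split by longest match, so only a few character offsets
    # inside it ever appear as keys in the two maps.

    # A gold span whose end falls inside that run (e.g. an annotation ending
    # at character offset 4) has no entry in origin_offset2token_idx_end, so
    # the lookup raises KeyError -- the same failure as `KeyError: 46` above.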

Josson commented 2 years ago

@guantao18 Have you run into the problem that the mrc-ner scripts insist on using GPU 0? How can that be solved?

guantao18 commented 2 years ago

@Josson Don't hard-code a GPU id in the script; delete it and the program will find an available GPU by itself. If you don't need multi-GPU training, just set the argument gpus="1".
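
For reference, a minimal sketch of the two settings, assuming the pytorch-lightning 0.9-era Trainer that the traceback above shows this repo using (in that API an integer means a count of GPUs, while a string selects specific device ids):

    import pytorch_lightning as pl

    # gpus as an int: "use this many GPUs"; lightning starts from device 0
    trainer_count = pl.Trainer(gpus=1)

    # gpus as a string: "use exactly these device ids", here physical GPU 1
    trainer_by_id = pl.Trainer(gpus="1")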

gjy-code commented 2 years ago

> @guantao18 It's probably a problem with Chinese mixed with English or other characters.

Hi, have you solved it by now? How did you solve it?

gjy-code commented 2 years ago

> @Josson Yes, the problem comes from BERT's WordPiece tokenizer. For English words and digits, BERT splits by longest match, so if the annotations were not made along those boundaries, the positions before and after tokenization end up misaligned. The fix is either to re-segment the annotated data according to BERT's tokenization rules and re-annotate it, or to prepend # to everything containing letters or digits. With that, training works, though I don't know whether it introduces new problems.

Hi, have you solved it?

Josson commented 2 years ago

@gjy-code I switched to a different tokenization scheme that doesn't use WordPiece, and it runs now.
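
One WordPiece-free option for Chinese is plain character-level tokenization: every character becomes its own token, so the character-offset maps can never miss a key. A minimal sketch of the idea (char_tokenize and the toy vocab are hypothetical, not necessarily what was used here):

    def char_tokenize(text, vocab, unk_token="[UNK]"):
        """Tokenize character by character; offsets line up 1:1 with tokens."""
        tokens = [ch if ch in vocab else unk_token for ch in text]
        offsets = [(i, i + 1) for i in range(len(text))]
        return tokens, offsets

    vocab = set("我用GTX1080训练")
    tokens, offsets = char_tokenize("我用GTX1080训练", vocab)
    print(tokens)   # ['我', '用', 'G', 'T', 'X', '1', '0', '8', '0', '训', '练']
    print(offsets)  # [(0, 1), (1, 2), ..., (10, 11)] -- every offset is a key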

Josson commented 2 years ago

@guantao18 After I deleted it, GPU 0 was taken by someone else and I still get an out-of-memory error on GPU 0.

guantao18 commented 2 years ago

@Josson Reduce max-length to under 200; if that's not enough, reduce batch-size as well. If it still fails, wait until the others are done with the card.
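
If GPU 0 is busy, another way out is to hide it from the process entirely. A minimal sketch, assuming a free physical GPU 1: setting CUDA_VISIBLE_DEVICES before CUDA is initialized remaps the visible devices, so cuda:0 inside the program is really the free card.

    import os

    # Must happen before torch initializes CUDA: expose only physical GPU 1,
    # which the process then sees as cuda:0; GPU 0 can no longer be touched.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import torch
    print(torch.cuda.device_count())      # 1
    print(torch.cuda.get_device_name(0))  # the physical GPU 1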

bannima commented 2 years ago

@Josson Which tokenizer did you use?

liulizuel commented 2 years ago

If the text is English, you can fix the problem by collapsing runs of whitespace into single spaces during preprocessing; see the lines marked # here !!! in the snippet below:

        # Merge SQuAD-style paragraphs into MRC-NER examples, normalizing all
        # whitespace so that character offsets and word indices stay aligned.
        merged_multi_span_data = []
        for p in data['data'][0]['paragraphs']:
            for ques in p['qas']:
                p['context'] = " ".join(p['context'].split())   # here !!! collapse whitespace in the context
                current_example = {"id": len(merged_multi_span_data) + 1, "query": ques['question'],
                                   "context": p['context'], "start_position": [], "end_position": [],
                                   "span_position": [], "is_impossible": False}
                for ans in ques['answers']:
                    ans['text'] = " ".join(ans['text'].split())  # here !!! collapse whitespace in the answer too
                    ans_tokens = ans['text'].lower().split()
                    context_tokens = p['context'].lower().split()
                    ans_text = " ".join(ans_tokens)
                    context_text = " ".join(context_tokens)

                    # Word-level span: the start index is the number of spaces
                    # before the (first) character match of the answer; the end
                    # index is the inclusive index of the answer's last word.
                    start = p['context'][:context_text.index(ans_text)].count(" ")
                    end = start + ans['text'].count(" ")
                    current_example['start_position'].append(start)
                    current_example['end_position'].append(end)
                    current_example['span_position'].append("{};{}".format(start, end))

                merged_multi_span_data.append(current_example)
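
For reference, one merged record produced by the snippet above looks roughly like this (query, context, and positions are illustrative):

    example = {
        "id": 1,
        "query": "Find all person entities in the text.",
        "context": "John Smith visited Paris last week",
        "start_position": [0],        # "John" is word 0
        "end_position": [1],          # inclusive index of "Smith", word 1
        "span_position": ["0;1"],
        "is_impossible": False,
    }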

YowFung commented 8 months ago

@gjy-code @guantao18 @bannima @liulizuel @Josson Could I ask you all for help? I'm a beginner and don't understand much yet; my teacher suddenly handed us this GitHub project to study. How did you each prepare your own datasets? I see that many places in the code reference dataset files by absolute paths. Do all of these need to be changed, and where do I find the dataset for each path? For now I just want to get the code running.

/data2/wangshuhe/gpt3_ner/gpt3-data/ontonotes5_mrc
/data2/wangshuhe/gpt3_ner/gpt3-data/ontonotes5_mrc/test.100.simcse.dev.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/mrc-ner.test.100
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/test.100.simcse.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/test.random.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/low_resource
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/low_resource/test.10000.simcse.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.entity.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.entity.rectify.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.knn.sequence.fullprompt.verified
/nfs1/shuhe/gpt3-ner/features/conll03
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/test.100.verify.knn.jsonl
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/test.verify.knn.jsonl
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/mrc-ner.train.dev
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-3/
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-3/openai.17.knn.train.dev.sequence.fullprompt
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-full/openai.15.knn.train.dev.sequence.fullprompt
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_bert
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll/results.tmp
/nfs1/shuhe/gpt3-ner/origin_data/conll03_mrc
/nfs1/shuhe/gpt3-nmt/sup-simcse-roberta-large
/nfs1/shuhe/gpt3-nmt/data/en-fr/dev.en
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding/test.100.full.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding_sorted
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding_sorted/test.full.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/low_resource
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/low_resource/low_resource_1_knn/test.simcse.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/ontonotes5_mrc/
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/test.embedding.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/start_word_embedding
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/start_word_embedding/test.mrc.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/zh_msra
/nfs/shuhe/gpt3-ner/gpt3-data/zh_msra/test.embedding.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/ace2004/
/nfs/shuhe/gpt3-ner/gpt3-data/ace2005/
/nfs/shuhe/gpt3-ner/gpt3-data/genia/
/nfs/shuhe/gpt3-ner/models/text2vec-base-chinese
/home/wangshuhe/gpt-ner/openai_access/low_resource_data/conll_en
/home/wangshuhe/gpt-ner/openai_access/low_resource_data/conll_en/test.8.embedding.knn.jsonl

Where do I download or obtain these folders and files, and where should each of them be placed?