Open · gjy-code opened this issue 2 years ago
Hello, did you solve this? It doesn't work with my own dataset either.
Haha, I have the same problem. Did you solve it? I suspect it's a problem with the Chinese text.
@guantao18 It's most likely caused by Chinese mixed with English or other characters.
@Josson Yes, this is caused by BERT's wordpiece tokenization. For English words and digits BERT tokenizes by longest match, so if the annotations were not made under the same rule, the positions before and after tokenization end up misaligned. The fix is either to re-segment the annotated data according to BERT's tokenization rules and annotate it again, or to prepend # to everything containing letters or digits. Training works after that, but I don't know whether it introduces new problems.
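For reference, one way to do the re-alignment the previous comment describes is to map the annotated character offsets onto wordpiece tokens via the tokenizer's offset mapping. A minimal sketch, assuming the transformers fast tokenizer; the model name and the example span are illustrative, not taken from this repo:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

context = "张三在Google工作"
char_start, char_end = 3, 9  # character span of "Google", end-exclusive

encoding = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)
offsets = encoding["offset_mapping"]  # one (char_start, char_end) pair per wordpiece

# find the wordpiece tokens whose character ranges cover the annotated span;
# next() raises StopIteration if a boundary falls strictly inside a wordpiece,
# which is exactly the misalignment described above
token_start = next(i for i, (s, e) in enumerate(offsets) if s <= char_start < e)
token_end = next(i for i, (s, e) in enumerate(offsets) if s < char_end <= e)
print(token_start, token_end)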
@guantao18 Have you run into the problem that the mrc-ner scripts insist on using GPU 0? How do you solve that?
@Josson Don't specify a GPU id in the script; just delete it and the program will find an available GPU automatically. If you don't need multi-GPU training, setting the parameter gpus="1" is enough.
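If the script grabs GPU 0 simply because it is the first visible device, another common workaround (general CUDA practice, not specific to this repo) is to hide the busy cards from the process before anything initializes CUDA. A minimal sketch, where the device id 2 is just an example:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 1: cuda:0 inside the process now maps to physical GPU 2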
Hello, have you solved it by now? How did you solve it?
Hello, have you solved it?
@gjy-code I switched to a different tokenization approach that doesn't use wordpiece, and now it runs.
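The comment doesn't say which tokenizer was substituted. A common wordpiece-free choice for Chinese NER, purely an assumption here, is character-level tokenization: every character (including letters and digits) becomes its own token, so annotated character offsets and token indices coincide by construction. A minimal sketch:

def char_tokenize(text):
    # one token per character; token i spans characters [i, i + 1)
    tokens = list(text)
    offsets = [(i, i + 1) for i in range(len(text))]
    return tokens, offsets

tokens, offsets = char_tokenize("张三在Google工作")
print(tokens[3], offsets[3])  # ('G', (3, 4))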
@guantao18 After I deleted it, GPU 0 was occupied by someone else and I still got an out-of-memory error on GPU 0.
@Josson Lower max-length a bit, to under 200; if that's not enough, reduce batch-size. If that still doesn't help, wait until the other user is done.
@Josson Which tokenizer did you use?
If the text is in English, you can collapse runs of whitespace into single spaces during preprocessing (see the lines marked # here !!!):

# 'data' is the loaded SQuAD/MRC-style json: {"data": [{"paragraphs": [...]}]}
merged_multi_span_data = []
for p in data['data'][0]['paragraphs']:
    for ques in p['qas']:
        p['context'] = " ".join(p['context'].split())  # here !!! collapse whitespace in the context
        current_example = {"id": len(merged_multi_span_data) + 1, "query": ques['question'],
                           "context": p['context'], "start_position": [], "end_position": [],
                           "span_position": [], "is_impossible": False}
        for ans in ques['answers']:
            ans['text'] = " ".join(ans['text'].split())  # here !!! same normalization for the answer
            ans_tokens = ans['text'].lower().split()
            context_tokens = p['context'].lower().split()
            ans_text = " ".join(ans_tokens)
            context_text = " ".join(context_tokens)
            # word index of the answer start: spaces before the match = number of preceding words
            start = p['context'][:context_text.index(ans_text)].count(" ")
            # inclusive word index of the answer end
            end = start + ans['text'].count(" ")
            current_example['start_position'].append(start)
            current_example['end_position'].append(end)
            current_example['span_position'].append("{};{}".format(start, end))
        merged_multi_span_data.append(current_example)
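This index arithmetic only works after the normalization, because then every word boundary is exactly one space, so a word's index equals the number of spaces before it. A quick check of that invariant:

context = " ".join("EU   rejects \t German call".split())
answer = "German call"
start = context[:context.lower().index(answer.lower())].count(" ")  # 2
end = start + answer.count(" ")                                     # 3 (inclusive)
print(context.split()[start:end + 1])  # ['German', 'call']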
@gjy-code @guantao18 @bannima @liulizuel @Josson Could I ask you all for some guidance? I'm a beginner and don't understand this yet; my advisor suddenly handed us this GitHub project to study. How did you prepare your own datasets? I see the code refers to dataset files through absolute paths in many places, such as the ones listed below. Do they all have to be changed, and where do I find the dataset for each path? For now I just want to get the code running.
/data2/wangshuhe/gpt3_ner/gpt3-data/ontonotes5_mrc
/data2/wangshuhe/gpt3_ner/gpt3-data/ontonotes5_mrc/test.100.simcse.dev.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/mrc-ner.test.100
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/test.100.simcse.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/test.random.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/low_resource
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/low_resource/test.10000.simcse.32.knn.jsonl
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.entity.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.entity.rectify.knn.sequence.fullprompt
/data2/wangshuhe/gpt3_ner/gpt3-data/conll_mrc/100-results/openai.32.knn.sequence.fullprompt.verified
/nfs1/shuhe/gpt3-ner/features/conll03
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/test.100.verify.knn.jsonl
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/test.verify.knn.jsonl
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/mrc-ner.train.dev
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-3/
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-3/openai.17.knn.train.dev.sequence.fullprompt
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_2003/text-full/openai.15.knn.train.dev.sequence.fullprompt
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll_bert
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll
/nfs1/shuhe/gpt3-ner/gpt3-data/en_conll/results.tmp
/nfs1/shuhe/gpt3-ner/origin_data/conll03_mrc
/nfs1/shuhe/gpt3-nmt/sup-simcse-roberta-large
/nfs1/shuhe/gpt3-nmt/data/en-fr/dev.en
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding/test.100.full.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding_sorted
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/start_word_embedding_sorted/test.full.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/low_resource
/nfs/shuhe/gpt3-ner/gpt3-data/conll_mrc/low_resource/low_resource_1_knn/test.simcse.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/ontonotes5_mrc/
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/test.embedding.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/start_word_embedding
/nfs/shuhe/gpt3-ner/gpt3-data/zh_onto4/start_word_embedding/test.mrc.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/zh_msra
/nfs/shuhe/gpt3-ner/gpt3-data/zh_msra/test.embedding.knn.jsonl
/nfs/shuhe/gpt3-ner/gpt3-data/ace2004/
/nfs/shuhe/gpt3-ner/gpt3-data/ace2005/
/nfs/shuhe/gpt3-ner/gpt3-data/genia/
/nfs/shuhe/gpt3-ner/models/text2vec-base-chinese
/home/wangshuhe/gpt-ner/openai_access/low_resource_data/conll_en
/home/wangshuhe/gpt-ner/openai_access/low_resource_data/conll_en/test.8.embedding.knn.jsonl
Where are these folders and files supposed to be downloaded from or obtained, and where should each of them be placed?
Training on my own MRC-format dataset fails with this error:

Traceback (most recent call last):
  File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 430, in <module>
    main()
  File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/train/mrc_ner_trainer.py", line 417, in main
    trainer.fit(model)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 479, in run_training_epoch
    enumerate(_with_is_last(train_dataloader)), "get_train_batch"
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py", line 78, in profile_iterable
    value = next(iterator)
  File "/home/amax/py36env/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 1323, in _with_is_last
    for val in it:
  File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/amax/py36env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in __getitem__
    new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
  File "/home/amax/work/gjy/mrc-for-flat-nested-ner-master/datasets/mrc_ner_dataset.py", line 96, in <listcomp>
    new_end_positions = [origin_offset2token_idx_end[end] for end in end_positions]
KeyError: 46
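The KeyError means the annotated end offset 46 has no entry in origin_offset2token_idx_end, i.e. that character offset is not the end of any wordpiece token, which is the same mixed-text misalignment discussed earlier in this thread. A sketch of a pre-training check; field names follow the repo's mrc-ner json, but whether your end offsets are inclusive or exclusive may differ, so adjust the off-by-one yourself, and the tokenizer name and path are placeholders:

import json
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

def find_misaligned(path):
    # report every span whose end offset falls inside a wordpiece
    for item in json.load(open(path, encoding="utf-8")):
        offsets = tokenizer(item["context"], add_special_tokens=False,
                            return_offsets_mapping=True)["offset_mapping"]
        token_ends = {e for _, e in offsets}  # valid (exclusive) end offsets
        for end in item["end_position"]:
            if end not in token_ends:
                print("{}: end offset {} is inside a wordpiece".format(
                    item.get("qas_id", "?"), end))

find_misaligned("mrc-ner.train")  # placeholder path to your training json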