TPlinker-joint-extraction


TPLinkerPlus: find cuda error at training stage when add special tokens for bert #47

Closed jarork closed 3 years ago

jarork commented 3 years ago

Hi author, I'm trying to add some special tokens to the BERT tokenizer. It works fine with DataBuilder, but a CUDA error occurs at the training stage. I've modified the BERT tokenizer like this:


```python
if config["encoder"] == "BERT":
    tokenizer = BertTokenizerFast.from_pretrained(config["bert_path"],
                                                  add_special_tokens=True,
                                                  do_lower_case=False)

# The special tokens are added here.....
tokenizer.add_special_tokens({'additional_special_tokens': [
    "mm³", "cm³", "mm²", "cm²", "mm3", "cm3", "mm2", "cm2",
    "cm", "mm", "ml", "CM", "MM", "ML", "x", "*",
    "Hu", "hu", "HU", "Se", "se", "SE",
    "Image", "image", "IMAGE", "Im", "im", "IM"
]})
data_maker = DataMaker4Bert(tokenizer, handshaking_tagger)
```
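For context on why this setup later crashes: `add_special_tokens` grows the tokenizer's vocabulary, so the new tokens receive ids at or beyond the end of the model's original embedding table. A minimal pure-Python sketch of that mismatch (the vocabulary and sizes here are illustrative, not taken from the issue):

```python
# Toy model of the mismatch: after adding 2 tokens the tokenizer knows
# 8 ids, but the embedding table still has only 6 rows.
original_vocab_size = 6
embedding_table = [[0.0] * 4 for _ in range(original_vocab_size)]  # 6 rows

tokenizer_vocab = ["[PAD]", "[CLS]", "[SEP]", "the", "cm", "mm"]
tokenizer_vocab += ["mm³", "cm³"]            # added special tokens -> ids 6, 7

new_token_id = tokenizer_vocab.index("mm³")  # 6, past the table's last row
try:
    row = embedding_table[new_token_id]      # analogous to BERT's embedding lookup
except IndexError as exc:
    print("lookup failed:", exc)             # the CPU run surfaces this as IndexError
```

On the GPU the same out-of-range lookup trips the `srcIndex < srcSelectDimSize` device-side assertion instead of raising a Python exception.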

Here goes the CUDA error info:

```
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [127,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [127,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
........
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [73,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [73,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
```

jarork commented 3 years ago

When I switch my device from GPU to CPU, it appears:

```
Exception has occurred: IndexError
index out of range in self
  File "/nlp_data/hyy/tplinker/tplinker_plus/tplinker_plus.py", line 428, in forward
    context_outputs = self.encoder(input_ids, attention_mask, token_type_ids)
  File "/nlp_data/hyy/tplinker/tplinker_plus/evaluation.py", line 357, in predict
    batch_token_type_ids,
  File "/nlp_data/hyy/tplinker/tplinker_plus/evaluation.py", line 504, in
    pred_sample_list = predict(short_data, ori_test_data)
```

jarork commented 3 years ago

It seems the input dimension changed after I added those special tokens. Do you have any idea where I should modify the model so it can use the special tokens?

Thanks
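A minimal sketch of the usual remedy for this class of error (not stated in this thread): in Hugging Face Transformers, after enlarging the tokenizer you resize the model's embedding matrix with `model.resize_token_embeddings(len(tokenizer))`, which appends newly initialized rows for the new ids. The sketch below models that resize step with plain lists; the helper name and sizes are illustrative:

```python
import random

def resize_token_embeddings(table, new_size, dim):
    """Append freshly initialized rows until the table covers new_size ids,
    mimicking the effect of model.resize_token_embeddings(len(tokenizer))."""
    while len(table) < new_size:
        table.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return table

embedding_table = [[0.0] * 4 for _ in range(6)]   # original 6-row table
resize_token_embeddings(embedding_table, 8, 4)    # tokenizer now has 8 ids

print(len(embedding_table))                       # lookups of ids 6 and 7 now succeed
```

In real code this must happen once, after `add_special_tokens` and before training, so that saved checkpoints carry the enlarged embedding matrix.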

jarork commented 3 years ago

The problem has been solved by adding those customized tokens to the BERT vocab.txt.
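For reference, a likely explanation of why editing vocab.txt works (the vocabulary below is illustrative): stock BERT vocab files reserve `[unused…]` placeholder lines, and overwriting those lines with custom tokens keeps the vocabulary size, and hence the embedding table size, unchanged, so no model resize is needed:

```python
# Illustrative vocab.txt contents; real BERT vocab files reserve ~1000 [unused] slots.
vocab = ["[PAD]", "[unused1]", "[unused2]", "[CLS]", "[SEP]", "the"]
original_size = len(vocab)

# Overwrite placeholder slots with the custom tokens instead of appending new lines.
vocab[vocab.index("[unused1]")] = "mm³"
vocab[vocab.index("[unused2]")] = "cm³"

assert len(vocab) == original_size        # vocab size unchanged
print(vocab.index("mm³"))                 # an id the embedding table already covers
```

The new tokens reuse ids the pretrained embedding matrix already has rows for, which is why the index-out-of-range assertion disappears.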