Closed jayzhu02 closed 2 years ago
Hi, you should use the corresponding BERT tokenizer (see pytorch-pretrained-BERT) to get the actual tokens:
from pytorch_pretrained_bert.tokenization import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
txt = 'test tokenizer'
tokens = tokenizer.tokenize(txt)
print(tokens)
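For context, bert-base-uncased uses WordPiece tokenization, which greedily matches the longest vocabulary entry first and marks word-internal pieces with "##". A minimal sketch of that matching (with a toy vocabulary, not the real 30k-entry BERT vocab):

```python
# Greedy longest-match-first WordPiece tokenization of a single word.
# The toy vocabulary below is illustrative only.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # word-internal pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches -> unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##wr", "##ap", "token", "##izer"}
print(wordpiece("unwrap", toy_vocab))     # ['un', '##wr', '##ap']
print(wordpiece("tokenizer", toy_vocab))  # ['token', '##izer']
```

This is why a single out-of-vocabulary word such as "unwrap" can expand into several tokens, so the token count need not equal the whitespace word count.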
I used the BertTokenizer to get the tokens, but the tokens and the bert_features do not have the same length.
# This is the second QA pair in val.csv (qid '4882821564_1')
from pytorch_pretrained_bert.tokenization import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = '[CLS] why did the boy pick up one present from the group of them and move to the sofa [SEP] '
a = ["share with the girl [SEP]",
"approach lady sitting there [SEP]",
"unwrap it [SEP]",
"playing with toy train [SEP]",
"gesture something [SEP]"]
q_token = tokenizer.tokenize(question)
for answer in a:
    a_token = tokenizer.tokenize(answer)
    qas_token = q_token + a_token
    print(qas_token, len(qas_token))
The output is:
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'share', 'with', 'the', 'girl', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'approach', 'lady', 'sitting', 'there', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'un', '##wr', '##ap', 'it', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'playing', 'with', 'toy', 'train', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'gesture', 'something', '[SEP]'] 23
But the lengths of the bert_feats for this QA pair from _bert_ftval.h5 are [25, 28, 25, 25, 24], which means the a1 and a4 candidate QAs don't match.
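Since WordPiece can split one word into several pieces, keeping a word-level view of the features requires mapping each word to its span of subword rows. A minimal sketch, where tokenize_word is a hypothetical stand-in for tokenizer.tokenize applied to one word:

```python
# Map each whitespace-delimited word to the (start, end) span of subword
# feature rows it occupies. `tokenize_word` is a stand-in for calling
# tokenizer.tokenize on a single word.
def word_to_feature_spans(words, tokenize_word):
    spans, pos = [], 0
    for w in words:
        n = len(tokenize_word(w))   # number of subword pieces for this word
        spans.append((pos, pos + n))
        pos += n
    return spans

# Toy tokenizer: splits 'unwrap' into three pieces, leaves other words whole.
toy = {"unwrap": ["un", "##wr", "##ap"]}
tok = lambda w: toy.get(w, [w])
print(word_to_feature_spans(["[SEP]", "unwrap", "it", "[SEP]"], tok))
# [(0, 1), (1, 4), (4, 5), (5, 6)]
```

With such spans, the subword feature rows for a word can be averaged or otherwise pooled back to one vector per word.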
In that case, I don't know the reason. Perhaps there were spaces or other stop words that I removed. You can fine-tune it yourself; the fine-tuning code is provided, and you can find it by searching my replies to other issues.
Hi, I checked some other data and it seems that only a few examples have this problem. Thanks again!
Hi, I want to extract the candidate answer features from the qas_feats. I saw the format of the sentence feature:
(1): [CLS] question [SEP] option_0 [SEP]
and I tried to extract them using the lengths of the question and the candidate answers. But I found that the length of the BERT features sometimes does not correspond to the length of the original text. Example (in val.csv):
The lengths of the qas should be [25, 25, 23, 25, 23], but the actual lengths are [25, 28, 25, 25, 24]. Is there any way to extract them, or how does the tokenizer work?
Thanks for your excellent work!
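For reference, assuming each row of a qas feature matrix corresponds one-to-one to a subword token of "[CLS] question [SEP] answer [SEP]", the answer rows can be sliced off by the question's token count. A minimal sketch, with toy row ids standing in for real BERT vectors:

```python
# Split a concatenated question+answer feature matrix at the question's
# subword-token count. `qas_feats` here is a toy list of row ids; in
# practice it would be the [num_tokens, hidden_size] array from the h5 file.
def split_qas_features(qas_feats, n_question_tokens):
    q_feats = qas_feats[:n_question_tokens]
    a_feats = qas_feats[n_question_tokens:]
    return q_feats, a_feats

q_tokens = ["[CLS]", "why", "move", "[SEP]"]        # 4 question tokens
a_tokens = ["un", "##wr", "##ap", "it", "[SEP]"]    # 5 answer tokens
feats = list(range(len(q_tokens) + len(a_tokens)))  # toy "feature" rows
q_feats, a_feats = split_qas_features(feats, len(q_tokens))
print(a_feats)  # [4, 5, 6, 7, 8]
```

The key point is that both lengths must be computed with the same tokenizer that produced the features; counting whitespace words instead of subword tokens is exactly what causes the off-by-n mismatches above.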