How to extract the candidate answer of BERT features?

jayzhu02 commented 2 years ago

Hi, I want to extract the candidate's answer features from the qas_feats. I saw the format of the sentence feature: (1): [CLS] question [SEP] option_0 [SEP] and I tried to extract it by the length of the question and the candidate's answers. But I found that the length of BERT features sometimes is not corresponding to the length of original text.

Example(in val.csv):

Question: why did the boy pick up one present from the group of them and move to the sofa
a0:share with the girl  
a1:approach lady sitting there  
a2: unwrap it  
a3: playing with toy train 
a4: gesture something

The Length of qas should be [25, 25, 23, 25, 23], but actual length is [25, 28, 25, 25, 24]. So is there any way to extract it or How does the tokenizer works?

Thanks for your excellent work!

doc-doc commented 2 years ago

Hi, you should use the related bert tokenizer (refer to pytorch-pretrained-BERT ) to get the actual tokens

from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
txt = 'test tokenizer'
tokens = tokenizer.tokenize(txt)
print(tokens)

jayzhu02 commented 2 years ago

I'd used the BertTOkenizer to get the token but the length of tokens and bert_features are not the same length.

# This is the second QA pair in val.csv (qid '4882821564_1')
from pytorch_pretrained_bert.tokenization import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = '[CLS] why did the boy pick up one present from the group of them and move to the sofa [SEP] '
a = ["share with the girl [SEP]",
     "approach lady sitting there [SEP]",
     "unwrap it [SEP]",
     "playing with toy train [SEP]",
     "gesture something [SEP]"]

q_token = tokenizer.tokenize(question)

for answer in a:
    a_token = tokenizer.tokenize(answer)
    qas_token = q_token + a_token
    print(qas_token, len(qas_token))

The output is:

['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'share', 'with', 'the', 'girl', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'approach', 'lady', 'sitting', 'there', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'un', '##wr', '##ap', 'it', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'playing', 'with', 'toy', 'train', '[SEP]'] 25
['[CLS]', 'why', 'did', 'the', 'boy', 'pick', 'up', 'one', 'present', 'from', 'the', 'group', 'of', 'them', 'and', 'move', 'to', 'the', 'sofa', '[SEP]', 'gesture', 'something', '[SEP]'] 23

But the length of the bert_feats of this QA pair from _bert_ftval.h5 is [25, 28, 25, 25, 24], which means the a1 and a4 candidate qas aren't matched.

doc-doc commented 2 years ago

In that case, I don't know the reason. Perhaps there is space or other stop words but was removed by me. You can finetune it yourself. The code for finetuning is given, and you can get it by searching for my reply to other issues.

jayzhu02 commented 2 years ago

Hi, I checked some other data and it seems that just a few of them have this problem. Thanks again!

doc-doc / NExT-QA

How to extract the candidate answer of BERT features? #13