arkilpatel / SVAMP

NAACL 2021: Are NLP Models really able to Solve Simple Math Word Problems?
MIT License

the issue for the code in gts #8

Open dpbnnp opened 2 years ago

dpbnnp commented 2 years ago

I ran the code in "gts", but I think there is a problem with the RoBERTa encoder in "contextual_embeddings.py":

```python
class RobertaEncoder(nn.Module):
    def __init__(self, roberta_model = 'roberta-base', device = 'cuda:0', freeze_roberta = False):
        super(RobertaEncoder, self).__init__()
        self.roberta_layer = RobertaModel.from_pretrained(roberta_model)
        self.roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_model)
        self.device = device

        if freeze_roberta:
            for p in self.roberta_layer.parameters():
                p.requires_grad = False

    def robertify_input(self, sentences):
        '''
        Preprocess the input sentences using roberta tokenizer and converts them to a torch tensor containing token ids
        '''
        # Tokenize the input sentences for feeding into RoBERTa
        all_tokens  = [['<s>'] + self.roberta_tokenizer.tokenize(sentence) + ['</s>'] for sentence in sentences]

        # Pad all the sentences to a maximum length
        input_lengths = [len(tokens) for tokens in all_tokens]
        max_length    = max(input_lengths)
        padded_tokens = [tokens + ['<pad>' for _ in range(max_length - len(tokens))] for tokens in all_tokens]

        # Convert tokens to token ids
        token_ids = torch.tensor([self.roberta_tokenizer.convert_tokens_to_ids(tokens) for tokens in padded_tokens]).to(self.device)

        # Obtain attention masks
        pad_token = self.roberta_tokenizer.convert_tokens_to_ids('<pad>')
        attn_masks = (token_ids != pad_token).long()

        return token_ids, attn_masks, input_lengths

    def forward(self, sentences):
        '''
        Feed the batch of sentences to a RoBERTa encoder to obtain contextualized representations of each token
        '''
        # Preprocess sentences
        token_ids, attn_masks, input_lengths = self.robertify_input(sentences)

        # Feed through RoBERTa
        cont_reps, _ = self.roberta_layer(token_ids, attention_mask = attn_masks)

        return cont_reps, input_lengths
```

In order to use RoBERTa to initialize the weights, we have to tokenize the math word problem again with the RoBERTa tokenizer. When the sentence is re-tokenized, the number positions change as well, so we need to re-index them at the same time; otherwise we would get the wrong number embeddings. But I can't find this step anywhere in your code. Is there a part that I have missed?
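To illustrate the issue, here is a small standalone example (not from the repository; the sentence and positions are made up) showing how word-level number positions drift once the sentence goes through the same tokenization that `robertify_input` performs:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Hypothetical preprocessed MWP: numbers replaced by placeholders, split on whitespace
sentence = "john had number0 apples and bought number1 more"
words = sentence.split()
num_pos = [i for i, w in enumerate(words) if w.startswith('number')]  # word-level: [2, 6]

# The same tokenization robertify_input performs ('<s>' prepended, '</s>' appended)
tokens = ['<s>'] + tokenizer.tokenize(sentence) + ['</s>']
print(tokens)
# The BPE tokenizer may split 'number0' into several sub-tokens
# (e.g. 'Ġnumber', '0'), so the word-level indices [2, 6] no longer
# point at the number tokens in this sequence.
```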

the-jb commented 2 years ago

I also agree with this. The num_pos should be recalculated after the sentence passes through the robertify_input tokenization.
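A minimal sketch of one way to do that recalculation (the helper `remap_num_pos` below is hypothetical, not part of this repository), assuming the original num_pos are indices into the whitespace-split sentence and that `robertify_input` prepends a single `<s>` token:

```python
def remap_num_pos(sentence, num_pos, tokenizer):
    '''
    Map word-level number positions to the index of each number's first
    sub-token in the sequence built by robertify_input.
    '''
    words = sentence.split()
    first_subtoken_idx = []
    tok_idx = 1  # position 0 is '<s>'
    for i, word in enumerate(words):
        first_subtoken_idx.append(tok_idx)
        # RoBERTa's BPE is whitespace-sensitive: include the leading space
        # for every word except the first so the sub-token counts match
        # what tokenize() produces on the full sentence.
        piece = word if i == 0 else ' ' + word
        tok_idx += len(tokenizer.tokenize(piece))
    return [first_subtoken_idx[p] for p in num_pos]
```

With the example sentence above, `remap_num_pos(sentence, [2, 6], tokenizer)` would return the positions of the numbers' first sub-tokens, which could then replace the word-level num_pos before gathering the number embeddings from `cont_reps`.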