KHAKhazeus / BiBL

The official code release of the paper "BiBL: AMR Parsing and Generation with Bidirectional Bayesian Learning".

Masking bugs in the collate function. #1

Closed. Carlosml26 closed this issue 4 weeks ago.

Carlosml26 commented 2 years ago

Hi!

First of all, congrats on your paper being accepted at COLING. I would like to ask you about some parts of the code that I do not understand. They relate to tokenizer.py, specifically the batch_encode_mask method. This method is supposed to prepare the batch by masking part of the input sentence and graph; however, right now it is not masking anything.

First, the mask_collate_padded variable (which is the one used as the input_ids of the batch) is built from fil_collate_ids instead of mask_fil_collate_ids, which is the one that contains the masked inputs:

for index, collate in enumerate(collate_ids):
    if len(collate) <= max_lm_len:
        maxlen = max((len(collate)), maxlen)
        fil_collate_ids.append(collate_ids[index])
        fil_rev_collate_ids.append(rev_collate_ids[index])
        mask_fil_sents_ids.append(mask_sents_ids[index])
        mask_fil_graph_ids.append(mask_graph_ids[index])
        mask_fil_collate_ids.append(mask_collate_ids[index])
        mask_fil_rev_collate_ids.append(mask_rev_collate_ids[index])
collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_collate_ids]
rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_rev_collate_ids]
mask_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_collate_ids]
mask_rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_rev_collate_ids]
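For reference, this is roughly the change I would expect, reusing the variable names from the snippet above (just a sketch, not tested against your training pipeline):

# pad the *masked* filtered ids so that the masked inputs actually reach the model
mask_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in mask_fil_collate_ids]
mask_rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in mask_fil_rev_collate_ids]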

Therefore, right now the model is training without masking anything in the input. The second issue concerns the masking of the sentences in the same method: there is no masking strategy for them (only the graphs are masked). In the code, the variable padded_mask_x is created in the same way as padded_x, without masking:

padded_x, _ = self.batch_encode_sentences(sentences, device=device)
# batch = super().batch_encode_plus(sentences, return_tensors='pt', pad_to_max_length=True)
for index, sent in enumerate(sentences):
    words = sent.split(" ")
    mask_num = math.floor(amr_ratio * len(words))
    rand_ind = random.sample(range(0, len(words)), mask_num)
    for ind in rand_ind:
        words[ind] = "<mask>"
    sentences[index] = " ".join(words)
padded_mask_x, _ = self.batch_encode_sentences(sentences, device=device)

I raise this second issue because the paper explicitly states that the sentence is also masked.
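Just to make the expected behaviour concrete, here is a self-contained toy version of that word-level masking; mask_words and mask_ratio are illustrative names of mine, not identifiers from the repository:

import math
import random

def mask_words(sentence, mask_ratio, mask_token="<mask>"):
    # Replace a random subset of whitespace-separated words with the mask token,
    # mirroring the amr_ratio-based masking applied to the graphs.
    words = sentence.split(" ")
    mask_num = math.floor(mask_ratio * len(words))
    for ind in random.sample(range(len(words)), mask_num):
        words[ind] = mask_token
    return " ".join(words)

# e.g. mask_words("the boy wants the girl to believe him", 0.25)
# might return "the boy wants <mask> girl to believe <mask>"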

Best,

Carlos

KHAKhazeus commented 2 years ago

Thanks for your kind reminder. After checking our code versions, it turns out this part was corrected in a later version. I will fix this part manually in this repository and release a newer version with trained checkpoints ASAP.