KHAKhazeus / BiBL

The official code release of the paper "BiBL: AMR Parsing and Generation with Bidirectional Bayesian Learning".

Masking bugs in the collate function. #1

Closed. Carlosml26 closed this issue 4 weeks ago.

Carlosml26 commented 2 years ago

Hi!

First of all, congrats on your paper being accepted at COLING. I would like to ask you about some parts of the code that I do not understand. They relate to tokenizer.py, specifically the batch_encode_mask method. This method is supposed to prepare the batch by masking part of the input sentence and graph; however, right now it is not masking anything.

First, the mask_collate_padded variable (which is the one used as the input_ids of the batch) is built from fil_collate_ids instead of mask_fil_collate_ids, which is the one that contains the masked inputs:

for index, collate in enumerate(collate_ids):
    if len(collate) <= max_lm_len:
        maxlen = max((len(collate)), maxlen)
        fil_collate_ids.append(collate_ids[index])
        fil_rev_collate_ids.append(rev_collate_ids[index])
        mask_fil_sents_ids.append(mask_sents_ids[index])
        mask_fil_graph_ids.append(mask_graph_ids[index])
        mask_fil_collate_ids.append(mask_collate_ids[index])
        mask_fil_rev_collate_ids.append(mask_rev_collate_ids[index])
collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_collate_ids]
rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_rev_collate_ids]
mask_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_collate_ids]
mask_rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in fil_rev_collate_ids]
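For reference, this is roughly the change I would expect, reusing the variable names from the snippet above (just a sketch, not tested against your training pipeline):

# pad the *masked* filtered ids so that the masked inputs actually reach the model
mask_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in mask_fil_collate_ids]
mask_rev_collate_padded = [x + [self.pad_token_id] * (maxlen - len(x)) for x in mask_fil_rev_collate_ids]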

Therefore, right now the model is training without masking anything in the input. The second issue concerns the masking of the sentences in the same method: there is no masking strategy for them (only the graphs are masked). In the code, the variable padded_mask_x is created in the same way as padded_x, without masking:

padded_x, _ = self.batch_encode_sentences(sentences, device=device)
# batch = super().batch_encode_plus(sentences, return_tensors='pt', pad_to_max_length=True)
for index, sent in enumerate(sentences):
    words = sent.split(" ")
    mask_num = math.floor(amr_ratio * len(words))
    rand_ind = random.sample(range(0, len(words)), mask_num)
    for ind in rand_ind:
        words[ind] = "<mask>"
    sentences[index] = " ".join(words)
padded_mask_x, _ = self.batch_encode_sentences(sentences, device=device)

I raise this second issue because the paper explicitly states that the sentence is also masked.
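Just to make the expected behaviour concrete, here is a self-contained toy version of that word-level masking; mask_words and mask_ratio are illustrative names of mine, not identifiers from the repository:

import math
import random

def mask_words(sentence, mask_ratio, mask_token="<mask>"):
    # Replace a random subset of whitespace-separated words with the mask token,
    # mirroring the amr_ratio-based masking applied to the graphs.
    words = sentence.split(" ")
    mask_num = math.floor(mask_ratio * len(words))
    for ind in random.sample(range(len(words)), mask_num):
        words[ind] = mask_token
    return " ".join(words)

# e.g. mask_words("the boy wants the girl to believe him", 0.25)
# might return "the boy wants <mask> girl to believe <mask>"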

Best,

Carlos

KHAKhazeus commented 2 years ago

Thanks for your kind reminder. After checking our code versions, it turns out this part was corrected in a later version. I will fix this part manually in this repository and release a newer version with trained checkpoints ASAP.