allenai / PRIMER

The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Apache License 2.0

Pretraining-Mask sentences #23

Open Ronica1234 opened 1 year ago

Ronica1234 commented 1 year ago

Hello, in the PRIMERA pretraining process, the model chooses 30% of the sentences with the pyramid method, and then 50% of those candidates (15% of all sentences) are masked, while all 30% are kept as the target. May I know why the target is not limited to the 15% masked sentences, i.e. why the unmasked candidates are also kept in the target?

Here are the lines that append the selected sentences to the target:

    for i_d in range(len(truncated_doc)):
        for i_s in range(len(truncated_doc[i_d])):
            if cur_idx in mask_indices:
                tgt.append(truncated_doc[i_d][i_s])

And here is the line that chooses 50% of the candidates (the candidates being 30% of the sentences) for masking:

            if cur_idx not in non_mask_indices:
                truncated_doc[i_d][i_s] = '<mask>'  # tokenizer.mask_token
        cur_idx += 1
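
For reference, here is a minimal, self-contained sketch of how I read the selection / masking / target split (not the repository's actual code; build_pretraining_example, salience_score, and the trivial scorer in the usage example are hypothetical stand-ins for the Entity Pyramid scoring and the constants used in the preprocessing script):

    import random

    def build_pretraining_example(docs, salience_score, select_ratio=0.3, mask_ratio=0.5):
        """Sketch of the selection/masking/target split discussed above.

        docs: list of documents, each a list of sentence strings.
        salience_score: hypothetical callable scoring a single sentence string
            (a stand-in for the Entity Pyramid ranking used by PRIMERA).
        """
        # Flatten (doc index, sentence index) pairs across the cluster.
        flat = [(i_d, i_s) for i_d, doc in enumerate(docs) for i_s in range(len(doc))]

        # Pick the top `select_ratio` of sentences as candidates (the "30%").
        n_select = max(1, int(len(flat) * select_ratio))
        candidates = sorted(flat, key=lambda ix: salience_score(docs[ix[0]][ix[1]]),
                            reverse=True)[:n_select]

        # Half of the candidates stay visible in the input (like non_mask_indices);
        # the other half is replaced by <mask> in the input (the "15%").
        n_keep = int(len(candidates) * (1 - mask_ratio))
        non_mask = set(random.sample(candidates, n_keep))

        src = [list(doc) for doc in docs]
        tgt = []
        for i_d, i_s in sorted(candidates):
            tgt.append(docs[i_d][i_s])       # every candidate is kept in the target
            if (i_d, i_s) not in non_mask:
                src[i_d][i_s] = "<mask>"     # only the masked half is hidden in the input
        return src, tgt

    # Purely illustrative usage with a trivial length-based scorer.
    if __name__ == "__main__":
        docs = [["First doc sentence one.", "First doc sentence two."],
                ["Second doc sentence one.", "Second doc sentence two."]]
        src, tgt = build_pretraining_example(docs, salience_score=len)
        print(src)
        print(tgt)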