OpenBioML / protein-lm-scaling

fill-in-the-middle collator #32

Closed csjackson0 closed 9 months ago

csjackson0 commented 10 months ago

The FIM collator subclasses DataCollatorForLanguageModeling and applies the fill-in-the-middle (FIM) transform on the fly, as each batch is collated.

jamaliki commented 10 months ago

Hey Corey, thanks for this! One thing I don't understand is how this will work with batches. I was proposing to do this on the dataset side, but here you are doing it on the collator side, which means the sequences are already batched and padded. As you can probably see, that makes picking which indices to mask, etc., pretty difficult. What are your thoughts?

csjackson0 commented 10 months ago

Hi Kiarash, thank you for the feedback! The collator iterates through each sequence in a batch and applies the FIM transform to it individually. This would allow each batch to contain a mixture of left-to-right and FIM examples.
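For illustration, the per-sequence loop could look roughly like the sketch below. This is not the code from the PR: the sentinel strings, the function names, the `fim_rate` parameter, and the prefix-suffix-middle ordering are all assumptions, and the sketch works on raw amino-acid strings before tokenization and padding rather than on tensors.

```python
import random

# Hypothetical sentinel strings; the tokens used in the PR are not shown in this thread.
PREFIX, MIDDLE, SUFFIX = "<PRE>", "<MID>", "<SUF>"


def fim_transform(aa_string, rng):
    """Rearrange one sequence into prefix-suffix-middle order (assumed layout)."""
    # Two distinct cut points, sorted, so the middle span is never empty.
    start, end = sorted(rng.sample(range(len(aa_string) + 1), 2))
    prefix = aa_string[:start]
    middle = aa_string[start:end]
    suffix = aa_string[end:]
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}{middle}"


def apply_fim_to_batch(sequences, fim_rate=0.5, seed=None):
    """Apply the FIM transform to each sequence independently.

    Each sequence is transformed with probability `fim_rate`; the rest stay
    left-to-right, so a single batch can mix both kinds of example.
    """
    rng = random.Random(seed)
    out = []
    for seq in sequences:
        if len(seq) >= 1 and rng.random() < fim_rate:
            out.append(fim_transform(seq, rng))
        else:
            out.append(seq)  # left-to-right example, unchanged
    return out
```

Because each sequence is handled on its own, the span indices are chosen per example and never have to account for padding added later.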

Should span_start be at least a few residues away from the start of the aa_string? I believe currently the middle_span can start at position 0 in the aa_string.
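One way to keep the span away from the start is to sample span_start from a range that begins at a minimum offset. The sketch below assumes a hypothetical `sample_middle_span` helper with `min_offset` and `min_span` parameters; neither the name nor the default of 3 residues comes from the PR.

```python
import random


def sample_middle_span(length, min_offset=3, min_span=1, rng=random):
    """Return (span_start, span_end) with span_start >= min_offset.

    Returns None when the sequence is too short to fit the offset plus a
    span of at least `min_span` residues.
    """
    if length < min_offset + min_span:
        return None  # caller can fall back to a left-to-right example
    # randint is inclusive on both ends.
    span_start = rng.randint(min_offset, length - min_span)
    span_end = rng.randint(span_start + min_span, length)
    return span_start, span_end
```

With `min_offset=0` this reduces to the current behavior, where the middle span may begin at position 0.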

pascalnotin commented 9 months ago

Merged the PR into main - thank you @csjackson0 & @jamaliki !