csjackson0 closed 9 months ago
Hey Corey, thanks for this! One thing I don't understand is how this is going to work with batches. I was proposing to do this on the dataset side, but here you are doing it on the collator side, which means things are already batched and padded. As you can probably see, that makes picking which indices to mask, etc., pretty difficult. What are your thoughts?
Hi Kiarash, thank you for the feedback! The collator iterates through each sequence in a batch and applies the FIM transform. This allows a mixture of left-to-right and FIM examples within each batch.
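For illustration, here is a rough sketch of the per-sequence idea, not the PR's actual code. The names `fim_transform`, `collate`, and `fim_rate`, the sentinel strings, and the prefix/suffix/middle ordering are all assumptions made for the example. The decision is made per sequence on the raw strings, before any padding happens.

```python
import random

# Hypothetical sentinel tokens; the PR's real tokens may differ.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(seq, rng):
    """Move a random middle span to the end: prefix, suffix, then middle."""
    if not seq:
        return seq  # nothing to mask in an empty sequence
    start = rng.randrange(len(seq))               # span start in [0, len - 1]
    end = rng.randrange(start + 1, len(seq) + 1)  # span end in [start + 1, len]
    prefix, middle, suffix = seq[:start], seq[start:end], seq[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def collate(batch, fim_rate=0.5, seed=None):
    """Apply FIM to each sequence independently with probability fim_rate."""
    rng = random.Random(seed)
    return [fim_transform(s, rng) if rng.random() < fim_rate else s
            for s in batch]
```

With `fim_rate=0.5`, roughly half of the sequences in a batch stay left-to-right and the rest are rearranged; tokenization and padding would follow this step.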
Should span_start be at least a few residues away from the start of the aa_string? I believe currently the middle_span can start at position 0 in the aa_string.
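One way to enforce that, as a minimal sketch: clamp the lower bound when sampling the span. `sample_span` and `min_offset` are hypothetical names, and the default of 3 residues is an arbitrary choice for the example, not a value taken from the PR.

```python
import random

def sample_span(length, min_offset=3, rng=random):
    """Sample (span_start, span_end) with span_start >= min_offset.

    Returns None when the sequence is too short to leave min_offset
    residues before the span; the caller could then fall back to plain
    left-to-right for that sequence.
    """
    if length <= min_offset:
        return None
    # randint is inclusive on both ends.
    span_start = rng.randint(min_offset, length - 1)
    span_end = rng.randint(span_start + 1, length)
    return span_start, span_end
```

Slicing with `aa_string[span_start:span_end]` then always leaves at least `min_offset` residues of prefix. The same clamp could be applied to the end of the sequence if an empty suffix is also undesirable.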
Merged the PR into main - thank you @csjackson0 & @jamaliki!
The FIM collator subclasses DataCollatorForLanguageModeling and applies the FIM transform to each sequence on the fly, at collation time.