Extend data collator to support fill-in-the-middle modeling on the fly

OpenBioML / protein-lm-scaling


Extend data collator to support fill-in-the-middle modeling on the fly #24

Closed jeffreyruffolo closed 9 months ago

jeffreyruffolo commented 10 months ago

Fill-in-the-middle (FIM) is described in this paper from OpenAI, as well as other prior work: https://arxiv.org/pdf/2207.14255.pdf

Ideally, our data collator could take some FIM frequency so that we can tune the rate of FIM vs left-to-right training examples.
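A minimal sketch of what such a frequency knob could look like, assuming the collator operates on plain sequence strings before tokenization. The names `mix_fim`, `fim_transform`, and `fim_rate` are hypothetical and not taken from the repository; the FIM rewrite itself is passed in as a callable so the mixing logic stays independent of any token choice.

```python
import random
from typing import Callable, List, Optional


def mix_fim(
    sequences: List[str],
    fim_transform: Callable[[str], str],
    fim_rate: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Apply `fim_transform` to each sequence with probability `fim_rate`.

    Sequences that are not selected are returned unchanged, i.e. they
    stay ordinary left-to-right training examples.
    """
    if not 0.0 <= fim_rate <= 1.0:
        raise ValueError("fim_rate must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    # rng.random() is in [0, 1), so fim_rate=0 never fires and fim_rate=1 always does.
    return [fim_transform(s) if rng.random() < fim_rate else s for s in sequences]
```

Drawing the decision per example at collation time, rather than once per dataset, means the same sequence can appear in both forms across epochs.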

jamaliki commented 10 months ago

Can I take this one?

jeffreyruffolo commented 10 months ago

Sure! One key decision here is which tokens we should use to denote the start and end of the region that is moved to the end of the sequence for FIM modeling. Any special tokens should work, but it might be nice to maintain consistency with the “end” token used for standard decoding and introduce new tokens only for the position of the moved region and its end.

jamaliki commented 10 months ago

Yeah, we can just add tokens. Do you have any preferences from IgLM?

jeffreyruffolo commented 10 months ago

No strong preferences. Something like [SPAN], [EOS]/2, [SEP] for the masked region, the true end of sequence (i.e., the C-terminus), and the end of the span, respectively, could work and be interpretable.

jamaliki commented 10 months ago

How about the following?

[BOS]GLEAVNKDKPLGAVALKSYEEEL[MASKED_SPAN]NAQKGEIMPNIPQMSAFWYAVRTAVIN[EOS][START_SPAN]AKDPRIAATME[END_SPAN]

jeffreyruffolo commented 10 months ago

I think you could skip [START_SPAN], since it would always follow an [EOS]. Otherwise looks good!
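As a rough illustration of the layout with [START_SPAN] dropped, the rewrite could be sketched at the string level as below. The function name `to_fim` and the choice of character offsets for the span are assumptions made for this example; how the span is actually sampled and tokenized is not specified in this thread.

```python
def to_fim(seq: str, start: int, end: int) -> str:
    """Move seq[start:end] to the end of the example.

    Layout: [BOS] prefix [MASKED_SPAN] suffix [EOS] span [END_SPAN]
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("need 0 <= start < end <= len(seq)")
    prefix, span, suffix = seq[:start], seq[start:end], seq[end:]
    return f"[BOS]{prefix}[MASKED_SPAN]{suffix}[EOS]{span}[END_SPAN]"


# The example from this thread: the span AKDPRIAATME is moved after [EOS].
prefix = "GLEAVNKDKPLGAVALKSYEEEL"
span = "AKDPRIAATME"
suffix = "NAQKGEIMPNIPQMSAFWYAVRTAVIN"
example = to_fim(prefix + span + suffix, len(prefix), len(prefix) + len(span))
```

Since the span always begins right after [EOS], the model can tell where the infilled region starts without an extra token, and [END_SPAN] marks where generation should stop.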

jamaliki commented 10 months ago

What about the cases where we aren't doing fill-in-the-middle?

jeffreyruffolo commented 10 months ago

Those cases would fall back to:

[BOS]GLEAVNKDKPLGAVALKSYEEELAKDPRIAATMENAQKGEIMPNIPQMSAFWYAVRTAVIN[EOS]
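One consequence of the two layouts is that either form can be mapped back to the original sequence. A minimal string-level sketch, assuming the token spellings proposed earlier in this thread; `restore_sequence` is a hypothetical helper, and it tolerates an optional [START_SPAN] after [EOS] in case that token is kept.

```python
BOS, EOS = "[BOS]", "[EOS]"
MASK, START, END = "[MASKED_SPAN]", "[START_SPAN]", "[END_SPAN]"


def restore_sequence(example: str) -> str:
    """Recover the original sequence from a FIM or left-to-right example."""
    if not example.startswith(BOS) or EOS not in example:
        raise ValueError("expected an example of the form [BOS]...[EOS]...")
    context, _, tail = example[len(BOS):].partition(EOS)
    if MASK not in context:
        # Left-to-right example: nothing should follow [EOS].
        if tail:
            raise ValueError("unexpected text after [EOS]")
        return context
    if not tail.endswith(END):
        raise ValueError("FIM example must end with [END_SPAN]")
    span = tail[: -len(END)]
    if span.startswith(START):  # optional token, see the discussion above
        span = span[len(START):]
    # Put the moved span back where [MASKED_SPAN] sits.
    prefix, _, suffix = context.partition(MASK)
    return prefix + span + suffix
```

The same logic applies at inference time: after the model emits [END_SPAN], the generated span is spliced back in at the [MASKED_SPAN] position.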

pascalnotin commented 9 months ago

Closing this issue since it was resolved by PR #32.