Closed: jeffreyruffolo closed this issue 9 months ago
Can I take this one?
Sure! One key decision here is which tokens we should use to denote the start and end of the region that is moved to the end of the sequence for FIM modeling. Any special tokens should work, but it might be nice to maintain consistency with an "end" token used for standard decoding and just introduce new tokens for the position of the moved region and its end.
Yeah we can just add tokens. Do you have any preferences from IgLM?
No strong preferences. Something like [SPAN], [EOS], [SEP] for the masked region, true end-of-sequence (i.e. C-terminus), and end of span, respectively, could work and be interpretable.
How about the following?
[BOS]GLEAVNKDKPLGAVALKSYEEEL[MASKED_SPAN]NAQKGEIMPNIPQMSAFWYAVRTAVIN[EOS][START_SPAN]AKDPRIAATME[END_SPAN]
I think you could skip [START_SPAN], since it would always follow an [EOS]. Otherwise looks good!
What about the cases where we aren't doing fill in the middle?
Those cases would fall back to:
[BOS]GLEAVNKDKPLGAVALKSYEEELAKDPRIAATMENAQKGEIMPNIPQMSAFWYAVRTAVIN[EOS]
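A minimal sketch of the two formats discussed above (function names are hypothetical; the span boundaries below are chosen just to reproduce the example sequence):

```python
def to_fim_example(seq: str, start: int, end: int) -> str:
    """Format a sequence for FIM training: the region seq[start:end] is
    cut out, replaced by [MASKED_SPAN], and appended after [EOS]."""
    prefix, span, suffix = seq[:start], seq[start:end], seq[end:]
    # [START_SPAN] is omitted, since it would always follow [EOS]
    return f"[BOS]{prefix}[MASKED_SPAN]{suffix}[EOS]{span}[END_SPAN]"

def to_causal_example(seq: str) -> str:
    """Plain left-to-right example (no FIM)."""
    return f"[BOS]{seq}[EOS]"

seq = "GLEAVNKDKPLGAVALKSYEEELAKDPRIAATMENAQKGEIMPNIPQMSAFWYAVRTAVIN"
print(to_fim_example(seq, 23, 34))
# [BOS]GLEAVNKDKPLGAVALKSYEEEL[MASKED_SPAN]NAQKGEIMPNIPQMSAFWYAVRTAVIN[EOS]AKDPRIAATME[END_SPAN]
```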
Closing the issue since this was resolved by PR#32.
Fill-in-the-middle (FIM) is described in this paper from OpenAI, as well as other prior work: https://arxiv.org/pdf/2207.14255.pdf
Ideally, our data collator could take some FIM frequency so that we can tune the rate of FIM vs left-to-right training examples.
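A rough sketch of what such a collator could look like, assuming a single `fim_rate` parameter and uniform random span selection (all names here are hypothetical, not the actual implementation):

```python
import random

def collate(seq: str, fim_rate: float = 0.5, rng=None) -> str:
    """With probability `fim_rate`, format `seq` as a FIM example by moving
    a random span to the end; otherwise fall back to left-to-right."""
    rng = rng or random.Random()
    if rng.random() < fim_rate:
        # choose a random non-empty span seq[start:end] to move to the end
        start = rng.randrange(len(seq))
        end = rng.randrange(start + 1, len(seq) + 1)
        prefix, span, suffix = seq[:start], seq[start:end], seq[end:]
        return f"[BOS]{prefix}[MASKED_SPAN]{suffix}[EOS]{span}[END_SPAN]"
    return f"[BOS]{seq}[EOS]"
```

Setting `fim_rate=0.0` recovers pure left-to-right training, and `fim_rate=1.0` makes every example a FIM example, so the mix is tunable per run.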