Explicitly, this PR adds the ability to dynamically switch between different tokenizers at runtime[^flash-attn-sunspot].
Supported tokenizers:

- `TOKENIZER_TYPE="llama"` **[DEFAULT]**
  - Uses the `Llama2Tokenizer` + whatever `DATA_FILE_LIST` is specified.
  - If no `DATA_FILE_LIST` is specified, will fall back to `ALCF/data-lists/${MACHINE}/dolma_v1_7_file_list.txt`.
- `TOKENIZER_TYPE="gpt"`
  - Uses the `GPT2BPETokenizer`.
  - By default, will use the `BookCorpusDataset` with the `gpt2-merges` and vocab file(s).
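The selection and fallback behavior above can be sketched roughly as follows. This is an illustrative sketch only, not the actual PR code; the variable names `TOKENIZER` and `MACHINE` and the echo at the end are assumptions for demonstration:

```shell
#!/bin/bash
# Illustrative sketch of the tokenizer-selection logic (not the actual PR code).
TOKENIZER_TYPE="${TOKENIZER_TYPE:-llama}"  # default to the llama tokenizer

if [[ "${TOKENIZER_TYPE}" == "llama" ]]; then
    TOKENIZER="Llama2Tokenizer"
    # Fall back to the dolma v1.7 file list when no DATA_FILE_LIST was given
    DATA_FILE_LIST="${DATA_FILE_LIST:-ALCF/data-lists/${MACHINE}/dolma_v1_7_file_list.txt}"
elif [[ "${TOKENIZER_TYPE}" == "gpt" ]]; then
    # GPT path defaults to BookCorpusDataset + gpt2-merges / vocab file(s)
    TOKENIZER="GPT2BPETokenizer"
fi

echo "Using ${TOKENIZER} with ${DATA_FILE_LIST:-default GPT data}"
```

For example, launching with `TOKENIZER_TYPE="gpt" bash <train-script>` would route through the GPT branch, while leaving the variable unset keeps the current (llama) behavior.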
[^flash-attn-sunspot]: This is part of an effort to better understand the behavior of flash-attn on Sunspot.
For additional details, see: 📸 flash-attn on Sunspot