Explicitly, this PR adds the ability to dynamically switch between different tokenizers at runtime[^flash-attn-sunspot].
Supported tokenizers:

- `TOKENIZER_TYPE="llama"` **[DEFAULT]**
  - Uses the `Llama2Tokenizer` + whatever `DATA_FILE_LIST` is specified.
  - If no `DATA_FILE_LIST` is specified, will fall back to `ALCF/data-lists/${MACHINE}/dolma_v1_7_file_list.txt`.
- `TOKENIZER_TYPE="gpt"`
  - Uses the `GPT2BPETokenizer`.
  - By default, will use the `BookCorpusDataset` with the `gpt2-merges` and vocab file(s).
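The selection and fallback behavior above can be sketched roughly as follows. This is an illustrative sketch only, not the actual PR code; the variable names `TOKENIZER` and `MACHINE` and the echo at the end are assumptions for demonstration:

```shell
#!/bin/bash
# Illustrative sketch of the tokenizer-selection logic (not the actual PR code).
TOKENIZER_TYPE="${TOKENIZER_TYPE:-llama}"  # default to the llama tokenizer

if [[ "${TOKENIZER_TYPE}" == "llama" ]]; then
    TOKENIZER="Llama2Tokenizer"
    # Fall back to the dolma v1.7 file list when no DATA_FILE_LIST was given
    DATA_FILE_LIST="${DATA_FILE_LIST:-ALCF/data-lists/${MACHINE}/dolma_v1_7_file_list.txt}"
elif [[ "${TOKENIZER_TYPE}" == "gpt" ]]; then
    # GPT path defaults to BookCorpusDataset + gpt2-merges / vocab file(s)
    TOKENIZER="GPT2BPETokenizer"
fi

echo "Using ${TOKENIZER} with ${DATA_FILE_LIST:-default GPT data}"
```

For example, launching with `TOKENIZER_TYPE="gpt" bash <train-script>` would route through the GPT branch, while leaving the variable unset keeps the current (llama) behavior.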
[^flash-attn-sunspot]: This is part of an effort to better understand the behavior of flash-attn on Sunspot.
For additional details, see: 📸 flash-attn on Sunspot