How to support text-only dataset (GPTDataset)?

Hey @leondada,

For now, Energon focuses primarily on multi-modal data, e.g. text combined with one or more images. For text-only data, there is an optimized data loader in the Megatron-Core repository: https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/datasets. It also includes a few twists for effective randomized training, which are not (yet) implemented in Energon. You should also expect better speed from that specialized loader, if speed is a concern for you.

Nevertheless, you can use Energon for text-only data. To do so, create a WebDataset containing the .txt files for your samples. Then prepare that dataset with the prepare command, selecting the TextDataset type, and you should be able to get started :slightly_smiling_face: We don't have a full recipe for that yet.
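As a rough starting point for the first step, here is a minimal sketch of packing text samples into a WebDataset-style tar shard using only the Python standard library. It is not an official recipe: the helper names `write_text_shard` and `list_shard_members`, the six-digit sample keys, and the shard file name are all my own assumptions for illustration. The only convention it relies on is that a WebDataset shard is a tar file where each sample's files share a base name, so a text-only sample is a single `<key>.txt` member.

```python
import io
import tarfile
from pathlib import Path


def write_text_shard(texts, shard_path):
    """Write each string as `<index>.txt` into an uncompressed tar shard.

    Hypothetical helper, not part of Energon. Keys are zero-padded indices.
    """
    shard_path = Path(shard_path)
    shard_path.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(shard_path, "w") as tar:
        for index, text in enumerate(texts):
            payload = text.encode("utf-8")
            # One tar member per sample; the base name is the sample key.
            info = tarfile.TarInfo(name=f"{index:06d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return shard_path


def list_shard_members(shard_path):
    """Return the member names of a shard, in the order they were written."""
    with tarfile.open(shard_path, "r") as tar:
        return [member.name for member in tar.getmembers()]


if __name__ == "__main__":
    # Assumed output location; point the prepare command at this folder afterwards.
    shard = write_text_shard(
        ["first document", "second document"],
        "my_text_dataset/shard_000000.tar",
    )
    print(list_shard_members(shard))
```

For larger corpora you would probably want to split the samples across several shards rather than writing one big tar, but the per-sample layout stays the same.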
