NVIDIA / Megatron-Energon

Megatron's multi-modal data loader

How to support text-only dataset (GPTDataset)? #28

Open leondada opened 5 days ago

leondada commented 5 days ago

How to support text-only dataset (GPTDataset)?

voegtlel commented 1 day ago

Hey @leondada,

For now, Energon focuses primarily on multi-modal data, e.g. text combined with image(s). For text-only data, there is the optimized data loader in the Megatron-Core repository (https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/datasets), which also includes a few twists for effective randomized training. These are not (yet) implemented in Energon. You should also expect better speed from the specialized loader, if that is a concern for you.

Nevertheless, you can use Energon for text-only data. To do so, create a WebDataset containing the .txt files for your samples. Then prepare that dataset with the prepare command, selecting the TextDataset type, and you should be able to get started 🙂 We don't have a full recipe for that yet.
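
As a rough illustration of the first step only (packing text samples into a WebDataset-style tar shard), here is a minimal sketch using just the Python standard library. It is not an official Energon recipe: the helper name `write_text_shard`, the shard file name, and the zero-padded sample keys are assumptions made for this example. It relies only on the general WebDataset convention that each sample is stored as a tar member named `<key>.<extension>`.

```python
import io
import tarfile
from pathlib import Path


def write_text_shard(samples, shard_path):
    """Write an iterable of strings into one WebDataset-style tar shard.

    Hypothetical helper for illustration; not part of Energon.
    Each sample becomes a tar member named "<key>.txt", where the key is a
    zero-padded running index (an arbitrary choice for this sketch).
    Returns the number of samples written.
    """
    shard_path = Path(shard_path)
    shard_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(shard_path, "w") as tar:
        for index, text in enumerate(samples):
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=f"{index:08d}.txt")
            info.size = len(payload)  # tar needs the byte length up front
            tar.addfile(info, io.BytesIO(payload))
            count += 1
    return count


# Example call (paths are placeholders):
# write_text_shard(["first document", "second document"],
#                  "my_text_dataset/shard-000000.tar")
```

The resulting directory of shards would then be the input for the prepare step described above, where the TextDataset type is selected. For large corpora you would likely want to split the samples across several shards rather than writing one big tar.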