tscholak opened this issue 3 weeks ago
That seems like something we want, but I'd like to clarify what the problem is exactly.
We already support Megatron-style blended datasets; it's what we started with and are still using. The problem came when we started using really big datasets that need to be split into multiple files. We don't really support that, so as a hack we decided to treat these as separate datasets. But that means we end up with hundreds of "datasets" in a hierarchy that mixes actual datasets with meaningful probabilities and individual files whose probability is `dataset_prob * (file_tokens / dataset_tokens)`. The json format, `concatenate_datasets.py` and `mix_dataset.py` are all hacks to help us deal with that.
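For example, a dataset with probability 0.6 that is split into two files holding 30% and 70% of its tokens ends up as two file-level entries with probabilities 0.18 and 0.42.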
So what we really want is to allow for datasets that span multiple files. My suggestion would be to make a concatenation wrapper for `MMapIndexedDataset` so we can treat multi-file datasets as what they really are. I think this is compatible with your proposal, but it gets rid of most of the preprocessing need.
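Something along these lines could work (a minimal sketch, not existing Fast-LLM API; the class name is made up, and it assumes each `MMapIndexedDataset` supports `len()` and integer indexing):

```python
import numpy as np


class ConcatenatedIndexedDataset:
    """Present several memory-mapped dataset files as a single indexed dataset."""

    def __init__(self, datasets):
        self._datasets = list(datasets)
        # Cumulative document counts, used to map a global index to (file, local index).
        self._offsets = np.cumsum([0] + [len(dataset) for dataset in self._datasets])

    def __len__(self):
        return int(self._offsets[-1])

    def __getitem__(self, index):
        # Find which underlying file the global index falls into.
        file_index = int(np.searchsorted(self._offsets, index, side="right")) - 1
        return self._datasets[file_index][index - self._offsets[file_index]]
```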
Then there are the details of how multi-file datasets are configured. Providing a directory in the config is an option, but I think it would be safer to keep some index file, e.g. a yaml file containing a list of data paths.
> Providing a directory in the config is an option, but I think it would be safer to keep some index file, e.g. a yaml file containing a list of data paths.
I understand where you're coming from, but this also creates more work for the user. It would be great if this didn't require any special tooling, because right now `concatenate_dataset.py` depends on `MMapIndexedDataset`. Can we make this a simple file, where each line is a file path? No weights or token counts.
> I understand where you're coming from, but this also creates more work for the user. It would be great if this didn't require any special tooling, because right now `concatenate_dataset.py` depends on `MMapIndexedDataset`. Can we make this a simple file, where each line is a file path? No weights or token counts.
Yes, that's what I'm suggesting. Once datasets are concatenated instead of blended, there is no more need for probabilities. Token counts and other metadata could be useful as an extra safety check, but we can leave them out if it's too much trouble.
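For illustration, such an index file could be as simple as one path per line (paths here are made up):

```
/data/datasets/folder1/shard_000.idx
/data/datasets/folder1/shard_001.idx
/data/datasets/folder1/shard_002.idx
```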
We still need the ability to define target proportions for individual datasets (which are themselves split into many mmap'ed bin files), so that's why I think Fast-LLM's config classes should be changed to allow for something like this:
```yaml
datasets:
  - path: /data/datasets/folder1/manifest.txt
    target_proportion: 0.6
  - path: /data/datasets/single_file.idx
    target_proportion: 0.4
```
This would then eliminate the need for `mix_datasets.py` as well.
Yes, that's a good plan. Right now it's taking a `path: list[str]` in the Megatron format `[path1, prob1, path2, prob2, ...]`, but that's not a good format. Now that lists of configs are supported, we could have something like `datasets: list[DatasetConfig]`.
This one, right?

I suppose that can work. Not sure if they should all be allowed to have different multiprocessing settings.

I'm also not sure about: what's `list` and `sample`?
> This one, right?
I think `datasets` would need to be a field in `DataConfig`, and `DatasetConfig` a new class that replaces `format`, `path` and probably `split`. There is still a need for a general data config independent of individual datasets.
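A rough sketch of what that split could look like (plain dataclasses for illustration only; Fast-LLM's actual config base classes and field names may differ):

```python
from dataclasses import dataclass, field


@dataclass
class DatasetConfig:
    # Per-dataset settings, replacing the old `format`, `path` and (probably) `split` fields.
    path: str                       # a single .idx file, or an index file listing many shards
    target_proportion: float = 1.0  # sampling weight relative to the other datasets


@dataclass
class DataConfig:
    # General data settings stay here, independent of individual datasets.
    datasets: list[DatasetConfig] = field(default_factory=list)
    # ... tokenizer, num_workers, etc. would remain here as global settings.
```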
> I'm also not sure about: what's `list` and `sample`?
`list` is the default Megatron-like format. `sample` is a hack we made at some point to compare with HuggingFace: it reads a single sample from a numpy file and always returns it. Not sure we need it now that we have the HuggingFace wrapper.
We should be able to come up with some better way to define dataset formats, maybe something modular and model-dependent like I did with checkpoints? That would have the added benefit of making things a lot easier for custom models (#5, #20).
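For instance, something along these lines might do it (purely a sketch; these names are invented and not existing Fast-LLM API):

```python
# Hypothetical registry mapping a format name to a dataset reader class,
# loosely following the modular approach used for checkpoint formats.
_DATASET_FORMATS: dict[str, type] = {}


def register_dataset_format(name: str):
    """Register a reader class under a format name so configs can refer to it by name."""
    def decorator(cls):
        _DATASET_FORMATS[name] = cls
        return cls
    return decorator


@register_dataset_format("mmap")
class MMapDatasetReader:
    def __init__(self, path: str):
        self.path = path  # a .bin/.idx prefix, or an index file listing many shards


def build_dataset(format_name: str, path: str):
    # Custom models could register their own formats without touching the core code.
    return _DATASET_FORMATS[format_name](path)
```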
🧐 Problem Description
Currently, creating a training dataset with Fast-LLM involves a multi-step, cumbersome process:
This workflow is inefficient, error-prone (e.g., issue #71), and less user-friendly compared to other LLM training frameworks that offer simpler, more integrated data-loading mechanisms:
- `BlendedDataset`: supports combining datasets with different weights directly.

The additional steps required by Fast-LLM add complexity and reduce competitiveness in terms of data handling and preparation.
💡 Proposed Solution
- **Integrate the Preprocessing Step into Fast-LLM:**
- **Revamp the Dataset Configuration Format:**
Example Configuration
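For example (a sketch of the intended format; exact field names are open to discussion and the paths are made up):

```yaml
datasets:
  - path: /data/datasets/dataset_a/index.yaml   # index file listing many .bin/.idx shards
    target_proportion: 0.6
  - path: /data/datasets/dataset_b.idx          # a single-file dataset
    target_proportion: 0.4
```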
With this setup, Fast-LLM will automatically distribute the proportions among datasets within the specified paths.
🔄 Alternatives Considered
- **Keep the Existing Script-Based Workflow:**
- **Provide a Standalone Utility for Merging and Weighting:**
📈 Potential Benefits
- **Improved Usability:**
- **Enhanced Competitiveness:**
- **Streamlined Workflow:**
📝 Additional Context
Integrating preprocessing directly into Fast-LLM would bring it closer to modern LLM frameworks that offer unified dataset preparation. This approach will facilitate future support for custom dataset implementations, such as streaming Parquet files from cloud storage (e.g., S3). For reference, frameworks like Mosaic's Composer already provide flexible data-loading options, making integration smoother.