tscholak opened this issue 3 weeks ago
That seems like something we want, but I'd like to clarify what the problem is exactly.
We already support Megatron-style blended datasets; it's what we started with and are still using. The problem came when we started using really big datasets that need to be split into multiple files. We don't really support that, so as a hack we decided to treat these as separate datasets. But that means we end up with hundreds of "datasets" in a hierarchy that mixes actual datasets with meaningful probabilities and individual files whose probability is `dataset_prob * (file_tokens / dataset_tokens)`. The json format, `concatenate_datasets.py` and `mix_dataset.py` are all hacks to help us deal with that.
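For example, a dataset with probability 0.6 that is split into two files holding 30% and 70% of its tokens ends up as two file-level entries with probabilities 0.18 and 0.42.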
So what we really want is to allow for datasets that span multiple files. My suggestion would be to make a concatenation wrapper for `MMapIndexedDataset` so we can treat multi-file datasets as what they really are. I think this is compatible with your proposal, but it gets rid of most of the preprocessing need.
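Something along these lines could work (a minimal sketch, not existing Fast-LLM API; the class name is made up, and it assumes each `MMapIndexedDataset` supports `len()` and integer indexing):

```python
import numpy as np


class ConcatenatedIndexedDataset:
    """Present several memory-mapped dataset files as a single indexed dataset."""

    def __init__(self, datasets):
        self._datasets = list(datasets)
        # Cumulative document counts, used to map a global index to (file, local index).
        self._offsets = np.cumsum([0] + [len(dataset) for dataset in self._datasets])

    def __len__(self):
        return int(self._offsets[-1])

    def __getitem__(self, index):
        # Find which underlying file the global index falls into.
        file_index = int(np.searchsorted(self._offsets, index, side="right")) - 1
        return self._datasets[file_index][index - self._offsets[file_index]]
```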
Then there are the details of how multi-file datasets are configured. Providing a directory in the config is an option, but I think it would be safer to keep some index file, e.g. a yaml file containing a list of data paths.
> Providing a directory in the config is an option, but I think it would be safer to keep some index file, e.g. a yaml file containing a list of data paths.
I understand where you're coming from, but this also creates more work for the user. It would be great if this didn't require any special tooling, because right now `concatenate_dataset.py` depends on `MMapIndexedDataset`. Can we make this a simple file, where each line is a file path? No weights or token counts.
> I understand where you're coming from, but this also creates more work for the user. It would be great if this didn't require any special tooling, because right now `concatenate_dataset.py` depends on `MMapIndexedDataset`. Can we make this a simple file, where each line is a file path? No weights or token counts.
Yes, that's what I'm suggesting. Once datasets are concatenated instead of blended, there is no more need for probabilities. Token counts and other metadata could be useful as an extra safety check, but we can leave them out if it's too much trouble.
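For illustration, such an index file could be as simple as one path per line (paths here are made up):

```
/data/datasets/folder1/shard_000.idx
/data/datasets/folder1/shard_001.idx
/data/datasets/folder1/shard_002.idx
```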
We still need the ability to define target proportions for individual datasets (which are themselves split into many mmap'ed bin files), so that's why I think Fast-LLM's config classes should be changed to allow for something like this:
```yaml
datasets:
  - path: /data/datasets/folder1/manifest.txt
    target_proportion: 0.6
  - path: /data/datasets/single_file.idx
    target_proportion: 0.4
```
This would then eliminate the need for `mix_datasets.py` as well.
Yes, that's a good plan. Right now it's taking a `path: list[str]` in the Megatron format `[path1, prob1, path2, prob2, ...]`, but that's not a good format. Now that lists of configs are supported, we could have something like `datasets: list[DatasetConfig]`.
This one, right?

I suppose that can work. Not sure if they should all be allowed to have different multiprocessing settings.

I'm also not sure about: what's `list` and `sample`?
> This one, right?
I think `datasets` would need to be a field in `DataConfig`, and `DatasetConfig` a new class that replaces `format`, `path` and probably `split`. There is still a need for a general data config independent of individual datasets.
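A rough sketch of what that split could look like (plain dataclasses for illustration only; Fast-LLM's actual config base classes and field names may differ):

```python
from dataclasses import dataclass, field


@dataclass
class DatasetConfig:
    # Per-dataset settings, replacing the old `format`, `path` and (probably) `split` fields.
    path: str                       # a single .idx file, or an index file listing many shards
    target_proportion: float = 1.0  # sampling weight relative to the other datasets


@dataclass
class DataConfig:
    # General data settings stay here, independent of individual datasets.
    datasets: list[DatasetConfig] = field(default_factory=list)
    # ... tokenizer, num_workers, etc. would remain here as global settings.
```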
> I'm also not sure about: what's `list` and `sample`?
`list` is the default Megatron-like format. `sample` is a hack we made at some point to compare with HuggingFace: it reads a single sample from a numpy file and always returns it. Not sure we need it now that we have the HuggingFace wrapper.
We should be able to come up with some better way to define dataset formats, maybe something modular and model-dependent like I did with checkpoints? That would have the added benefit of making things a lot easier for custom models (#5, #20).
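For instance, something along these lines might do it (purely a sketch; these names are invented and not existing Fast-LLM API):

```python
# Hypothetical registry mapping a format name to a dataset reader class,
# loosely following the modular approach used for checkpoint formats.
_DATASET_FORMATS: dict[str, type] = {}


def register_dataset_format(name: str):
    """Register a reader class under a format name so configs can refer to it by name."""
    def decorator(cls):
        _DATASET_FORMATS[name] = cls
        return cls
    return decorator


@register_dataset_format("mmap")
class MMapDatasetReader:
    def __init__(self, path: str):
        self.path = path  # a .bin/.idx prefix, or an index file listing many shards


def build_dataset(format_name: str, path: str):
    # Custom models could register their own formats without touching the core code.
    return _DATASET_FORMATS[format_name](path)
```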
🧐 Problem Description
Currently, creating a training dataset with Fast-LLM involves a multi-step, cumbersome process:
This workflow is inefficient, error-prone (e.g., issue #71), and less user-friendly compared to other LLM training frameworks that offer simpler, more integrated data-loading mechanisms:
- `BlendedDataset`: supports combining datasets with different weights directly.

The additional steps required by Fast-LLM add complexity and reduce competitiveness in terms of data handling and preparation.
💡 Proposed Solution
- **Integrate the Preprocessing Step into Fast-LLM:**
- **Revamp the Dataset Configuration Format:**
Example Configuration
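For example (a sketch of the intended format; exact field names are open to discussion and the paths are made up):

```yaml
datasets:
  - path: /data/datasets/dataset_a/index.yaml   # index file listing many .bin/.idx shards
    target_proportion: 0.6
  - path: /data/datasets/dataset_b.idx          # a single-file dataset
    target_proportion: 0.4
```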
With this setup, Fast-LLM will automatically distribute the proportions among datasets within the specified paths.
🔄 Alternatives Considered
- **Keep the Existing Script-Based Workflow:**
- **Provide a Standalone Utility for Merging and Weighting:**
📈 Potential Benefits
- **Improved Usability:**
- **Enhanced Competitiveness:**
- **Streamlined Workflow:**
📝 Additional Context
Integrating preprocessing directly into Fast-LLM would bring it closer to modern LLM frameworks that offer unified dataset preparation. This approach will facilitate future support for custom dataset implementations, such as streaming Parquet files from cloud storage (e.g., S3). For reference, frameworks like Mosaic's Composer already provide flexible data-loading options, making integration smoother.