ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Add prepare command #38

Closed tscholak closed 1 week ago

tscholak commented 1 week ago

✨ Description

Extracted and refined the dataset preparation script from #17 and made it a command like train or convert. Example call and config:

fast-llm prepare gpt_memmap --config foo.yaml

or

torchrun --standalone --nnodes 1 --nproc_per_node=1 --no_python \
    fast-llm prepare gpt_memmap --config foo.yaml

where foo.yaml contains:

output_path: /tmp/foo

loading_workers: 4
tokenize_workers: 4
saving_workers: 4

dataset:
  name_or_path: stas/openwebtext-10k

tokenizer:
  path: /tmp/SmolLM-135M/tokenizer.json

Run git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M in /tmp to get that tokenizer file.
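
If you only need the tokenizer file and not the whole model repo, something along these lines should also work (a sketch that assumes the huggingface_hub package is installed; it is not part of this PR):

from huggingface_hub import hf_hub_download

# Fetch just tokenizer.json into /tmp/SmolLM-135M, matching the path in foo.yaml above.
tokenizer_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM-135M",
    filename="tokenizer.json",
    local_dir="/tmp/SmolLM-135M",
)
print(tokenizer_path)  # /tmp/SmolLM-135M/tokenizer.json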

This will produce:

/tmp/foo
├── downloaded_dataset
│   ├── cache-1e5559f36da9962e_00002_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00003_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00001_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00000_of_00004.arrow
│   ├── data-00001-of-00004.arrow
│   ├── dataset_info.json
│   ├── data-00000-of-00004.arrow
│   ├── data-00002-of-00004.arrow
│   ├── data-00003-of-00004.arrow
│   ├── ok
│   └── state.json
├── shard_0_0.idx
├── shard_0_0.bin
└── fast_llm_dataset.json

with fast_llm_dataset.json reading:

{
    "datasets": [
        {
            "prefix": "shard_0_0",
            "num_documents": 10000,
            "num_tokens": 11569536,
            "weight": 1.0
        }
    ]
}

The downloaded_dataset directory is only an intermediate download cache and can be deleted afterwards; Fast-LLM itself only needs the shard files and fast_llm_dataset.json.
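
As a quick sanity check, the index can be inspected with a few lines of Python (an illustration based on the JSON shown above, not code from this PR):

import json

# Load the dataset index written by the prepare command (path from the example above).
with open("/tmp/foo/fast_llm_dataset.json") as f:
    index = json.load(f)

for shard in index["datasets"]:
    print(shard["prefix"], shard["num_documents"], shard["num_tokens"], shard["weight"])

# 11,569,536 tokens in total for the openwebtext-10k example.
print("total tokens:", sum(shard["num_tokens"] for shard in index["datasets"]))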

📝 Changes

  1. Fixed the memory-mapped indexed dataset and added round-trip tests (the round-trip idea is sketched below)
  2. Added prepare_dataset command
  3. Simplified Dockerfile
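
For context on change 1, the round-trip property being tested is roughly the following: documents written to a .bin/.idx shard pair must come back out unchanged. The sketch below shows the idea with plain numpy; the file layout and helpers here are hypothetical and differ from Fast-LLM's actual memmap format:

import numpy as np

# Hypothetical shard layout: flat token ids in the .bin file, per-document lengths in the .idx file.
def write_shard(prefix, documents):
    np.concatenate(documents).astype(np.uint16).tofile(prefix + ".bin")
    np.array([len(d) for d in documents], dtype=np.int64).tofile(prefix + ".idx")

def read_shard(prefix):
    tokens = np.fromfile(prefix + ".bin", dtype=np.uint16)
    lengths = np.fromfile(prefix + ".idx", dtype=np.int64)
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    return [tokens[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]

# Round trip: what goes in must come back out unchanged.
docs = [np.array([1, 2, 3], dtype=np.uint16), np.array([7, 8], dtype=np.uint16)]
write_shard("/tmp/shard_test", docs)
for original, restored in zip(docs, read_shard("/tmp/shard_test")):
    assert np.array_equal(original, restored)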

📊 Performance Impact Details

N/A


📝 Additional Notes

N/A

tscholak commented 1 week ago

Note to self:

tscholak commented 1 week ago

@jlamypoirier

  1. I introduced some classes so that we can have different data preparations for different models (a rough sketch of the shape is at the end of this comment). This class hierarchy should eventually move somewhere else, depending on the outcome of #34. I only made one implementation for GPTs so far. We need another one for VLMs, which can be extracted from #5; I'm coordinating with @akshaykalkunte to do just that.

  2. About the changes to the Dockerfile: I realized that we can install Fast-LLM in editable mode globally if we make /app writable for everyone, so I removed the fast-llm user. This also resolves a problem where fast-llm isn't usable when the user ID in the job doesn't match the ID of the fast-llm user baked into the image. These changes are part of this PR because I needed them to test data preparation on Toolkit with the images created by the CI action: on Toolkit the job runs with user ID 13013, whereas CI builds the image with user ID 1000. Since we can't make any assumptions about the environment in which the official Fast-LLM image will be deployed, it's better to remove user creation altogether.

  3. I looked into using Fast-LLM's existing Distributed and DistributedConfig but found them too difficult to adapt to this straightforward CPU-only use case. I don't want to deal with distributed dims or CUDA RNG initialization just to run this simple data preparation code on multiple nodes; users shouldn't have to bring GPUs for data processing when they aren't needed. A minimal sketch of the kind of setup I have in mind is below.
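
To make points 1 and 3 concrete, here is a rough sketch of the kind of structure and CPU-only coordination I have in mind. All names are made up for illustration; the classes in this PR are organized differently, and the gloo-based setup is simply the standard torch.distributed way to coordinate CPU-only workers launched with torchrun:

import abc
import datetime
import os

import torch.distributed as dist

class DatasetPreparator(abc.ABC):
    """Hypothetical base class; one subclass per model family (GPT, VLM, ...)."""

    @abc.abstractmethod
    def run(self) -> None:
        ...

class GPTMemmapPreparator(DatasetPreparator):
    """Illustrative GPT variant: download, tokenize, and write memmap shards."""

    def run(self) -> None:
        # CPU-only coordination: gloo backend, no CUDA, no distributed dims.
        # Launched e.g. via: torchrun --nnodes 2 --nproc_per_node 1 --no_python fast-llm prepare ...
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
        if world_size > 1:
            dist.init_process_group(backend="gloo", timeout=datetime.timedelta(minutes=30))
            rank = dist.get_rank()
        else:
            rank = 0
        # ... each rank downloads, tokenizes, and saves its slice of the dataset here ...
        if world_size > 1:
            dist.barrier()
            dist.destroy_process_group()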