ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Add prepare command #38

Closed tscholak closed 1 week ago

tscholak commented 1 week ago

✨ Description

Extracted and refined the dataset preparation script from #17 and made it a command like train or convert. Example call and config:

fast-llm prepare gpt_memmap --config foo.yaml

or

torchrun --standalone --nnodes 1 --nproc_per_node=1 --no_python \
    fast-llm prepare gpt_memmap --config foo.yaml

where foo.yaml contains:

output_path: /tmp/foo

loading_workers: 4
tokenize_workers: 4
saving_workers: 4

dataset:
  name_or_path: stas/openwebtext-10k

tokenizer:
  path: /tmp/SmolLM-135M/tokenizer.json

Run git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M in /tmp to get that tokenizer file.
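
If you only need the tokenizer file and not the whole model repo, something along these lines should also work (a sketch that assumes the huggingface_hub package is installed; it is not part of this PR):

from huggingface_hub import hf_hub_download

# Fetch just tokenizer.json into /tmp/SmolLM-135M, matching the path in foo.yaml above.
tokenizer_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM-135M",
    filename="tokenizer.json",
    local_dir="/tmp/SmolLM-135M",
)
print(tokenizer_path)  # /tmp/SmolLM-135M/tokenizer.json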

This will produce:

/tmp/foo
├── downloaded_dataset
│   ├── cache-1e5559f36da9962e_00002_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00003_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00001_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00000_of_00004.arrow
│   ├── data-00001-of-00004.arrow
│   ├── dataset_info.json
│   ├── data-00000-of-00004.arrow
│   ├── data-00002-of-00004.arrow
│   ├── data-00003-of-00004.arrow
│   ├── ok
│   └── state.json
├── shard_0_0.idx
├── shard_0_0.bin
└── fast_llm_dataset.json

with fast_llm_dataset.json reading:

{
    "datasets": [
        {
            "prefix": "shard_0_0",
            "num_documents": 10000,
            "num_tokens": 11569536,
            "weight": 1.0
        }
    ]
}

The downloaded_dataset directory is only an intermediate download cache and can be deleted afterwards; Fast-LLM itself only needs the shard files and fast_llm_dataset.json.
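
As a quick sanity check, the index can be inspected with a few lines of Python (an illustration based on the JSON shown above, not code from this PR):

import json

# Load the dataset index written by the prepare command (path from the example above).
with open("/tmp/foo/fast_llm_dataset.json") as f:
    index = json.load(f)

for shard in index["datasets"]:
    print(shard["prefix"], shard["num_documents"], shard["num_tokens"], shard["weight"])

# 11,569,536 tokens in total for the openwebtext-10k example.
print("total tokens:", sum(shard["num_tokens"] for shard in index["datasets"]))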

📝 Changes

  1. Fixed the memory-mapped indexed dataset and added round-trip tests (the round-trip idea is sketched below)
  2. Added prepare_dataset command
  3. Simplified Dockerfile
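
For context on change 1, the round-trip property being tested is roughly the following: documents written to a .bin/.idx shard pair must come back out unchanged. The sketch below shows the idea with plain numpy; the file layout and helpers here are hypothetical and differ from Fast-LLM's actual memmap format:

import numpy as np

# Hypothetical shard layout: flat token ids in the .bin file, per-document lengths in the .idx file.
def write_shard(prefix, documents):
    np.concatenate(documents).astype(np.uint16).tofile(prefix + ".bin")
    np.array([len(d) for d in documents], dtype=np.int64).tofile(prefix + ".idx")

def read_shard(prefix):
    tokens = np.fromfile(prefix + ".bin", dtype=np.uint16)
    lengths = np.fromfile(prefix + ".idx", dtype=np.int64)
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    return [tokens[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]

# Round trip: what goes in must come back out unchanged.
docs = [np.array([1, 2, 3], dtype=np.uint16), np.array([7, 8], dtype=np.uint16)]
write_shard("/tmp/shard_test", docs)
for original, restored in zip(docs, read_shard("/tmp/shard_test")):
    assert np.array_equal(original, restored)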

📊 Performance Impact Details

N/A


📝 Additional Notes

N/A

tscholak commented 1 week ago

Note to self:

tscholak commented 1 week ago

@jlamypoirier

  1. I introduced some classes so that we can have different data preparations for different models (a rough sketch of the shape is at the end of this comment). This class hierarchy should eventually move somewhere else, depending on the outcome of #34. I only made one implementation for GPTs so far. We need another one for VLMs, which can be extracted from #5; I'm coordinating with @akshaykalkunte to do just that.

  2. About the changes to the Dockerfile: I realized that we can install Fast-LLM in editable mode globally if we make /app writable for everyone, so I removed the fast-llm user. This also resolves a problem where fast-llm isn't usable when the user ID in the job doesn't match the ID of the fast-llm user baked into the image. These changes are part of this PR because I needed them to test data preparation on Toolkit with the images created by the CI action: on Toolkit the job runs with user ID 13013, whereas CI builds the image with user ID 1000. Since we can't make any assumptions about the environment in which the official Fast-LLM image will be deployed, it's better to remove user creation altogether.

  3. I looked into using Fast-LLM's existing Distributed and DistributedConfig but found them too difficult to adapt to this straightforward CPU-only use case. I don't want to deal with distributed dims or CUDA RNG initialization just to run this simple data preparation code on multiple nodes; users shouldn't have to bring GPUs for data processing when they aren't needed. A minimal sketch of the kind of setup I have in mind is below.
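
To make points 1 and 3 concrete, here is a rough sketch of the kind of structure and CPU-only coordination I have in mind. All names are made up for illustration; the classes in this PR are organized differently, and the gloo-based setup is simply the standard torch.distributed way to coordinate CPU-only workers launched with torchrun:

import abc
import datetime
import os

import torch.distributed as dist

class DatasetPreparator(abc.ABC):
    """Hypothetical base class; one subclass per model family (GPT, VLM, ...)."""

    @abc.abstractmethod
    def run(self) -> None:
        ...

class GPTMemmapPreparator(DatasetPreparator):
    """Illustrative GPT variant: download, tokenize, and write memmap shards."""

    def run(self) -> None:
        # CPU-only coordination: gloo backend, no CUDA, no distributed dims.
        # Launched e.g. via: torchrun --nnodes 2 --nproc_per_node 1 --no_python fast-llm prepare ...
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
        if world_size > 1:
            dist.init_process_group(backend="gloo", timeout=datetime.timedelta(minutes=30))
            rank = dist.get_rank()
        else:
            rank = 0
        # ... each rank downloads, tokenizes, and saves its slice of the dataset here ...
        if world_size > 1:
            dist.barrier()
            dist.destroy_process_group()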