Closed tscholak closed 1 week ago
Note to self: `hf_transfer` for downloading!
@jlamypoirier
I introduced some classes so that we can have different data preparations for different models. This class hierarchy should eventually go somewhere else, depending on what the outcome of #34 will be. I only made one implementation for GPTs so far. We need another one for VLMs, which can be extracted from #5. I'm coordinating with @akshaykalkunte to do just that.
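For illustration only, a hierarchy of this kind could look roughly like the sketch below. All names here (`DatasetPreparator`, `GPTDatasetPreparator`, `PREPARATORS`, `prepare_sample`) are hypothetical and are not the classes added in this PR. The tokenizer is stubbed as a plain callable so the example stays self-contained.

```python
from abc import ABC, abstractmethod


class DatasetPreparator(ABC):
    """Hypothetical base class: one subclass per model family."""

    # Key used to select a preparator from a config value (assumed convention).
    model_type: str

    @abstractmethod
    def prepare_sample(self, sample: dict) -> dict:
        """Turn one raw sample into the fields the model family needs."""


class GPTDatasetPreparator(DatasetPreparator):
    """Hypothetical GPT implementation: text in, token ids out."""

    model_type = "gpt"

    def __init__(self, tokenize):
        # `tokenize` is any callable mapping a string to a list of token ids.
        self._tokenize = tokenize

    def prepare_sample(self, sample: dict) -> dict:
        return {"input_ids": self._tokenize(sample["text"])}


# Registry keyed by model type; a VLM preparator would be added here later.
PREPARATORS = {cls.model_type: cls for cls in (GPTDatasetPreparator,)}
```

With a layout like this, a VLM preparator would be one more subclass plus one more registry entry, and the command itself would not need to change.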
About the changes to the Dockerfile: I realized that we can install Fast-LLM in editable mode globally if we make /app writable for everyone. I therefore removed the fast-llm user. This also resolves a problem where fast-llm isn't usable if the user's id in the job doesn't match the id of the fast-llm user of the image. These changes are part of this PR because I needed them to test data preparation on Toolkit with the images created by the CI action. On Toolkit, the user has the id 13013 whereas CI builds the image with user id 1000. I don't think we can make any assumptions about the environment in which the official Fast-LLM image will be deployed, which is why it's better to remove user creation altogether.
I looked into using Fast-LLM's existing Distributed and DistributedConfig but found them too difficult to adapt to this straightforward CPU-only use case. I do not want to have to deal with distributed dims or CUDA rng initializations for running this simple data preparation code on multiple nodes. Users shouldn't have to bring GPUs for data processing if they aren't needed.
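As a rough sketch of what a CPU-only alternative can look like, each process can derive its share of the work from its rank and the world size alone, with no device or RNG setup. The helper names `shard_indices` and `rank_from_env` are hypothetical and not part of Fast-LLM. The sketch assumes the launcher exports `RANK` and `WORLD_SIZE`, as `torchrun` does.

```python
import os


def shard_indices(num_items: int, rank: int, world_size: int) -> range:
    """Return the contiguous range of item indices that `rank` should process."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world size {world_size}")
    base, extra = divmod(num_items, world_size)
    # The first `extra` ranks take one additional item so the split stays balanced.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def rank_from_env() -> tuple[int, int]:
    """Read (rank, world_size) from the environment, defaulting to a single process."""
    return int(os.environ.get("RANK", "0")), int(os.environ.get("WORLD_SIZE", "1"))
```

Each worker would then process only `shard_indices(num_items, *rank_from_env())` and write its own output shard, so no GPU is needed at any point.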
✨ Description
Extracted and refined the dataset preparation script from #17. Made it a command like `train` or `convert`. Example call and config:
or
where `foo.yaml` contains:
Run `git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M` in `tmp` to get that tokenizer file.
This will produce:
with `fast_llm_dataset.json` reading:
The `downloaded_dataset` can be deleted afterwards. It is not used by Fast-LLM.
🔍 Type of change
Select all that apply:
📝 Changes
`prepare_dataset` command
Dockerfile
📊 Performance Impact Details
N/A
📝 Additional Notes
N/A