aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.04k stars 6.75k forks

[Example Request] Minimal Example for Fine Tuning a LLM with FSDP utilizing the HuggingFace Trainer #4580

Open HuBaX opened 7 months ago

HuBaX commented 7 months ago

**Describe the use case example you want to see**

I'm trying to fine-tune an LLM with FSDP on a single instance with multiple GPUs, using the HuggingFace Trainer for training. Since I couldn't get it to work, I looked through the model_parallel examples in this repo and ended up even more confused than before. The examples here are all so large that it's hard for me to see what I actually have to do to get FSDP working for my use case, especially since I'm quite new to SageMaker and have never used FSDP before. I also don't know how much of the FSDP setup the HuggingFace Trainer already handles for me. I'd be glad if someone could provide a minimal example for my use case.

**How would this example be used? Please describe.**

The example would be a reference for developers trying to get FSDP working with the HuggingFace Trainer.

**Describe which SageMaker services are involved**

Notebook Instances and Training Jobs

**Describe what other services (other than SageMaker) are involved**

S3 - for loading the dataset as well as storing the model weights