
End-to-End LLM Model Development with Torchtitan and Torchtune #341

Open KeitaW opened 4 months ago

KeitaW commented 4 months ago

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

KeitaW commented 4 months ago

Progress update: I have created a draft README.md and verified that the following scripts work:

pbelevich commented 4 months ago

The order of the preparation steps looks confusing to me. We assume the user starts this tutorial in some arbitrary filesystem location, yet that location already contains 0.create-dot-env.sh from this repo; we ask them to run it to create .env in that arbitrary location and then run source .env. Next we ask the user to go to a location defined by .env and clone this repo there. The following steps then assume that 1.build-image.sbatch finds .env in the newly cloned location. So the user would need two .env files: one before git clone, another after?

I suggest changing the preparation steps as follows:

  1. Choose and go to the working directory, then run:
    cd <Some User defined FSX location>
    export FSX_PATH=`pwd`
  2. Clone awsome-distributed-training:
    git clone https://github.com/aws-samples/awsome-distributed-training
  3. Go to the test case path:
    cd awsome-distributed-training/3.test_cases/torchtitan-torchtune/slurm
  4. Run 0.configure-env-vars.sh to create .env (a sketch of such a script follows below):
    ./0.configure-env-vars.sh
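
For illustration, here is a minimal sketch of what 0.configure-env-vars.sh could do under this layout. The script name and FSX_PATH come from the steps above; the specific .env contents (APPS_PATH, MODEL_PATH) and the path-derivation logic are assumptions for the sketch, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of 0.configure-env-vars.sh: write the variables that
# later sbatch scripts need into a .env file in the current directory.
set -euo pipefail

# Derive FSX_PATH from the clone location (four levels above .../slurm),
# unless the user already exported it as in step 1.
FSX_PATH="${FSX_PATH:-$(cd ../../../.. && pwd)}"

cat > .env <<EOF
export FSX_PATH=${FSX_PATH}
export APPS_PATH=${FSX_PATH}/apps      # hypothetical: container images
export MODEL_PATH=${FSX_PATH}/models   # hypothetical: model checkpoints
EOF

echo "Wrote $(pwd)/.env"
```

With this layout, subsequent scripts such as 1.build-image.sbatch can simply source .env from the same test-case directory, which avoids the two-.env situation described above.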
KeitaW commented 4 months ago

Thanks @pbelevich for the suggestion, I agree. I have updated README.md to guide users to clone the repository first.

KeitaW commented 3 months ago

The basic functionality has been implemented. Allow me to iterate in other PRs...