
End-to-End LLM Model Development with Torchtitan and Torchtune #341

Open KeitaW opened 4 months ago

KeitaW commented 4 months ago

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

KeitaW commented 4 months ago

Progress update: I have created a draft README.md and verified that the following scripts work:

pbelevich commented 4 months ago

The order of the preparation steps looks confusing to me. We assume the user starts this tutorial in some arbitrary filesystem location, yet that location already contains 0.create-dot-env.sh from this repo; we ask them to run it to create .env in that arbitrary location and then run source .env. Next we ask the user to go to a location defined by .env and clone this repo there. The following steps then assume that 1.build-image.sbatch finds .env in the newly cloned location. So the user would need two .env files: one before git clone, another after?

I suggest changing the preparation steps as follows:

  1. Choose and go to the working directory, then run:
    cd <Some User defined FSX location>
    export FSX_PATH=`pwd`
  2. Clone awsome-distributed-training:
    git clone https://github.com/aws-samples/awsome-distributed-training
  3. Go to the test case path:
    cd awsome-distributed-training/3.test_cases/torchtitan-torchtune/slurm
  4. Run 0.configure-env-vars.sh to create .env (a sketch of such a script follows below):
    ./0.configure-env-vars.sh
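
For illustration, here is a minimal sketch of what 0.configure-env-vars.sh could do under this layout. The script name and FSX_PATH come from the steps above; the specific .env contents (APPS_PATH, MODEL_PATH) and the path-derivation logic are assumptions for the sketch, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of 0.configure-env-vars.sh: write the variables that
# later sbatch scripts need into a .env file in the current directory.
set -euo pipefail

# Derive FSX_PATH from the clone location (four levels above .../slurm),
# unless the user already exported it as in step 1.
FSX_PATH="${FSX_PATH:-$(cd ../../../.. && pwd)}"

cat > .env <<EOF
export FSX_PATH=${FSX_PATH}
export APPS_PATH=${FSX_PATH}/apps      # hypothetical: container images
export MODEL_PATH=${FSX_PATH}/models   # hypothetical: model checkpoints
EOF

echo "Wrote $(pwd)/.env"
```

With this layout, subsequent scripts such as 1.build-image.sbatch can simply source .env from the same test-case directory, which avoids the two-.env situation described above.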
KeitaW commented 4 months ago

Thanks @pbelevich for the suggestion, I agree. I have updated README.md to guide users to clone the repository first.

KeitaW commented 3 months ago

The basic functionality has been implemented. Allow me to iterate in other PRs...