axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

[POC] A few addons to axolotl training lifecycle #315

Open utensil opened 1 year ago

utensil commented 1 year ago

This is not an issue, but a description of a POC of a few addons to the axolotl training lifecycle. Some of the addons could be valid feature requests, while others are out of scope for axolotl itself but make sense for something like an "axolotl runner". I'll probably extract them into such a runner in a separate repo later, but for now I'll describe and show how they work here.

The purpose of this issue is to invite discussion about the POC, and about whether all of these should be addressed during the axolotl training lifecycle.

TLDR

The POC implements the idea that simply committing a yaml config to a git repo could:

  1. spin up a RunPod (or vast.ai, or any GPU cloud with API support) GPU instance and prepare for training (e.g. pre-download the model really fast)
  2. report (e.g. to W&B), monitor (e.g. whether the instance has been idle for too long), and notify (e.g. to Discord) during the process
  3. finish things up after the training (upload the trained model, terminate the instance, etc.).
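The three steps above amount to a simple lifecycle the runner drives in order. A hedged sketch of that shape (the phase callables here are stand-ins, not the POC's actual functions):

```python
def run_task(spin_up, monitor, finish):
    """Drive the three lifecycle phases in order (hypothetical runner shape)."""
    pod = spin_up()          # 1. create the GPU instance, prepare training
    status = monitor(pod)    # 2. report / monitor / notify until done
    finish(pod)              # 3. upload the model, terminate the instance
    return status

# Toy usage with stub phases that just record what ran.
trace = []
status = run_task(
    spin_up=lambda: trace.append("spin_up") or "pod-123",
    monitor=lambda pod: trace.append(f"monitor {pod}") or "FINISHED",
    finish=lambda pod: trace.append(f"finish {pod}"),
)
print(status, trace)  # → FINISHED ['spin_up', 'monitor pod-123', 'finish pod-123']
```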

It looks like this in my Discord channel. Everything is triggered by committing a yml config (possibly from a phone), so I can easily and efficiently test a yml config (or multiple configs in parallel) even while I'm having lunch, driving somewhere, or handling anything else in real life:

[screenshot: Discord notifications during a training run]

In case anything goes wrong during training:

[screenshot: Discord notification when training fails]

In case I forget to terminate the instance, or the script fails to do so and leaves it idle:

[screenshot: Discord notification about an idle instance]

Warning: the POC works nicely for me, but the components are not well organized and the code is hacky and messy at the moment.

An example yml file

There's an example yml file here: https://github.com/utensil/llm-playground/blob/main/tasks/test_project/minotaur.yml

The yml file is mostly an ordinary axolotl yml config, plus some extra config for the GPU instance and some feature switches for the runner, as shown below:

runpod:
  entry: |
    bash -c "curl -H 'Cache-Control: no-cache' https://raw.githubusercontent.com/utensil/llm-playground/main/scripts/entry/ax_lite_train.sh -sSf | bash"

  # "NVIDIA RTX A5000" # "NVIDIA RTX A6000" "NVIDIA A100-SXM4-80GB"
  gpu: "NVIDIA RTX A6000"
  # pod_type: INTERRUPTABLE
  cloud_type: "SECURE" # "ALL" "COMMUNITY"
  max_bid_per_gpu: 2.0
  # template_id: xxxxxxxxxx
  gpu_count: 1
  container_disk_in_gb: 50
  volume_in_gb: 200
  min_vcpu_count: 8
  min_memory_in_gb: 29
  min_download: 2000
  min_upload: 1500
  stop_after: 3600
  terminate_after: -1
  # Set to false to stay running after training
  one_shot: true
  env:
    TEST_ENV: happy
  # deepspeed: true

Most of them are self-explanatory and are passed to the RunPod API to create a GPU instance matching the criteria; one_shot indicates whether to terminate the instance after training.

For now I think it's easier to maintain these configs alongside the axolotl configs in one file; together they form a training task. The extra config might manipulate and enhance the axolotl configs, as you'll see below.
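One way to keep both in one file is to parse the yml and pop the runner-only section before handing the rest to axolotl unchanged. A minimal sketch of that split (the non-`runpod` keys below are placeholders, not the real minotaur.yml contents):

```python
def split_task_config(task: dict):
    """Pop the runner-only `runpod` section; the rest is ordinary axolotl config."""
    axolotl_cfg = dict(task)                 # don't mutate the caller's dict
    runner_cfg = axolotl_cfg.pop("runpod", {})
    return runner_cfg, axolotl_cfg

# The dict as it would come out of yaml.safe_load on the combined file.
task = {
    "base_model": "some/base-model",         # placeholder axolotl keys
    "micro_batch_size": 1,
    "runpod": {"gpu": "NVIDIA RTX A6000", "one_shot": True},
}
runner, axolotl_cfg = split_task_config(task)
print(runner["gpu"], "runpod" in axolotl_cfg)  # → NVIDIA RTX A6000 False
```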

Step 1: Spin up a GPU instance

The related POC code comprises:

The workflow triggers another, periodic workflow that monitors the GPU instances and posts notifications to the Discord channel, so I won't forget to terminate idle instances. When there are no GPU instances left, the workflow disables itself until it's woken up by another training task.

The monitoring workflow could be enhanced to terminate an instance automatically if it was spun up by the runner and has remained idle for too long.
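The idle check itself can be a simple timestamp comparison against a limit like `stop_after`. A hedged sketch, assuming the monitor already has each pod's last-activity time (the real workflow would fetch pod state from the RunPod API instead of this in-memory list):

```python
import time

def find_idle_pods(pods, idle_limit_s=3600, now=None):
    """Return ids of pods whose last activity exceeds the idle limit."""
    now = time.time() if now is None else now
    return [p["id"] for p in pods if now - p["last_active"] > idle_limit_s]

# Toy data with explicit timestamps so the result is deterministic.
pods = [
    {"id": "pod-a", "last_active": 0},      # idle for a long time
    {"id": "pod-b", "last_active": 9_500},  # recently active
]
print(find_idle_pods(pods, idle_limit_s=3600, now=10_000))  # → ['pod-a']
```

A periodic workflow would notify about (or terminate) whatever this returns.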

Step 2: Enhance the training

The related POC code is mostly in a Python script that monkey-patches axolotl. It does the following:
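Monkey-patching here means replacing an axolotl entry point with a wrapped version before training starts. A generic sketch of the pattern (the module and function are stand-ins, not axolotl's real internals):

```python
import types

# Stand-in for an axolotl module; the real POC patches axolotl's own entry point.
fake_axolotl = types.SimpleNamespace(train=lambda cfg: f"trained {cfg}")

events = []

def with_notifications(train_fn):
    """Wrap a train function to emit events before and after it runs."""
    def wrapped(cfg):
        events.append("start")      # e.g. post to Discord / report to W&B
        result = train_fn(cfg)
        events.append("done")
        return result
    return wrapped

# The monkey-patch: swap the original function for the wrapped one.
fake_axolotl.train = with_notifications(fake_axolotl.train)
print(fake_axolotl.train("my-config"))  # → trained my-config
```

The same wrapping works for any hookable step (loading, checkpointing, logging) without forking axolotl.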

Step 3: Finish things up

The related POC code is also in the Python script above. It does the following:
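Finishing up essentially means wrapping the training call in a try/finally so that upload and (in one-shot mode) termination happen even if training fails. A sketch with hypothetical helper callables, not the POC's actual upload target or API calls:

```python
def finish_up(train, upload, terminate, one_shot=True):
    """Run training, then always upload artifacts; terminate only in one-shot mode."""
    try:
        train()
    finally:
        upload()             # push the trained model (e.g. to a hub or bucket)
        if one_shot:
            terminate()      # shut the GPU instance down

# Toy usage recording the order of operations.
log = []
finish_up(lambda: log.append("train"),
          lambda: log.append("upload"),
          lambda: log.append("terminate"),
          one_shot=True)
print(log)  # → ['train', 'upload', 'terminate']
```

With `one_shot=False` the instance stays up after upload, matching the config switch shown earlier.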

ashercn97 commented 1 year ago

THIS IS SO COOL

ashercn97 commented 1 year ago

WOULD I HAVE TO PAY FOR ACCESS TO THIS?

utensil commented 1 year ago

> WOULD I HAVE TO PAY FOR ACCESS TO THIS?

No, it's all open source and free to reuse for building your own workflow. But currently it's too tightly integrated with my code base and other stuff, and I haven't had the time to extract it into something that works simply by forking and adding a config.

ashercn97 commented 1 year ago

@utensil okay! Ty!