axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

[POC] A few addons to axolotl training lifecycle #315

Open utensil opened 1 year ago

utensil commented 1 year ago

This is not an issue, but a description of a POC of a few addons to the axolotl training lifecycle. Some of the addons could be valid feature requests, while others are out of scope for axolotl itself but make sense for something like an "axolotl runner". I'll probably extract them into such a runner in a separate repo later, but for now I'll describe and show how they work here.

The purpose of this issue is to invite discussion about the POC, and about whether all of these should be addressed during the axolotl training lifecycle.

TLDR

The POC implements the idea that simply committing a yaml config to a git repo could:

  1. spin up a RunPod (or vast.ai, or any GPU cloud with API support) GPU instance and prepare for training (e.g. pre-download the model really fast)
  2. report (e.g. to W&B), monitor (e.g. whether the instance has been idle for too long), and notify (e.g. to Discord) during the process
  3. finish things up after the training (upload the trained model, terminate the instance, etc.).
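The three steps above amount to a simple lifecycle the runner drives in order. A hedged sketch of that shape (the phase callables here are stand-ins, not the POC's actual functions):

```python
def run_task(spin_up, monitor, finish):
    """Drive the three lifecycle phases in order (hypothetical runner shape)."""
    pod = spin_up()          # 1. create the GPU instance, prepare training
    status = monitor(pod)    # 2. report / monitor / notify until done
    finish(pod)              # 3. upload the model, terminate the instance
    return status

# Toy usage with stub phases that just record what ran.
trace = []
status = run_task(
    spin_up=lambda: trace.append("spin_up") or "pod-123",
    monitor=lambda pod: trace.append(f"monitor {pod}") or "FINISHED",
    finish=lambda pod: trace.append(f"finish {pod}"),
)
print(status, trace)  # → FINISHED ['spin_up', 'monitor pod-123', 'finish pod-123']
```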

It looks like this in my Discord channel. Everything is triggered by committing a yml config (possibly from a phone), so I can easily and efficiently test a yml config (or multiple configs in parallel) even while I'm having lunch, driving somewhere, or handling anything else in real life:

[screenshot: Discord notifications during a training run]

In case anything goes wrong during training:

[screenshot: Discord notification when training fails]

In case I forget to terminate the instance, or the script fails to do so and leaves it idle:

[screenshot: Discord notification about an idle instance]

Warning: the POC works nicely for me, but the components are not well organized and the code is hacky and messy at the moment.

An example yml file

There's an example yml file here: https://github.com/utensil/llm-playground/blob/main/tasks/test_project/minotaur.yml

The yml file is mostly an ordinary axolotl yml config, plus some extra config for the GPU instance and some feature switches for the runner, as shown below:

runpod:
  entry: |
    bash -c "curl -H 'Cache-Control: no-cache' https://raw.githubusercontent.com/utensil/llm-playground/main/scripts/entry/ax_lite_train.sh -sSf | bash"

  # "NVIDIA RTX A5000" # "NVIDIA RTX A6000" "NVIDIA A100-SXM4-80GB"
  gpu: "NVIDIA RTX A6000"
  # pod_type: INTERRUPTABLE
  cloud_type: "SECURE" # "ALL" "COMMUNITY"
  max_bid_per_gpu: 2.0
  # template_id: xxxxxxxxxx
  gpu_count: 1
  container_disk_in_gb: 50
  volume_in_gb: 200
  min_vcpu_count: 8
  min_memory_in_gb: 29
  min_download: 2000
  min_upload: 1500
  stop_after: 3600
  terminate_after: -1
  # Set to false to stay running after training
  one_shot: true
  env:
    TEST_ENV: happy
  # deepspeed: true

Most of them are self-explanatory and are passed to the RunPod API to create a GPU instance matching the criteria; one_shot indicates whether to terminate the instance after training.

For now I think it's easier to maintain these configs alongside the axolotl configs in one file; together they form a training task. The extra config might manipulate and enhance the axolotl configs, as you'll see below.
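One way to keep both in one file is to parse the yml and pop the runner-only section before handing the rest to axolotl unchanged. A minimal sketch of that split (the non-`runpod` keys below are placeholders, not the real minotaur.yml contents):

```python
def split_task_config(task: dict):
    """Pop the runner-only `runpod` section; the rest is ordinary axolotl config."""
    axolotl_cfg = dict(task)                 # don't mutate the caller's dict
    runner_cfg = axolotl_cfg.pop("runpod", {})
    return runner_cfg, axolotl_cfg

# The dict as it would come out of yaml.safe_load on the combined file.
task = {
    "base_model": "some/base-model",         # placeholder axolotl keys
    "micro_batch_size": 1,
    "runpod": {"gpu": "NVIDIA RTX A6000", "one_shot": True},
}
runner, axolotl_cfg = split_task_config(task)
print(runner["gpu"], "runpod" in axolotl_cfg)  # → NVIDIA RTX A6000 False
```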

Step 1: Spin up a GPU instance

The related POC code comprises:

The workflow triggers another, periodic workflow that monitors the GPU instances and posts notifications to the Discord channel, so I won't forget to terminate idle instances. When there are no GPU instances left, the workflow disables itself until it's woken up by another training task.

The monitoring workflow could be enhanced to terminate an instance automatically if it was spun up by the runner and has remained idle for too long.
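The idle check itself can be a simple timestamp comparison against a limit like `stop_after`. A hedged sketch, assuming the monitor already has each pod's last-activity time (the real workflow would fetch pod state from the RunPod API instead of this in-memory list):

```python
import time

def find_idle_pods(pods, idle_limit_s=3600, now=None):
    """Return ids of pods whose last activity exceeds the idle limit."""
    now = time.time() if now is None else now
    return [p["id"] for p in pods if now - p["last_active"] > idle_limit_s]

# Toy data with explicit timestamps so the result is deterministic.
pods = [
    {"id": "pod-a", "last_active": 0},      # idle for a long time
    {"id": "pod-b", "last_active": 9_500},  # recently active
]
print(find_idle_pods(pods, idle_limit_s=3600, now=10_000))  # → ['pod-a']
```

A periodic workflow would notify about (or terminate) whatever this returns.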

Step 2: Enhance the training

The related POC code is mostly in a Python script that monkey-patches axolotl. It does the following:
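Monkey-patching here means replacing an axolotl entry point with a wrapped version before training starts. A generic sketch of the pattern (the module and function are stand-ins, not axolotl's real internals):

```python
import types

# Stand-in for an axolotl module; the real POC patches axolotl's own entry point.
fake_axolotl = types.SimpleNamespace(train=lambda cfg: f"trained {cfg}")

events = []

def with_notifications(train_fn):
    """Wrap a train function to emit events before and after it runs."""
    def wrapped(cfg):
        events.append("start")      # e.g. post to Discord / report to W&B
        result = train_fn(cfg)
        events.append("done")
        return result
    return wrapped

# The monkey-patch: swap the original function for the wrapped one.
fake_axolotl.train = with_notifications(fake_axolotl.train)
print(fake_axolotl.train("my-config"))  # → trained my-config
```

The same wrapping works for any hookable step (loading, checkpointing, logging) without forking axolotl.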

Step 3: Finish things up

The related POC code is also in the Python script above. It does the following:
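Finishing up essentially means wrapping the training call in a try/finally so that upload and (in one-shot mode) termination happen even if training fails. A sketch with hypothetical helper callables, not the POC's actual upload target or API calls:

```python
def finish_up(train, upload, terminate, one_shot=True):
    """Run training, then always upload artifacts; terminate only in one-shot mode."""
    try:
        train()
    finally:
        upload()             # push the trained model (e.g. to a hub or bucket)
        if one_shot:
            terminate()      # shut the GPU instance down

# Toy usage recording the order of operations.
log = []
finish_up(lambda: log.append("train"),
          lambda: log.append("upload"),
          lambda: log.append("terminate"),
          one_shot=True)
print(log)  # → ['train', 'upload', 'terminate']
```

With `one_shot=False` the instance stays up after upload, matching the config switch shown earlier.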

ashercn97 commented 1 year ago

THIS IS SO COOL

ashercn97 commented 1 year ago

WOULD I HAVE TO PAY FOR ACCESS TO THIS?

utensil commented 1 year ago

> WOULD I HAVE TO PAY FOR ACCESS TO THIS?

No, it's all open source and free to reuse for building your own workflow. But currently it's too tightly integrated with my code base and other stuff, and I haven't had the time to extract it into something that works simply by forking and adding a config.

ashercn97 commented 1 year ago

@utensil okay! Ty!