Open utensil opened 1 year ago
THIS IS SO COOL
WOULD I HAVE TO PAY FOR ACCESS TO THIS?
No, it's all open-sourced and free to reuse to build your own workflow. But currently this is too tightly integrated with other stuff in my code base, and I haven't had the time to extract it into something that works simply by forking and adding a config yet.
@utensil okay! Ty!
This is not an issue but a description of a POC of a few addons to the axolotl training lifecycle. Some addons could be valid feature requests, and some are out of scope for axolotl itself but make sense for something like an "axolotl runner". I'll probably extract it into such a runner in a separate repo later, but I'll describe and show how it works here and now.
The purpose of this issue is to invite discussion about the POC, and about the need to address all of these in the axolotl training lifecycle.
TLDR
The POC implements the idea that simply committing a yaml config to a git repo could:

- spin up a GPU instance matching the criteria in the config
- run axolotl training with a few enhancements
- finish things up: notify my discord channel, optionally terminate the instance, and upload the model to HF
It looks like this in my discord channel, all triggered by committing a yml config (possibly from a phone), so that I can easily and efficiently test a yml config (or multiple configs in parallel) even when I'm having lunch, driving somewhere, or handling anything else in real life:
In case anything goes wrong during training:
In case I forget to terminate the instance, or the script fails to do so and leaves it idle:
Warning: the POC works for me nicely, but the components are not well organized, and the code is hacky and messy at the moment.
An example yml file
There's an example yml file here: https://github.com/utensil/llm-playground/blob/main/tasks/test_project/minotaur.yml
The yml file is mostly just an ordinary axolotl yml config, plus some extra configs for the GPU instance and some feature switches for the runner, as described below.
Most of them are self-explanatory and are passed to the RunPod API to create a GPU instance matching the criteria; `one_shot` indicates whether to terminate the instance after training.

For now I think it's easier to maintain these configs alongside the axolotl configs in one file; together they form a training task. The extra config might also manipulate and enhance the axolotl configs, as you'll see below.
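To make the split concrete, here is a minimal sketch (not the actual runner code) of how such a combined file could be loaded: runner-specific keys are popped off, and whatever remains is treated as a plain axolotl config. The GPU-related key names below are hypothetical placeholders; only `one_shot` is taken from the description above.

```python
import yaml

# Keys consumed by the runner rather than by axolotl.
# The GPU-related names are hypothetical placeholders; `one_shot` is the switch described above.
RUNNER_KEYS = ["gpu", "gpu_count", "cloud_type", "one_shot"]

def split_task_config(path: str):
    """Load a combined task yml and split it into (runner_config, axolotl_config)."""
    with open(path) as f:
        task = yaml.safe_load(f)

    runner_config = {k: task.pop(k) for k in RUNNER_KEYS if k in task}
    # Whatever is left is an ordinary axolotl config.
    axolotl_config = task
    return runner_config, axolotl_config

if __name__ == "__main__":
    runner_cfg, axolotl_cfg = split_task_config("tasks/test_project/minotaur.yml")
    print("runner:", runner_cfg)
    print("axolotl keys:", sorted(axolotl_cfg))
```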
Step 1: Spin up a GPU instance
The related POC code consists of a workflow triggered by committing the yml config, plus a script that passes the GPU criteria to the RunPod API to create a matching instance.
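For illustration, a minimal sketch of that spin-up call, assuming the `runpod` Python SDK and reusing the `runner_cfg` dict from the earlier sketch (the POC may well call the API differently); the instance name, image, and GPU defaults are placeholders:

```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

def spin_up(runner_cfg: dict) -> str:
    """Create a RunPod GPU instance matching the criteria from the task yml."""
    pod = runpod.create_pod(
        name="axolotl-task",
        # Placeholder image; the POC would use an image with axolotl installed.
        image_name="winglian/axolotl:main-latest",
        gpu_type_id=runner_cfg.get("gpu", "NVIDIA RTX A6000"),
        gpu_count=runner_cfg.get("gpu_count", 1),
    )
    return pod["id"]
```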
The workflow will trigger another periodic workflow to monitor the GPU instances and notify the discord channel, so I won't forget to terminate idle instances. When there are no GPU instances left, the monitoring workflow will disable itself until it's woken up by another training task.

The monitoring workflow could be enhanced to terminate an instance if it was triggered automatically and has remained idle for too long.
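A sketch of what such a periodic monitor could look like, assuming the `runpod` SDK and a plain Discord webhook; the webhook variable and the commented-out idle heuristic are placeholders, not the POC's actual logic:

```python
import os
import requests
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]
DISCORD_WEBHOOK_URL = os.environ["DISCORD_WEBHOOK_URL"]  # placeholder webhook

def notify(message: str) -> None:
    """Post a message to the Discord channel via a webhook."""
    requests.post(DISCORD_WEBHOOK_URL, json={"content": message}, timeout=30)

def monitor() -> None:
    """List running pods, report them to Discord, and flag idle ones."""
    pods = runpod.get_pods()
    if not pods:
        notify("No GPU instances running; the monitor can disable itself.")
        return
    for pod in pods:
        notify(f"GPU instance `{pod['name']}` ({pod['id']}) is still running.")
        # Possible enhancement: terminate auto-created pods that sit idle too long,
        # e.g. with hypothetical helpers like these:
        # if is_auto_created(pod) and is_idle_too_long(pod):
        #     runpod.terminate_pod(pod["id"])

if __name__ == "__main__":
    monitor()
```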
Step 2: Enhance the training
The related POC code is mostly in a Python script that monkey-patches axolotl. It does the following:

- Implement `compute_metrics` and log the inputs, predictions, and labels during eval, so one can check what is causing the eval loss spikes (#311); a sketch follows below
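To illustrate the eval logging idea, here is a minimal sketch built on HF Trainer's `compute_metrics` hook with `include_inputs_for_metrics=True`; this is not axolotl's code nor the actual monkey-patch, just the idea:

```python
import numpy as np

def make_compute_metrics(tokenizer, max_rows: int = 4):
    """Return a compute_metrics fn that logs eval inputs, predictions and labels."""

    def compute_metrics(eval_pred):
        # With include_inputs_for_metrics=True the Trainer also passes the inputs.
        logits, labels, inputs = eval_pred.predictions, eval_pred.label_ids, eval_pred.inputs
        preds = np.argmax(logits, axis=-1)

        # -100 marks ignored positions; swap it for the pad token before decoding.
        pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
        labels = np.where(labels == -100, pad_id, labels)

        for i in range(min(max_rows, len(preds))):
            print("input:     ", tokenizer.decode(inputs[i], skip_special_tokens=True))
            print("prediction:", tokenizer.decode(preds[i], skip_special_tokens=True))
            print("label:     ", tokenizer.decode(labels[i], skip_special_tokens=True))

        return {}  # metrics could be added here; logging is the point of this sketch

    return compute_metrics

# Wired up roughly like:
#   TrainingArguments(..., include_inputs_for_metrics=True)
#   Trainer(..., compute_metrics=make_compute_metrics(tokenizer))
```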
Step 3: Finish things up

The related POC code is also in the Python script above. It does the following:
- If `one_shot` is set to true, terminate the GPU instance so it would not cost too much; if `one_shot` is set to false, leave the instance on so that one can debug it using jupyterlab
- Regardless of the `one_shot` setting, upload the model to HF, which is already supported by axolotl via `hub_model_id`, but I'll add support for creating a proper model card with a readme, the base model, the datasets, the W&B run url, etc. Some of these are specified in the config, and some are automatically figured out from the config or the run, so that humans don't have to maintain them manually (a rough sketch follows below).
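A rough sketch of this finish-up step, assuming the `runpod` SDK for termination, `wandb` for the run url, and `huggingface_hub` for the upload; the readme layout is illustrative only, and `runner_cfg`, `axolotl_cfg`, and `pod_id` tie back to the earlier sketches:

```python
import os
import runpod
import wandb
from huggingface_hub import HfApi

def finish_up(runner_cfg: dict, axolotl_cfg: dict, pod_id: str) -> None:
    """Upload a basic model card to the HF Hub, then terminate the pod if one_shot is set."""
    repo_id = axolotl_cfg["hub_model_id"]
    datasets = [d.get("path") for d in axolotl_cfg.get("datasets", [])]

    # Compose a readme from things already in the config or the run,
    # so humans don't have to maintain it manually.
    readme = "\n".join(
        [
            f"# {repo_id}",
            "",
            f"- base model: {axolotl_cfg.get('base_model')}",
            f"- datasets: {', '.join(filter(None, datasets))}",
            f"- W&B run: {wandb.run.get_url() if wandb.run else 'n/a'}",
        ]
    )
    # Assumes HF_TOKEN is set or `huggingface-cli login` has been run.
    HfApi().upload_file(
        path_or_fileobj=readme.encode(),
        path_in_repo="README.md",
        repo_id=repo_id,
    )

    if runner_cfg.get("one_shot"):
        # one_shot == true: stop paying for the instance once training is done.
        runpod.api_key = os.environ["RUNPOD_API_KEY"]
        runpod.terminate_pod(pod_id)
    # one_shot == false: leave the instance running for debugging via jupyterlab.
```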