I had a short chat with Rom. The first step would be to bootstrap a test environment instance with Terraform; the first one would be for Nvidia. @Gregory-Pereira, could you please identify the AWS instance requirements?
According to Stef, this is the instance type for the Nvidia test environment: `g5.8xlarge` with a 128 GB disk.
The first pass of this was merged in #411 using the basic image, but in the code you can see that the options for the `g5.8xlarge` machine type and 128 GB of storage were provided but commented out (I want to get something working on the absolute minimum infra we need and scale up from there). Currently working on the Ansible playbook that will install the required dependencies based on Stef's PR.
Currently working on the next pass in #413. This PR will install the bootc and e2e test deps onto the Terraform-provisioned instance via the Ansible playbook. I am, however, running into an issue. The E2E tests require `cuda-toolkit` and `build-essential`, among other things, and `cuda-toolkit` has no version available for Fedora 40. I don't have access to the AWS account to create a new AMI based on Fedora 39; Ubuntu 22.04 is another option. After discussion with Russell, it seems that their current workflows run on Ubuntu 22.04, so we will base our AMI on that to interface with them as quickly as possible, and we can iterate from there. Will have to pick this up tomorrow with access from @lmilbaum.
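For reference, here's a minimal sketch of what that dependency-install play might look like on an Ubuntu 22.04 instance. This is illustrative, not the actual #413 playbook: the task layout is an assumption, and the repo-setup step follows NVIDIA's documented apt keyring flow for Ubuntu 22.04.

```yaml
# Hypothetical sketch of the dependency-install play, assuming an
# Ubuntu 22.04 AMI and NVIDIA's CUDA apt repository; task names and
# structure are illustrative, not the actual #413 code.
- name: Install bootc and e2e test dependencies
  hosts: all
  become: true
  tasks:
    - name: Install build tooling
      ansible.builtin.apt:
        name: build-essential
        state: present
        update_cache: true

    - name: Add NVIDIA's CUDA apt repository (keyring package from NVIDIA)
      ansible.builtin.apt:
        deb: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

    - name: Install the CUDA toolkit
      ansible.builtin.apt:
        name: cuda-toolkit
        state: present
        update_cache: true
```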
To summarize what I'm trying to do with https://github.com/instructlab/instructlab/pull/1016:

1. Install and run `ilab` on the host OS using the single GPU worker type available built-in to GitHub. That's what the PR includes that's working as of this afternoon.
2. Build the `instructlab/containers/cuda/Containerfile` image and use that to run `ilab` instead of installing it directly. This is the more interesting test, but the step above was a helpful stepping stone. It also provides a reference to compare back to if something isn't working with the container.

Anyway, this is what I want to do next for `instructlab` CI.
The GPU worker from GitHub is a Tesla T4 with 16 GB of VRAM, so there are some limitations. `ilab train` on Linux typically needs more than that, but there's a `--4-bit-quant` option that makes it work ... up to a point. Converting the resulting model to GGUF doesn't work. That's this issue: https://github.com/instructlab/instructlab/issues/579
It sounds like you're testing with more powerful infrastructure, so you'll be able to exercise a more extensive workflow than the "smoke test" style I'm trying to get into `instructlab` CI.
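For concreteness, here's a rough sketch of what such a smoke-test workflow could look like. This is a hypothetical illustration, not the contents of #1016: the runner label, the image tag, and the `ilab` command sequence other than `ilab train --4-bit-quant` are assumptions; only the Containerfile path and the quantization flag come from the discussion above.

```yaml
# Hypothetical smoke-test workflow. Assumptions: a GPU runner label
# ("gpu-runner"), pip-installable ilab from the repo checkout, and an
# ilab command sequence sketched for illustration -- not the actual PR.
name: ilab-gpu-smoke-test
on: [pull_request]
jobs:
  host-install:
    runs-on: gpu-runner            # single Tesla T4 (16 GB) worker
    steps:
      - uses: actions/checkout@v4
      - name: Install ilab on the host OS (step 1 above)
        run: pip install .
      - name: Run the workflow with 4-bit quantization
        run: |
          ilab init --non-interactive   # flags here are assumptions
          ilab download
          ilab generate
          ilab train --4-bit-quant

  containerized:
    runs-on: gpu-runner
    steps:
      - uses: actions/checkout@v4
      - name: Build the cuda image (step 2 above)
        run: podman build -t ilab-cuda -f containers/cuda/Containerfile .
      - name: Run ilab from the container instead of the host
        # CDI device syntax for exposing the GPU to podman; the
        # command is a placeholder sanity check.
        run: podman run --rm --device nvidia.com/gpu=all ilab-cuda ilab --help
```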
A couple of us have kicked off writing E2E tests for the InstructLab workflow, based on the Containers that @rhatdan and friends have been adding to the AI Lab Recipes ... in the training directory.
To do that you'll probably need:

- From Dan Walsh: the InstructLab RHEL AI container images ... these are here https://github.com/containers/ai-lab-recipes/tree/main/training/instructlab ... or somewhere near quay.io/ai-lab
- From Russell Bryant: the work on getting the e2e tests complete. I worked a bunch on this today, as did he ... they should be somewhere near https://github.com/instructlab/instructlab/pull/1016 ... in particular, Russell made CPU testing (on beefy machines) run in less than an hour.

As an initial test goal, I think the following combination should work:

- a beefy AWS instance with many CPUs
- the InstructLab cuda RHEL AI image ... which should automatically fall back to CPU pytorch logic
- the E2E test from Russell and the PR above

Obviously the actual goal here is to use Bifrost RHEL AI accelerated images + internal InstructLab application container images for that testing. Dan is still working on those. But this is a good place to start. WDYT?
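As a sketch of that initial combination, something like the following Ansible play on the beefy CPU instance could drive it. All the specifics here are placeholders: the exact image tag under quay.io/ai-lab and the e2e test entrypoint aren't pinned down yet.

```yaml
# Hypothetical play for the initial CPU-only combination. The image
# name/tag and the test entrypoint are placeholders -- the real images
# live "somewhere near quay.io/ai-lab" and aren't finalized.
- name: Run the InstructLab e2e test in the cuda image (CPU fallback)
  hosts: all
  tasks:
    - name: Pull and run the e2e test container
      containers.podman.podman_container:
        name: ilab-e2e
        image: quay.io/ai-lab/instructlab-cuda:latest   # placeholder tag
        command: ./e2e-test.sh                          # placeholder entrypoint
        detach: false
        rm: true
```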
@Russell Bryant is also adding this test to InstructLab CI ... albeit without the Bifrost accelerated images. Anything to add, Russell?