containers / ai-lab-recipes

Examples for building and running LLM services and applications locally with Podman
Apache License 2.0

E2E testing for RHEL AI images #391

Closed lmilbaum closed 6 months ago

lmilbaum commented 6 months ago

A couple of us have kicked off writing E2E tests for the InstructLab workflow, based on the Containers that @rhatdan and friends have been adding to the AI Lab Recipes ... in the training directory.

To do that you'll probably need:

From Dan Walsh: the InstructLab RHEL AI container images ... these are here https://github.com/containers/ai-lab-recipes/tree/main/training/instructlab ... or somewhere near quay.io/ai-lab

From Russell Bryant: the work on getting the e2e tests complete. I worked a bunch on this today, as did he ... they should be somewhere near: https://github.com/instructlab/instructlab/pull/1016 ... in particular, Russell made CPU testing (on beefy machines) run in less than an hour.

As an initial test goal, I think the following combination should work:

- A beefy AWS instance with many CPUs
- The InstructLab CUDA RHEL AI image ... which should automatically fall back to CPU PyTorch logic
- The E2E test from Russell in the PR above

Obviously the actual goal here is to use Bifrost RHEL AI accelerated images + internal InstructLab application container images for that testing. Dan is still working on those. But this is a good place to start. WDYT?

@Russell Bryant is also adding this test to InstructLab CI ... albeit without the Bifrost accelerated images. Anything to add, Russell?

lmilbaum commented 6 months ago

I had a short chat with Rom. First step would be to bootstrap (with Terraform) a test environment instance. The first one would be for Nvidia. @Gregory-Pereira could you please identify the AWS instance requirements?

lmilbaum commented 6 months ago

According to Stef, this is the instance type for the Nvidia test environment - g5.8xlarge 128 GB disk
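
A minimal Terraform sketch of that instance, assuming an AWS provider is already configured. The resource name, AMI variable, and tag are placeholders; only the g5.8xlarge type and 128 GB disk come from the comment above:

```hcl
# Hypothetical sketch; resource name, ami value, and tag are placeholders.
resource "aws_instance" "e2e_nvidia" {
  ami           = var.ami_id      # placeholder: AMI chosen for the test environment
  instance_type = "g5.8xlarge"    # per Stef's recommendation above

  root_block_device {
    volume_size = 128             # GB, per the comment above
    volume_type = "gp3"
  }

  tags = {
    Name = "instructlab-e2e-nvidia"  # placeholder name
  }
}
```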

Gregory-Pereira commented 6 months ago

First pass of this was merged in #411 using the basic image. In the code you can see that the options for the g5.8xlarge machine type and 128 GB of storage were provided but commented out (I want to get something working on the absolute minimum infra we need and then scale up). Currently working on the Ansible playbook that will install the required dependencies, based on Stef's PR.

Gregory-Pereira commented 6 months ago

Currently working on the next pass in #413. This PR will install the bootc and e2e test deps onto the Terraform-provisioned instance via the Ansible playbook. I am, however, running into an issue: the E2E tests require the cuda-toolkit and build-essential, among other things, and the cuda-toolkit has no version available for Fedora 40. I have no access to the AWS account to create a new AMI based on Fedora 39; Ubuntu 22.04 is another option. After discussion with Russell, it seems that their current workflows run on Ubuntu 22.04, so we will be basing our AMI on that to interface with them as fast as possible, and we can iterate from there. Will have to pick this up tomorrow with access from @lmilbaum.
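
A hedged sketch of what the dependency-install portion of such a playbook could look like on Ubuntu 22.04. The host group is a placeholder, the NVIDIA apt-repo setup step is omitted, and the cuda-toolkit package name is an assumption; the actual playbook in #413 may differ:

```yaml
# Hypothetical sketch of dependency installation on Ubuntu 22.04.
- hosts: e2e_instances   # placeholder inventory group
  become: true
  tasks:
    - name: Install build tooling
      ansible.builtin.apt:
        name:
          - build-essential
          - git
        state: present
        update_cache: true

    - name: Install CUDA toolkit (assumes the NVIDIA apt repo is already configured)
      ansible.builtin.apt:
        name: cuda-toolkit
        state: present
```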

russellb commented 6 months ago

To summarize what I'm trying to do with https://github.com/instructlab/instructlab/pull/1016:

The GPU worker from GitHub is a Tesla T4 with 16 GB of GPU memory, so there are some limitations. ilab train on Linux typically needs more than that, but there's a --4-bit-quant option that makes it work ... up to a point. Converting the resulting model to gguf doesn't work; that's this issue: https://github.com/instructlab/instructlab/issues/579
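
As a sketch of the decision Russell describes: the 16 GB figure is the T4's memory from above, but the full-precision threshold is an assumption, since the thread only says training "typically needs more than that". The helper name is hypothetical, not part of InstructLab:

```python
# Hypothetical helper: decide whether to append --4-bit-quant to an
# `ilab train` invocation based on available GPU memory.
# The ~18 GB full-precision threshold is an assumption for illustration.
def build_train_command(vram_gb: float, full_precision_min_gb: float = 18.0) -> list[str]:
    cmd = ["ilab", "train"]
    if vram_gb < full_precision_min_gb:
        # Not enough memory for full precision, e.g. a 16 GB Tesla T4.
        cmd.append("--4-bit-quant")
    return cmd
```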

It sounds like you're testing with more powerful infrastructure, so you'll be able to exercise a more extensive workflow than the "smoke test" style I'm trying to get into instructlab CI.