AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Integrate kind for local testing #242

Closed IrvingMg closed 1 week ago

IrvingMg commented 3 weeks ago

Fixes / Features

As part of the kjob integration in #212, we are developing new features where having a full cluster on GKE with GPUs, Vertex AI, etc., feels like overkill. To simplify the development and testing process, this PR implements a local testing environment using kind.

This PR adds a new command to xpk for managing local Kubernetes clusters with kind, as well as the --kind-cluster flag for the batch command. The command to create a local cluster looks like this:

python3 xpk.py kind create --cluster xpk-test
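As a rough illustration of what such a command might do under the hood, here is a minimal sketch that shells out to the kind CLI. The function names are hypothetical and are not taken from the xpk source; the only facts relied on are that `kind create cluster --name <name>` creates a named cluster and that kind registers it in kubeconfig as the context `kind-<name>`.

```python
import shutil
import subprocess


def build_kind_create_command(cluster_name: str) -> list[str]:
    """Return the kind CLI invocation that creates a named local cluster."""
    return ["kind", "create", "cluster", "--name", cluster_name]


def kind_context_name(cluster_name: str) -> str:
    """kind registers the cluster in kubeconfig under the 'kind-' prefix."""
    return f"kind-{cluster_name}"


def create_local_cluster(cluster_name: str) -> int:
    """Hypothetical helper: run kind and return its exit code (1 if kind is missing)."""
    if shutil.which("kind") is None:
        print("kind is not installed or not on PATH")
        return 1
    return subprocess.run(build_kind_create_command(cluster_name)).returncode
```

Keeping the command construction separate from the subprocess call makes the logic checkable without kind or Docker installed.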

Once the local cluster is set up, you can use the --kind-cluster flag to run xpk commands against the local cluster instead of GKE. For example:

python3 xpk.py batch [other-options] --kind-cluster

This PR implements both the command for creating and managing local kind clusters and the --kind-cluster flag specifically for the batch command, which interacts with kjob and is well-suited for testing the local environment. Support for local testing may be extended to other xpk commands as needed.
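For readers unfamiliar with how a subcommand and a per-command flag fit together, the following is a hedged sketch using `argparse`. It mirrors only the command-line shape shown in this description (`kind create --cluster` and `batch --kind-cluster`); the parser layout and the `target_of` helper are assumptions for illustration, not the actual xpk implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI shape described in this PR."""
    parser = argparse.ArgumentParser(prog="xpk.py")
    commands = parser.add_subparsers(dest="command", required=True)

    # `kind create --cluster NAME` manages local clusters.
    kind = commands.add_parser("kind", help="manage local kind clusters")
    kind_actions = kind.add_subparsers(dest="action", required=True)
    create = kind_actions.add_parser("create")
    create.add_argument("--cluster", required=True)

    # `batch --kind-cluster` targets the local cluster instead of GKE.
    batch = commands.add_parser("batch")
    batch.add_argument("--kind-cluster", action="store_true")
    return parser


def target_of(args: argparse.Namespace) -> str:
    """Return 'kind' when the flag is set, otherwise 'gke' (assumed default)."""
    return "kind" if getattr(args, "kind_cluster", False) else "gke"
```

Because the flag is a plain boolean on the `batch` subparser, extending local testing to another command would, in this sketch, only require adding the same argument to that command's subparser.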

For more details on how to set up and use the local testing environment, please refer to the updated README.

Testing / Documentation

Testing details.

google-cla[bot] commented 3 weeks ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

IrvingMg commented 3 weeks ago

cc: @mbobrovskyi @mwysokin

pawloch00 commented 3 weeks ago

I doubt we should merge this into main. Let me ask about it.

pawloch00 commented 2 weeks ago

In my opinion, merging this into main with only the batch command running locally is not worth it. If more commands can be added and kind can be used for, e.g., integration tests in the GitHub pipeline, then I approve it.

IrvingMg commented 2 weeks ago

> In my opinion, merging this into main with only the batch command running locally is not worth it. If more commands can be added and kind can be used for, e.g., integration tests in the GitHub pipeline, then I approve it.

I agree it might not seem worth it, since we're only adding local testing for the batch command. However, I think this is a good opportunity to introduce local testing. Covering every case for local testing in a single PR would be a big effort, so I think we can add support for other commands bit by bit.

Besides that, we already have a couple of good examples (#236, #244) where the option of testing locally is useful. Since these features don't require a cluster with special resources such as GPUs or TPUs, we can test them on a local cluster instead of creating one on GKE, saving time and cost.

pawloch00 commented 2 weeks ago

OK, it would be nice to create an integration test using kind as a follow-up task for this PR and use it in the GitHub pipeline where possible.

IrvingMg commented 1 week ago

> OK, it would be nice to create an integration test using kind as a follow-up task for this PR and use it in the GitHub pipeline where possible.

Opened an issue for the integration test: #267