Define test for training operator in airgapped environment

canonical / bundle-kubeflow

Charmed Kubeflow

Apache License 2.0


Define test for training operator in airgapped environment #919

Closed NohaIhab closed 1 day ago

NohaIhab commented 1 month ago

Context

Currently, there is no defined test for the training operator in an airgapped environment. We need to define and document the testing process.

What needs to get done

Investigate which tests are feasible for the training operator in an airgapped environment
Define and document the set of tests to be run for the training operator

Definition of Done

The testing process for the training operator in an airgapped environment is defined and documented.

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5773.

This message was autogenerated

kimwnasptd commented 3 weeks ago

From a quick look at the Training Operator code, the operator can be parameterised with the images used for different init containers: https://github.com/kubeflow/training-operator/blob/v1.7-branch/cmd/training-operator.v1/main.go#L89-L98

However, the code does not set any actual, usable values for these parameters, and the manifests do not set them either: https://github.com/kubeflow/manifests/tree/v1.9-branch/apps/training-operator/upstream/base

So it's not yet clear when someone would need to set those values. We'll need to confirm.

NohaIhab commented 1 week ago

We'll need to define a manual test. To do it programmatically in a notebook, we'd have to pip install the kubeflow-training package, since it is not pre-installed in the notebook server images, and we cannot do that in an airgapped environment. The other option is to create our own notebook image that includes the kubeflow-training Python package. I'd like to avoid adding another image for the team to maintain if we can have a manual test with simple instructions for now.
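To confirm from inside a notebook whether a given image already ships the package, a check along these lines could be run without any network access. This is a minimal sketch: the helper name `is_distribution_installed` is hypothetical, and it only inspects installed package metadata through the standard library.

```python
from importlib import metadata


def is_distribution_installed(name: str) -> bool:
    """Return True if a distribution with this name is installed.

    Only local package metadata is read, so this works in an
    airgapped environment.
    """
    try:
        metadata.version(name)
    except metadata.PackageNotFoundError:
        return False
    return True


if __name__ == "__main__":
    # "kubeflow-training" is the package name mentioned in this thread.
    print(is_distribution_installed("kubeflow-training"))
```

If this prints False in the stock notebook server images, the programmatic route would require either network access or a custom image, which is the trade-off described above.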

NohaIhab commented 1 week ago

I tracked down the image used in the TFJob example in our UATs. It uses the input_data function from the tensorflow.examples.tutorials.mnist package, and the image is based on tensorflow/tensorflow:1.11.0. Looking at the input_data source code, there is a fake_data boolean argument, exposed in the mnist_with_summaries.py module, that generates dummy values to test with. We can:

1. Use the fake_data argument to avoid downloading data.
2. Explore the data_dir argument: copy the data into the airgapped environment and use it from local storage instead of downloading it.

I believe option 1 is sufficient: we don't need to test with real data. All this task needs to cover is the training operator functionality, and whether the data is real or fake should not affect that.
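As a rough illustration of option 1, the manual test could apply a TFJob whose container passes the fake_data flag to the training script. The sketch below only builds the manifest and prints it as JSON, which kubectl accepts. Several things here are assumptions rather than facts from this thread: the helper name `build_tfjob`, the job name, the registry and image path, the script location inside the image, and that the script accepts `--fake_data` as a bare command-line flag.

```python
import json


def build_tfjob(name: str, image: str, script_path: str, workers: int = 1) -> dict:
    """Build a TFJob manifest that runs the MNIST script with fake data."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": image,
                                    # Assumed flag: makes the script generate
                                    # dummy values instead of downloading MNIST.
                                    "command": ["python", script_path, "--fake_data"],
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    manifest = build_tfjob(
        name="tfjob-airgapped-test",  # hypothetical job name
        image="registry.local:5000/tf-mnist-with-summaries:latest",  # hypothetical airgapped registry path
        script_path="/var/tf_mnist/mnist_with_summaries.py",  # assumed script location in the image
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl as part of the manual instructions; the job reaching a succeeded state would then exercise the training operator without any download.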