Closed i-chvets closed 1 year ago
Hi @i-chvets, do you mind adding steps to reproduce and some logs? Is this only failing when upgrading or on install as well?
I was able to get the same message:
training-operator/0* blocked idle 10.1.216.35 K8S resources creation failed
juju deploy training-operator --channel 1.5/stable --trust #wait until it's active and idle
juju refresh training-operator --channel latest/edge
training-operator should be in BlockedStatus.
Let's dig a bit deeper in the charm code to figure out this one.
@DnPlas any issues on amd64?
juju deploy training-operator --channel 1.6/stable --trust
ERROR selecting releases: charm or bundle not found for channel "1.6/stable", platform "amd64"
available releases are:
channel "stable": available series are: focal
channel "candidate": available series are: focal
channel "beta": available series are: focal
channel "edge": available series are: focal
channel "1.3/stable": available series are: focal
channel "1.5/beta": available series are: focal
channel "1.5/edge": available series are: focal
channel "1.3/candidate": available series are: focal
channel "1.3/beta": available series are: focal
channel "1.3/edge": available series are: focal
channel "1.5/stable": available series are: focal
channel "1.5/candidate": available series are: focal
also
juju refresh training-operator --channel latest/edge --trust
ERROR option provided but not defined: --trust
What Juju version are you using?
Upgrading training-operator from stable (Rev 6) to latest/edge (Rev 135):
juju deploy training-operator --channel stable --trust
juju refresh training-operator --channel latest/edge
In the KF 1.6 bundle, training-operator is in the 1.5/stable channel (Rev 65). The proper testing procedure:
juju deploy training-operator --channel 1.5/stable --trust
juju refresh training-operator --channel latest/edge
To reproduce using local charms: build the rev93 version of training-operator and deploy it, then build the latest training-operator and refresh the charm. The charm ends up in Blocked state.
juju deploy ./training-operator_ubuntu-20.04-amd64.charm.old --resource="training-operator-image=kubeflow/training-operator:v1-8c6eab2" --trust
juju refresh training-operator --path=./training-operator_ubuntu-20.04-amd64.charm --resource="training-operator-image=kubeflow/training-operator:v1-27e5499"
Unit Workload Agent Address Ports Message
training-operator/0* blocked idle 10.1.59.79 ...: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules
@i-chvets it might be that podspec charm configs conflict with those that are pushed by Juju. Maybe during upgrade need to clean them up before installing sidecar ?
@i-chvets I am confused, these steps to reproduce, the error there, and the original description of this bug do not match. Could you please confirm what exactly is the issue you are running into and add steps to reproduce?
Also, please note that v1.5 and v1.6 of the training-operator
are both sidecar charms, no need to involve any podspec charm.
~~It looks like it is due to the order of K8S manifests that are being applied during refresh. In some charms (Seldon, Training Operator) we separated CRD and K8S resources deployment, and in some cases the CRDs must be installed before other manifests. In other charms, CRDs and other manifests are applied by the same resource handler, which sorts the resources before it applies them. To solve the issue, we either handle the order in the charm, or we revert to using a single K8S handler to do the ordering for us. I am leaning towards putting everything back into a single handler. I need to confirm this first.~~
The above was not the reason.
I just reproduced using local charms. I don't think versions really matter in this case. I will confirm this. I will re-create steps with proper versions 1.5 and 1.6.
I took the PodSpec version to ensure that the charm is old enough.
Also fails with the sidecar rev93 charm: refresh of local charms from rev93 to latest fails.
@i-chvets before changing anything, we need to understand what is the issue you are running into. Please modify the description of this issue with logs, error messages and steps to reproduce. You have referenced two different error messages and two different set of steps to reproduce, and that makes it hard to understand.
The error message that you see in the local charm steps is the one that is buried deep in the framework. I just changed the top-level generic "K8S resources creation failed" message to actually report the error the framework gives us.
Instructions are the same: refresh from one version to another given manifests are different between versions.
In the previous training-operator charm (1.5/stable), resources were created without specifying field_manager. After the re-write with chisme, which uses the Server-side Apply method, field_manager is a required parameter and is set to lightkube in our latest charm. As a result, changes to the CRDs are not allowed, because the field managers differ: the unspecified one in the previous charm versus lightkube in the latest charm.
Explanation of reason for the conflict: https://kubernetes.io/docs/reference/using-api/server-side-apply/
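The conflict semantics from that page can be illustrated without a cluster: each field remembers which manager last applied it, and a second manager touching the same field is rejected unless it forces ownership. This is a toy simulation; the class and names below are invented for illustration, the real logic lives in the Kubernetes apiserver.

```python
# Toy model of Server-side Apply field ownership (illustration only).
class Conflict(Exception):
    pass

class FieldOwnership:
    def __init__(self):
        self.owners = {}  # field path -> field manager name

    def apply(self, manager, fields, force=False):
        # A field conflicts if it is already owned by a different manager.
        conflicts = [f for f in fields if self.owners.get(f, manager) != manager]
        if conflicts and not force:
            raise Conflict(f"conflict with {sorted({self.owners[f] for f in conflicts})}: {conflicts}")
        for f in fields:
            self.owners[f] = manager  # first apply (or force) takes ownership

crd = FieldOwnership()
# The 1.5/stable charm created the CRD; httpx was recorded as the manager.
crd.apply("python-httpx", [".spec.versions"])
try:
    # The rewritten charm applies as "lightkube" -> conflict.
    crd.apply("lightkube", [".spec.versions"])
except Conflict as err:
    print(err)
# With force=True, "lightkube" takes ownership and the apply succeeds.
crd.apply("lightkube", [".spec.versions"], force=True)
print(crd.owners)  # → {'.spec.versions': 'lightkube'}
```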
There is an option to do a forced change. It will require charm code changes and chisme code changes.
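A minimal sketch of what that "retry with force" flow could look like in charm code (hypothetical wrapper; the real chisme/lightkube plumbing may differ, and the FakeClient below only stands in for a cluster so the flow can be demonstrated):

```python
# Hypothetical apply wrapper: on a field-manager conflict, retry with force.
class ApiConflict(Exception):
    """Stand-in for lightkube's ApiError raised on a field-manager conflict."""

def apply_with_force_fallback(client, obj, force_conflicts=True):
    """Apply obj; on a conflict, optionally retry with force=True."""
    try:
        return client.apply(obj)
    except ApiConflict:
        if not force_conflicts:
            raise
        return client.apply(obj, force=True)

# Fake client that conflicts unless force is set, to demonstrate the flow.
class FakeClient:
    def apply(self, obj, force=False):
        if not force:
            raise ApiConflict('conflict with "python-httpx"')
        return f"applied {obj} (forced)"

print(apply_with_force_fallback(FakeClient(), "tfjobs.kubeflow.org"))
# → applied tfjobs.kubeflow.org (forced)
```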
Verified by this script lightkube-client.py
#!/usr/bin/python3
from lightkube import ApiError, Client, codecs
from lightkube.resources.apiextensions_v1 import CustomResourceDefinition

client = Client(field_manager="lightkube")

with open('crds_manifests.yaml') as f:
    for obj in codecs.load_all_yaml(f):
        try:
            res = client.apply(obj)
        except ApiError as err:
            print(f"Failed to apply {obj.metadata.name}: {err}")
            print("Apply with force='True'")
            res = client.apply(obj, force=True)
After deployment of 1.5/stable
:
$ kubectl get crds -l app.juju.is/created-by=training-operator -o=yaml | grep "ontroller-gen.kubebuilder.io/version"
controller-gen.kubebuilder.io/version: v0.6.0
controller-gen.kubebuilder.io/version: v0.6.0
controller-gen.kubebuilder.io/version: v0.6.0
controller-gen.kubebuilder.io/version: v0.6.0
controller-gen.kubebuilder.io/version: v0.6.0
Use script above to update CRDs. Result:
$ ./lightkube-client.py
Failed to apply xgboostjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply tfjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply pytorchjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply mxjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply mpijobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
$ kubectl get crds -l app.juju.is/created-by=training-operator -o=yaml | grep "ontroller-gen.kubebuilder.io/version"
controller-gen.kubebuilder.io/version: v0.10.0
controller-gen.kubebuilder.io/version: v0.10.0
controller-gen.kubebuilder.io/version: v0.10.0
controller-gen.kubebuilder.io/version: v0.10.0
controller-gen.kubebuilder.io/version: v0.10.0
Need to address this issue first.
Proposed fix:
- Support a force parameter in the apply() function (chisme changes).
- Use the force parameter when applying K8S resources in the training operator (charm changes).
Latest upgrade fails:
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready Error in sys.excepthook:
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready self.emit(record)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/ops/log.py", line 41, in emit
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready self.model_backend.juju_log(record.levelname, self.format(record))
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready return fmt.format(record)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready record.exc_text = self.formatException(record.exc_info)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready for line in TracebackException(
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready yield from self.format_exception_only()
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready stype = smod + '.' + stype
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Looks like it happens when the pebble-ready event is handled. It cannot resolve, and the upgrade-charm event handler never gets to run. I will add a troubleshooting section to the upgrade guide on how to recover, but we need to look into it. I think the pebble-ready event handler needs rework.
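One possible shape for that rework is to catch the apply conflict inside the handler and set a blocked status instead of letting the exception escape, so later events can still run. This is a cluster-free sketch with stand-in classes, not the actual ops framework API:

```python
# Sketch: a defensive pebble-ready handler that converts an apply conflict
# into BlockedStatus rather than raising (stand-in classes, not real ops).
class BlockedStatus:
    def __init__(self, msg):
        self.msg = msg

class ApplyConflict(Exception):
    pass

def on_pebble_ready(apply_resources):
    """Return the unit status after attempting to apply K8S resources."""
    try:
        apply_resources()
        return "active"
    except ApplyConflict as err:
        # Surface the real framework error instead of a generic message.
        return BlockedStatus(f"K8S resources creation failed: {err}")

def conflicting_apply():
    raise ApplyConflict('Apply failed with 1 conflict: conflict with "python-httpx"')

status = on_pebble_ready(conflicting_apply)
print(status.msg)
```

With this shape the unit goes to Blocked with the underlying conflict in the message, and a subsequent upgrade-charm event can still be dispatched.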
Fix has been merged.
Failed to reach active/idle:
training-operator blocked: K8S resources creation failed
Jira
Merge into: track/1.6 (release KF v1.7), main