Closed ca-scribner closed 1 year ago
This happens if our field_manager does not match the one that Kubernetes records as having last applied the resources. We previously tried to force on only the upgrade event, thinking that doing this would convert the resources to our current lightkube_manager, but for unknown reasons this does not appear to be enough.
canonical/charmed-kubeflow-chisme#65 proposes that we always force during lightkube applies. fwiw, ops-lib-manifest also forces on all applies.
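As a rough illustration of "always force", here is a minimal sketch. It assumes `client` behaves like a lightkube `Client`, whose `apply()` accepts a `force` keyword; the helper name `apply_all` is hypothetical and is not taken from chisme or ops-lib-manifest.

```python
from typing import Any, Iterable, List


def apply_all(client: Any, resources: Iterable[Any], force: bool = True) -> List[Any]:
    """Server-side apply every resource, forcing by default.

    `client` is assumed to expose apply(obj, force=...) like lightkube.Client.
    Forcing on every apply means our field manager takes ownership of the
    fields regardless of which event (upgrade, pebble-ready, ...) runs first.
    """
    applied = []
    for resource in resources:
        # force=True tells the API server to resolve ownership conflicts in our favour.
        applied.append(client.apply(resource, force=force))
    return applied
```

The trade-off is that a forced apply silently overrides fields owned by any other manager, so this only makes sense when the charm is meant to be the sole owner of these resources.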
This issue is likely caused by training-operator 1.5 using patch() without specifying a fieldManager, with the patch strategy set to patch_type=PatchType.MERGE. This may play a role in the issue because training-operator 1.6 uses apply() with a defined fieldManager.
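To make the suspected mechanism concrete, below is a toy model of field ownership. It is a deliberate simplification and not the real server-side apply algorithm: `ToyObject`, `ConflictError`, and the flat path-to-value layout are all hypothetical. It relies on the Kubernetes behaviour that a write sent without an explicit fieldManager is recorded under a name derived from the client's user agent, which would explain the "python-httpx" owner seen in the conflict message.

```python
from typing import Any, Dict


class ConflictError(Exception):
    """Stands in for the 409 Conflict returned by the API server."""


class ToyObject:
    """Toy model of one object with per-field ownership (simplified)."""

    def __init__(self) -> None:
        self.fields: Dict[str, Any] = {}
        self.owners: Dict[str, str] = {}

    def update(self, manager: str, changes: Dict[str, Any]) -> None:
        """Model a non-apply write such as a merge patch: it never conflicts
        and the writer becomes the owner of every field it sets."""
        for path, value in changes.items():
            self.fields[path] = value
            self.owners[path] = manager

    def apply(self, manager: str, desired: Dict[str, Any], force: bool = False) -> None:
        """Model an apply: changing a field owned by another manager is a
        conflict unless force is set, in which case ownership is taken over."""
        conflicts = [
            path
            for path, value in desired.items()
            if path in self.owners
            and self.owners[path] != manager
            and self.fields[path] != value
        ]
        if conflicts and not force:
            raise ConflictError(f"conflict on {conflicts} for manager {manager!r}")
        for path, value in desired.items():
            self.fields[path] = value
            self.owners[path] = manager
```

In this model, calling `update("python-httpx", {".rules": [...]})` followed by `apply("lightkube", {".rules": [...]})` with different rules raises `ConflictError`, while the same `apply` with `force=True` succeeds and moves ownership of `.rules` to `lightkube`.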
training-operator/0* maintenance executing 10.1.15.20 Creating K8S resources
force=True to resolve the conflict, but the pebbleReady event may happen before the upgrade (?), causing conflicts.
Adding some logs and my steps to reproduce:
Steps to reproduce:
juju refresh training-operator --channel 1.6/stable
Logs:
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Error in sys.excepthook:
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready self.emit(record)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/ops/log.py", line 41, in emit
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready self.model_backend.juju_log(record.levelname, self.format(record))
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready return fmt.format(record)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready record.exc_text = self.formatException(record.exc_info)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready for line in TracebackException(
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready yield from self.format_exception_only()
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready stype = smod + '.' + stype
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Original exception was:
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready resp.raise_for_status()
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready raise HTTPStatusError(message, request=request, response=self)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-training-operator-charm?fieldManager=lightkube'
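For comparison with forcing unconditionally, a charm could also retry with force only when the first apply fails with a 409 like the one above. This is a hedged sketch, not the charm's actual code: it assumes the raised exception carries `status.code` (as lightkube's `ApiError` does) and that `client.apply()` accepts `force`; both helper names are hypothetical.

```python
from typing import Any, Optional


def conflict_status_code(error: Exception) -> Optional[int]:
    """Return the HTTP status code carried by an ApiError-like exception, if any."""
    status = getattr(error, "status", None)
    return getattr(status, "code", None)


def apply_with_conflict_retry(client: Any, resource: Any) -> Any:
    """Apply once without force; on a 409 Conflict, retry once with force=True."""
    try:
        return client.apply(resource)
    except Exception as error:
        # Anything other than a field-ownership conflict is re-raised untouched.
        if conflict_status_code(error) != 409:
            raise
        return client.apply(resource, force=True)
```

Because the retry lives next to the apply itself, it behaves the same whichever event handler happens to run first.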
Tested the upgrade from 1.5 to 1.6 with the fix from #105: forcing the apply fixes the issue. The event sequence recorded with jhack:
┃ timestamp ┃ training-operator/0            ┃
│ 12:13:19  │ training_operator_pebble_ready │
│ 12:13:05  │ start                          │
│ 12:13:03  │ config_changed                 │
│ 12:12:48  │ upgrade_charm                  │
│ 12:12:15  │ stop                           │
fixed by #105
Reproduce by:
training-operator will appear in juju status as constantly working and in MaintenanceStatus, and logs will show it repeatedly trying to resolve the pebble-ready event but ending with a 409 conflict error.
Oddly, if we then
training-operator 1.6 deploys successfully. It is unclear whether this has something to do with the event order or some other inconsistency.