canonical / training-operator

Kubeflow Training Operator
Apache License 2.0

upgrade from 1.5 to 1.6 intermittently fails due to 409 conflict during k8s resource creation #104

Closed ca-scribner closed 1 year ago

ca-scribner commented 1 year ago

Reproduce by:

juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator then appears in juju status as constantly working and in MaintenanceStatus, and the logs show it repeatedly retrying the pebble-ready event, each attempt ending with a 409 Conflict error.

Oddly, if we then run:

juju remove-application training-operator
juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator 1.6 deploys successfully. It is unclear whether this has to do with the event order or with some other inconsistency.

ca-scribner commented 1 year ago

This happens if our field_manager does not match the field manager that Kubernetes records as having last applied the resources. We previously tried to force only on the upgrade event, expecting that this would transfer the resources to our current lightkube_manager, but for unknown reasons this does not appear to be enough.

canonical/charmed-kubeflow-chisme#65 proposes that we always force during lightkube applies. FWIW, ops-lib-manifest also forces on all applies.
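
For illustration, a minimal sketch of what "always force" could look like. It assumes a lightkube-style client whose `apply()` accepts the `field_manager` and `force` keyword arguments. The helper name `apply_all` and the default manager name are hypothetical and are not taken from the chisme change itself.

```python
from typing import Any, Iterable


def apply_all(client: Any, resources: Iterable[Any], field_manager: str = "lightkube") -> int:
    """Server-side apply every resource, forcing ownership on every call.

    `client` is assumed to behave like lightkube's Client, i.e. to expose
    apply(obj, field_manager=..., force=...). `apply_all` is a hypothetical helper.
    """
    count = 0
    for resource in resources:
        # force=True asks the API server to hand conflicting fields over to
        # `field_manager` instead of rejecting the request with 409 Conflict.
        client.apply(resource, field_manager=field_manager, force=True)
        count += 1
    return count
```

Against a real cluster, `client` would typically be a `lightkube.Client` and `resources` the objects loaded from the charm's manifests. Forcing on every apply, rather than only on upgrade, means the outcome does not depend on which event happens to run first.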

DnPlas commented 1 year ago

This issue is likely caused by training-operator 1.5 using patch() without specifying a fieldManager, with the patch strategy set to patch_type=PatchType.MERGE. This may play a role because training-operator 1.6 uses apply() with a defined fieldManager.
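
To make the suspected mechanism concrete, here is a toy, in-memory model of field ownership. It is not the Kubernetes implementation and ignores most of server-side apply (for example, removing fields). The class names are hypothetical, and the default manager name `python-httpx` is taken from the conflict message in the logs below.

```python
class Conflict(Exception):
    """Stands in for the API server's 409 Conflict response."""


class FakeResource:
    """Toy model of one object with per-field ownership tracking."""

    def __init__(self):
        self.fields = {}  # field path -> value
        self.owners = {}  # field path -> set of field manager names

    def patch(self, changes, manager="python-httpx"):
        # Non-apply write: with no explicit fieldManager, ownership is recorded
        # under a name derived from the client's user agent.
        for path, value in changes.items():
            self.fields[path] = value
            self.owners[path] = {manager}

    def apply(self, desired, manager, force=False):
        # A conflict arises when we set a different value on a field that
        # another manager owns.
        conflicts = [
            (path, other)
            for path, value in desired.items()
            for other in sorted(self.owners.get(path, set()) - {manager})
            if self.fields.get(path) != value
        ]
        if conflicts and not force:
            raise Conflict(conflicts)
        for path, value in desired.items():
            if path in self.owners and self.fields.get(path) == value:
                self.owners[path].add(manager)  # same value: shared ownership
            else:
                self.fields[path] = value
                self.owners[path] = {manager}  # changed value: sole ownership


role = FakeResource()
role.patch({".rules": "1.5 rules"})  # 1.5-style write, no explicit manager
try:
    role.apply({".rules": "1.6 rules"}, manager="lightkube")
except Conflict as err:
    print("conflict:", err)
role.apply({".rules": "1.6 rules"}, manager="lightkube", force=True)  # takes ownership
```

In this model the unforced apply is rejected because `.rules` is owned by the manager recorded for the earlier patch, while the forced apply overwrites the value and transfers ownership to `lightkube`.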

To consider

Adding some logs and my steps to reproduce:

Steps to reproduce:

  1. Deploy kubeflow 1.6/stable
  2. Wait for training-operator to be active and idle
  3. juju refresh training-operator --channel 1.6/stable

Logs:

unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Error in sys.excepthook:
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     self.emit(record)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/ops/log.py", line 41, in emit
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     self.model_backend.juju_log(record.levelname, self.format(record))
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     return fmt.format(record)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     record.exc_text = self.formatException(record.exc_info)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     for line in TracebackException(
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     yield from self.format_exception_only()
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     stype = smod + '.' + stype
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready 
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Original exception was:
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     resp.raise_for_status()
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready     raise HTTPStatusError(message, request=request, response=self)
unit-training-operator-0: 11:00:26 WARNING unit.training-operator/0.training-operator-pebble-ready httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-training-operator-charm?fieldManager=lightkube'

NohaIhab commented 1 year ago

Tested the upgrade from 1.5 to 1.6 with the fix from #105: forcing the apply fixes the issue. The event sequence recorded with jhack:

┃ timestamp ┃ training-operator/0            ┃
│ 12:13:19  │ training_operator_pebble_ready │
│ 12:13:05  │ start                          │
│ 12:13:03  │ config_changed                 │
│ 12:12:48  │ upgrade_charm                  │
│ 12:12:15  │ stop                           │

NohaIhab commented 1 year ago

fixed by #105