canonical / hardware-observer-operator

A charm to setup prometheus exporter for IPMI, RedFish and RAID devices from different vendors.
Apache License 2.0

Charm in error state after configuring redfish creds #190

Status: Closed (dashmage closed this issue 4 months ago)

dashmage commented 4 months ago

Deploy ubuntu, hardware-observer and grafana-agent on a machine with redfish.

hardware-observer revision is 48

juju deploy ubuntu
juju deploy hardware-observer --channel edge
juju deploy grafana-agent

juju relate ubuntu grafana-agent
juju relate ubuntu hardware-observer

# status blocked due to missing storcli resource
juju attach-resource hardware-observer storcli-deb=./storcli.deb 

Here are the juju debug logs

unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log hw_white_list: [<HWTool.STORCLI: 'storcli'>, <HWTool.IPMI_SENSOR: 'ipmi_sensor'>, <HWTool.IPMI_SEL: 'ipmi_sel'>, <HWTool.IPMI_DCMI: 'ipmi_dcmi'>, <HWTool.REDFISH: 'redfish'>]
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Skip fetch tool: HWTool.PERCCLI
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Skip fetch tool: HWTool.SAS2IRCU
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Skip fetch tool: HWTool.SAS3IRCU
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Install deb package HWTool.STORCLI from /var/lib/juju/agents/unit-hardware-observer-2/resources/storcli.deb success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Strategy <hw_tools.StorCLIStrategy object at 0x7fd9a30e4e20> install success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Strategy <hw_tools.IPMISELStrategy object at 0x7fd9a30e5120> install success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Strategy <hw_tools.IPMIDCMIStrategy object at 0x7fd9a30e5fc0> install success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Strategy <hw_tools.IPMISENSORStrategy object at 0x7fd9a30e5ba0> install success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Strategy <hw_tools.RedFishStrategy object at 0x7fd9a30e4f40> install success
unit-hardware-observer-2: 17:17:37 INFO unit.hardware-observer/2.juju-log Attempt 1 of /redfish/v1/
unit-hardware-observer-2: 17:17:38 INFO unit.hardware-observer/2.juju-log Response Time for GET to /redfish/v1/: 0.12259700102731586 seconds.
unit-hardware-observer-2: 17:17:38 INFO unit.hardware-observer/2.juju-log Attempt 1 of /redfish/v1/SessionService/Sessions
unit-hardware-observer-2: 17:17:39 INFO unit.hardware-observer/2.juju-log Response Time for POST to /redfish/v1/SessionService/Sessions: 1.3072235389845446 seconds.
unit-hardware-observer-2: 17:17:39 INFO unit.hardware-observer/2.juju-log Login returned code 401: {"error":{"@Message.ExtendedInfo":[{"MessageId":"ExtendedError.1.2.InvalidCredentials","Resolution":"Please request again with correct credentials.","MessageSeverity":"Critical","Message":"The login credentials is invalid.","MessageArgs":[],"@odata.type":"#Message.v1_1_0.Message"}],"message":"A general error has occurred. See ExtendedInfo for more information.","code":"Base.1.8.GeneralError"}}

unit-hardware-observer-2: 17:17:39 ERROR unit.hardware-observer/2.juju-log invalid redfish credential: HTTP 401 Unauthorized returned: Invalid credentials supplied
unit-hardware-observer-2: 17:17:39 ERROR unit.hardware-observer/2.juju-log Invalid redfish credentials.

Ideally the charm should be in blocked status with the message Invalid config: 'redfish-username' or 'redfish-password', because the redfish credentials are invalid. Instead it is blocked with the message Missing relation: [cos-agent]. The blocked status for the invalid redfish creds gets overwritten by the missing relation message when the update-status hook runs.

Now relate the hardware-observer and grafana-agent charms.

juju relate hardware-observer grafana-agent

This does not immediately change the existing blocked status message about the missing cos-agent relation. Only once the update-status hook runs again (after 5m by default) does the charm's status change to the invalid redfish credentials message.

unit-grafana-agent-0: 17:46:48 INFO juju.worker.uniter.operation ran "cos-agent-relation-created" hook (via hook dispatching script: dispatch)
unit-hardware-observer-3: 17:46:49 INFO juju.worker.uniter.operation ran "cos-agent-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-agent-0: 17:46:49 INFO juju.worker.uniter.operation ran "cos-agent-relation-joined" hook (via hook dispatching script: dispatch)
unit-hardware-observer-3: 17:46:50 INFO unit.hardware-observer/3.juju-log cos-agent:9: Defer cos-agent relation join because exporter or resources is not ready yet.
unit-hardware-observer-3: 17:46:50 INFO juju.worker.uniter.operation ran "cos-agent-relation-joined" hook (via hook dispatching script: dispatch)
unit-grafana-agent-0: 17:46:51 INFO juju.worker.uniter.operation ran "cos-agent-relation-changed" hook (via hook dispatching script: dispatch)
unit-hardware-observer-3: 17:46:51 INFO unit.hardware-observer/3.juju-log cos-agent:9: Defer cos-agent relation join because exporter or resources is not ready yet.
unit-hardware-observer-3: 17:46:52 INFO juju.worker.uniter.operation ran "cos-agent-relation-changed" hook (via hook dispatching script: dispatch)
unit-grafana-agent-0: 17:46:53 INFO juju.worker.uniter.operation ran "cos-agent-relation-changed" hook (via hook dispatching script: dispatch)
unit-grafana-agent-0: 17:47:02 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-pin-ubuntu-0: 17:47:12 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

Now set the redfish credential config options

juju config hardware-observer redfish-username="user" redfish-password="password"

The config-changed hook fails and the charm goes into error state with the following logs:

unit-hardware-observer-3: 17:52:10 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:10 ERROR unit.hardware-observer/3.juju-log Failed to run 'check_health'
unit-hardware-observer-3: 17:52:10 WARNING unit.hardware-observer/3.juju-log Exporter health check - failed.
unit-hardware-observer-3: 17:52:10 WARNING unit.hardware-observer/3.juju-log Restarting exporter - 1 retry
unit-hardware-observer-3: 17:52:10 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:10 ERROR unit.hardware-observer/3.juju-log Failed to run 'restart'
unit-hardware-observer-3: 17:52:13 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:13 ERROR unit.hardware-observer/3.juju-log Failed to run 'check_active'
unit-hardware-observer-3: 17:52:13 WARNING unit.hardware-observer/3.juju-log Restarting exporter - 2 retry
unit-hardware-observer-3: 17:52:13 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:13 ERROR unit.hardware-observer/3.juju-log Failed to run 'restart'
unit-hardware-observer-3: 17:52:16 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:16 ERROR unit.hardware-observer/3.juju-log Failed to run 'check_active'
unit-hardware-observer-3: 17:52:16 WARNING unit.hardware-observer/3.juju-log Restarting exporter - 3 retry
unit-hardware-observer-3: 17:52:16 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:16 ERROR unit.hardware-observer/3.juju-log Failed to run 'restart'
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Failed to run 'check_active'
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Exporter is not installed properly.
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Failed to run 'check_active'
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Failed to restart the exporter.
unit-hardware-observer-3: 17:52:19 ERROR unit.hardware-observer/3.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/model.py", line 2955, in _run
    result = subprocess.run(args, **kwargs)  # type: ignore
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-hardware-observer-3/status-set', '--application=False', 'error', 'Exporter crashed unexpectedly, please refer to systemd logs...')' returned non-zero exit status 2.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/./src/charm.py", line 346, in <module>
    ops.main(HardwareObserverCharm)  # type: ignore
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/main.py", line 451, in __call__
    return main(charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/./src/charm.py", line 229, in _on_config_changed
    self._on_update_status(event)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/./src/charm.py", line 148, in _on_update_status
    self.model.unit.status = restart_status
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/model.py", line 541, in status
    self._backend.status_set(value.name, value.message, is_app=False)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/model.py", line 3157, in status_set
    self._run('status-set', f'--application={is_app}', status, message)
  File "/var/lib/juju/agents/unit-hardware-observer-3/charm/venv/ops/model.py", line 2957, in _run
    raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR invalid status "error", expected one of [maintenance blocked waiting active]

unit-hardware-observer-3: 17:52:19 ERROR juju.worker.uniter.operation hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-hardware-observer-3: 17:52:19 INFO juju.worker.uniter awaiting error resolution for "config-changed" hook

rgildein commented 4 months ago

I think this issue is due to using ErrorStatus, which should not be used, as mentioned in the description here. At the very least, the error shown in the logs should not look like this.

As far as I know, the charm should never set it on its own.

dashmage commented 4 months ago

Good catch, it's setting the ErrorStatus as part of the restart_exporter method here. This should probably be replaced with BlockedStatus. Raising the exception directly as mentioned in Robert's comment might be a better idea since if the exporter crashed, the user probably can't do too much to get hardware observer functional again.

rgildein commented 4 months ago

Either blocked status or raising an exception (which will end up in error state), but not setting ErrorStatus directly.
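
To make the two options concrete, here is a minimal sketch in plain Python (it deliberately does not import ops). The allowed status names are taken from the ModelError in the traceback above; `ExporterCrashedError`, `validate_status_name`, `status_for_restart_failure` and the blocked message are hypothetical and not taken from the charm.

```python
# Status names a charm may set itself, per the ModelError in the traceback:
# "expected one of [maintenance blocked waiting active]".
SETTABLE_STATUSES = ("maintenance", "blocked", "waiting", "active")


class ExporterCrashedError(Exception):
    """Hypothetical exception. Letting it propagate fails the hook,
    and Juju then puts the unit into error state by itself."""


def validate_status_name(name: str) -> str:
    """Hypothetical helper: reject names that status-set would refuse."""
    if name not in SETTABLE_STATUSES:
        raise ValueError(
            f"invalid status {name!r}, expected one of {SETTABLE_STATUSES}"
        )
    return name


def status_for_restart_failure(user_can_fix: bool):
    """Pick one of the two options discussed in this thread.

    Option 1: return a blocked status the operator can act on.
    Option 2: raise, so the hook fails and Juju reports the error.
    """
    if user_can_fix:
        # The message text is an assumption, not the charm's wording.
        return validate_status_name("blocked"), "Exporter failed to restart"
    raise ExporterCrashedError("Exporter crashed unexpectedly")
```

Either path avoids passing "error" to status-set, which is what triggered the ModelError.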

chanchiwai-ray commented 4 months ago

Did you check why the exporter failed to restart? My guess is that the charm did not install the exporter at all because it did not pass the validation in the on_install / on_upgrade event. As a quick fix, we can consider removing the config validation step from the install and upgrade events, because those events are followed by a config_changed event and the config validation is done there as well.

chanchiwai-ray commented 4 months ago

You also mentioned another issue: the charm status is overwritten on_update_status. We should consider using on_collect_unit_status, and avoid a callback of self._on_update_status(event) on each event.
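
A simplified model of why collecting statuses helps, written in plain Python rather than against the ops API. It assumes the resolution rule described in the ops documentation (highest priority wins, and among equal priorities the first status added wins); the numeric priorities, the function names and the "Unit is ready" message are assumptions for illustration. The two blocked messages are the ones from this issue.

```python
# Assumed priority ordering: blocked > maintenance > waiting > active.
PRIORITY = {"blocked": 4, "maintenance": 3, "waiting": 2, "active": 1}


def resolve_status(collected):
    """Return the winning (name, message) pair from the collected statuses."""
    if not collected:
        return ("unknown", "")
    # max() returns the first maximal element, so ties go to the
    # status that was added first.
    return max(collected, key=lambda status: PRIORITY.get(status[0], 0))


def collect(relation_present: bool, redfish_creds_valid: bool):
    """Re-evaluate every check on each hook instead of overwriting status."""
    collected = []
    if not redfish_creds_valid:
        collected.append(
            ("blocked", "Invalid config: 'redfish-username' or 'redfish-password'")
        )
    if not relation_present:
        collected.append(("blocked", "Missing relation: [cos-agent]"))
    collected.append(("active", "Unit is ready"))  # hypothetical message
    return resolve_status(collected)


print(collect(relation_present=False, redfish_creds_valid=False))
```

Because every check runs on every hook, the reported status no longer depends on which handler happened to run last; the order in which checks are added decides which blocked message is shown.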

dashmage commented 4 months ago

> My guess is that the charm did not install the exporter at all because it did not pass the validation in the on_install / on_upgrade event. As a quick fix, we can consider removing the config validation step from the install and upgrade events, because those events are followed by a config_changed event and the config validation is done there as well.

Yep I traced the logic and that's why. Since it doesn't pass validation on install, all the exporter related functions fail (check_health, check_active etc) and this causes the error to be raised. The quick fix idea is good but the logic flow is very convoluted currently. We might need to revisit the lifecycle again later to streamline it.

Edit: Got an even simpler fix which might be better. When running exporter.check_health we aren't checking whether the exporter is installed or not. If we do that, then we don't end up having this problem.
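
Roughly what I have in mind, as a sketch in plain Python. The `Exporter` class here is a hypothetical stand-in, not the charm's real exporter wrapper, and the `installed` attribute, `exporter_status` function and status messages are assumptions.

```python
class Exporter:
    """Hypothetical stand-in for the charm's exporter wrapper."""

    def __init__(self, installed: bool, healthy: bool = False):
        self.installed = installed
        self.healthy = healthy

    def check_health(self) -> bool:
        # In the real charm this would query the exporter service.
        return self.healthy


def exporter_status(exporter: Exporter):
    """Only run the health check (and any restart logic) when installed."""
    if not exporter.installed:
        # Skip check_health and the restart loop entirely, so a failed
        # install cannot cascade into restart failures.
        return ("blocked", "Exporter is not installed")
    if not exporter.check_health():
        return ("blocked", "Exporter is unhealthy")
    return ("active", "Unit is ready")


print(exporter_status(Exporter(installed=False)))
```

With the guard in place, the restart retries seen in the logs above would never start for an exporter that was not installed in the first place.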

> You also mentioned another issue: the charm status is overwritten on_update_status. We should consider using on_collect_unit_status, and avoid a callback of self._on_update_status(event) on each event.

Coincidentally, I was looking into the same idea and reading up on the docs for it so I can raise another PR. That would definitely help.

Thanks @chanchiwai-ray for your inputs :smile: