canonical / opensearch-operator

OpenSearch operator
Apache License 2.0
12 stars 7 forks source link

[DPE-4049][BUGFIX] Exception when offline unit is missing `node.roles` from static config #412

Closed juditnovak closed 2 months ago

juditnovak commented 3 months ago

Issue

https://github.com/canonical/opensearch-operator/issues/222

Context (as highlighted in the issue comment):

"When a unit is online, unit-specific service information is taken from the Opensearch API (/_nodes).

However if the unit is offline, we fall back to static configuration. This is the part where we had the issue, in case node.roles was missing from the configuration."

Solution

We shouldn't directly refer to a field that may be missing from the config.

NOTE: Since the opensearch_distro had no unittests, this PR is also adding a new unttest module dedicated to opensearch_distro. For the time being only the concerned current() method is being tested (both for happy and broken path -- the latter was the target of the fix.) The test module should be extended in the future for additional test coverage for the opensearch_distro module.

NOTE2: The PR is including a global fix for Opensearch Unit Tests, as a global mock was applied (in a hidden corner) imacting ALL tests.

juditnovak commented 2 months ago

I'm doing my best to reproduce the issue for deeper understanding of the original cause, but I'm struggling to get the failure again.

Screenshot from 2024-09-02 13-07-53

juditnovak commented 2 months ago

@Mehdi-Bendriss your suggestions (overall a new version of the original proposal) have been all committed.

I'm confused why you've proposed this code: I do not undestand underlying workflows (further than the instuctions I've been following from earlier comment https://github.com/canonical/opensearch-operator/pull/412#discussion_r1729248248). I'm interested to learn about them (so next time my contribution could be more useful than it seems to be here :-( )

Your proposal introduces a circular import, as ClusterTopology itself internally uses OpenSearchDistributon (the class we are modifying).

image

Do you have another proposal? Would you like to give highlights of underlying logic, so I could have one?

juditnovak commented 2 months ago

Summarizing offline discussion and consequences:

@Mehdi-Bendriss suggested to use keep the circular import. Implemented in commit https://github.com/canonical/opensearch-operator/pull/412/commits/f282ef4aeec756f2cfa4107df7423e89d83f0b97 Integration tests show no issues, on a superficial check the disabled config unittest may. Corresponding unittests disabled, see error in comment.

I believe that we should avoid the circular imports and keep clear integrity of the ClusterTopology object relying on the OpenSearchDistribution as suggested by the original design.

By moving the hard-coded list of default roles into an application-wide constant, this is simple to do. Note that this solution had NO interference with any tests (which should be a signal of a healthy internal refactor).

This proposed solution is added in commit https://github.com/canonical/opensearch-operator/pull/412/commits/d227aa9c48e540edc6b86a492cd15ae9962833e6. (In case circular import is preferred, we just need to roll back this commit.)

Question (@Mehdi-Bendriss, @reneradoi , @phvalguima ): The current solution is ignoring the Deployment Description stored in the databag if the the application was NOT started with WITH_PROVIDED_ROLES. I understand that this is how the application is normally configured. Question: is it possible that throughout runtime of the application the Deployment Description roles (stored in the databag) may change? If this scenario is impossible, I'm changing the corresponding new unittest. Thank you.

juditnovak commented 2 months ago

Summarizing offline discussion: Deployment Description is holding the "desired state", not the current state. The current state is available in the static config. (This also explains why it's safer to back to Deployment Description defaults rather than actual values, in case static config may not be available.)

All this wisdom is "re-cycled" into the code, in the shape of helpful comments around the workflow :-)