canonical / data-platform-libs

A collection of charm libraries curated by the Data Platform Team
https://charmhub.io/data-platform-libs
Apache License 2.0

[DPE 4043] Bugfixes for issues 155, 156, 158, 159, 160 #157

Closed by juditnovak 5 months ago

juditnovak commented 5 months ago

NOTE: Also tested on

juditnovak commented 5 months ago

@welpaolo @marceloneppel

The PR is purposefully split into neat commits; please DON'T review it all at once, it's chaotic that way! The separate commits isolate the changes for the different issues, specifically to avoid reviewer confusion.

(I think this is way more efficient than having 6 libversion changes and the overhead of 6 PRs... Not to mention that several of the issues are critically blocking the team.)

juditnovak commented 5 months ago

@marceloneppel

@juditnovak, can you provide more details of the solution for https://github.com/canonical/data-platform-libs/issues/156? It wasn't very easy to understand the fix because it was mixed with the fixes for 155 and 159 in one single commit.

Huh, indeed, there's QUITE a bit of embedded context here :sweat_smile:

Let me try to summarize (fast?).

  1. There was the stone age: all we had was the plain-text databag. We already had set_secret()/get_secret() methods in our charms, but in reality we were only interacting with the databag, i.e. we had our "secrets" stored as {'username': 'myuser', 'password': 'plain_text_password!'}

  2. Then Juju Secrets got introduced, and we changed our get_secret()/set_secret() methods to add sensitive info to a Juju Secret instead of the databag. We saved secret references in a databag field such as {'internal-secret': 'secret://<model-uuid>/<xid>'}. (To make it a bit worse, we had no libs to deal with this at the time, so each developer could name their internal-secret field however they preferred.)

  3. Then we started to use Juju Secret labels. Since then we're not saving anything to the databag.
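For illustration, the three "generations" could be sketched in plain Python like this; the field names and values below are made up for the example, not the lib's actual data:

```python
# Hypothetical examples of the three peer-databag "generations" above.
# (Field names/values are illustrative, not the actual lib data.)

PLAINTEXT_ERA = {"username": "myuser", "password": "plain_text_password!"}  # 1: plain text
URI_ERA = {"internal-secret": "secret://<model-uuid>/<xid>"}                # 2: secret URI in a field
LABEL_ERA = {}                                                              # 3: label-only, empty databag

def storage_generation(databag):
    """Guess which era a peer databag comes from (sketch only)."""
    if any(str(v).startswith("secret:") for v in databag.values()):
        return 2  # a Juju Secret URI stored in some databag field
    if databag:
        return 1  # plain-text credentials right in the databag
    return 3      # nothing stored; the secret is found via its label

print([storage_generation(b) for b in (PLAINTEXT_ERA, URI_ERA, LABEL_ERA)])
# → [1, 2, 3]
```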

Now, this is the thing. We need to support both rolling upgrades and rollbacks (typically in case an upgrade turns out, soon after, to have "unwanted" (mis)features).

We have to make sure that our code can cope with "any" kind of (legacy) data it may find from one second to another, NOT only the "perfect" environment we set up from the start. We may be "randomly dropped in the middle of a situation" (i.e. an upgrade from an old charm version) where we find a whatever-dev-calls-their-internal-secret: <secret URI> field instead of the labels we're using these days.

We still need to be able to find our sensitive info in that secret, and stick a label on the secret for further usage. (Also to remove the databag field on the first update operation -- as backwards compatibility is only ensured up to that point.)
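A minimal sketch of that first-write migration, with a plain dict standing in for the Juju secret backend (the real code goes through the Juju secrets API; all names here are hypothetical):

```python
# Hedged sketch: find the secret via the legacy URI field, attach our label,
# and drop the databag field. A dict stands in for the Juju secret backend.

SECRET_STORE = {"secret://model/xid1": {"password": "s3cret"}}  # fake backend
LABELS = {}  # label -> secret URI; stands in for Juju's label lookup

def migrate_on_first_write(databag, legacy_field, new_label):
    """On the first write, adopt a legacy URI-referenced secret under our label."""
    uri = databag.get(legacy_field)
    if uri and uri in SECRET_STORE:
        LABELS[new_label] = uri    # "stick the label on it" for future lookups
        del databag[legacy_field]  # backwards compatibility ends here
    return SECRET_STORE[LABELS[new_label]]

bag = {"internal-secret": "secret://model/xid1"}
content = migrate_on_first_write(bag, "internal-secret", "database-peers.app")
print(content, bag)
# → {'password': 's3cret'} {}
```

After this first write, the databag field is gone, so a rollback to the old charm version is only safe before this point -- exactly the constraint described above.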

Now this is exactly where we had the bug :-)

juditnovak commented 5 months ago

BTW this whole scenario is very well tested by PGbouncer, where I partially developed this code (see CI pipeline in the Description)

juditnovak commented 5 months ago

@welpaolo @marceloneppel Since I see both of you aligned on "don't dump this many fixes in a PR", I'm honestly asking: did you guys try reviewing it commit-by-commit? Was that an equally confusing experience?

I'm asking because I purposefully structured it that way to make your job easier, and it's important info for me if it was not helpful.

(I had 2 reasons to do this. First, there's quite an overhead per PR, and this way I could reduce it -- not to mention the rebases required to keep 5-6 parallel PRs in sync. Overall what I mean is that if I had broken this down into 5-6 PRs, the same job wouldn't have happened :sweat_smile: while some of the changes were critical/blocking... :-/ Secondly, I think it's a bit of an overkill to introduce a new version just because of turning an ERROR into a WARNING... or fixing an internal backwards compatibility bug.)

Yet, of course, I'd like to find a balance so that review doesn't become a nightmare (which I thought I'd ensured with neat commits). This has worked well for me in the past (both as a reviewer and as a developer), but of course now the question is whether it works for you or not :-)

marceloneppel commented 5 months ago


Thanks for explaining this topic too, Judit. With the third point from your comment, I could better understand the difference between the fix for #156 and the other fixes from the same commit. Previously, I was factoring in other, unrelated changes when reviewing that issue-specific fix.

marceloneppel commented 5 months ago


Regarding this topic, I used your commits to review the PR. Thanks for separating it that way; I think it's extremely important to do that when we have multiple changes in the same PR. What I missed initially was only a little more context about #156 and the way you fixed it. The commit containing multiple fixes also made me struggle a bit. The explanation you added later, specifically the third item, made me understand it clearly.

juditnovak commented 5 months ago

@welpaolo In response to https://github.com/canonical/data-platform-libs/pull/157#discussion_r1570948486

Another thing about this secret label concept...

Just for a quick recap: secrets don't come with a reliable ID that we could safely use (even cross-model!). Thus we need to stick a so-called 'label' on them the first time we add/fetch them... Otherwise it may be difficult (impossible?) to refer to them later.

Now, once you've stuck a label on a secret, there's no way to "replace it with another". A label can be added at secret creation and on fetch. Later on we can modify the secret contents, but we have no support (official, documented, maintained method) for changing metadata. It may even be impossible?

Therefore, when we introduce a change of secret labels (as done in this PR), we have to make sure that rolling upgrades are still safe for rollbacks, meaning that we apply the policy of "leaving existing data intact (allowing for a smooth rollback) until the 1st write operation".

This is the point to realize that secret labels (!) are part of the relation data. No, NOT only the databag, NOT only the secret contents. The label as well.

The reason is the following. Since we don't have qualified IDs, our only 'grip' on a secret is its label ("sticker"). We can't retrieve or recognize it otherwise. Now, if our "sticker" changes shape, we won't be able to retrieve a secret that was attached to a differently shaped "sticker".

Therefore we have to make sure of 2 things:

  1. We keep track of the "old stickers" (see the _previous_labels() function)
  2. On the first write operation we don't just write content, BUT also make sure that the new label "takes effect".

Now, since we have no way to change secret labels, we do this by dropping the old secret and creating a new one with the same content.
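As a sketch (again with a plain dict standing in for Juju's label-keyed secret lookup, and all names made up for the example), the relabel-by-recreate step could look like:

```python
# Hedged sketch of "relabel by recreate": since a secret's label can't be
# changed in place, drop the old secret and recreate it under the new label.

STORE = {"old-style-label": {"password": "s3cret"}}  # label -> content (fake Juju)

def previous_labels():
    """Hypothetical stand-in for the lib's _previous_labels() helper."""
    return ["old-style-label", "even-older-label"]

def relabel_on_first_write(new_label, new_content):
    for old in previous_labels():       # try every "old sticker" we know of
        content = STORE.pop(old, None)  # drop the old secret...
        if content is not None:
            content.update(new_content)
            STORE[new_label] = content  # ...and recreate it under the new label
            return
    STORE.setdefault(new_label, {}).update(new_content)  # nothing legacy found

relabel_on_first_write("new-style-label", {"password": "rotated!"})
print(STORE)
# → {'new-style-label': {'password': 'rotated!'}}
```

The key point the sketch illustrates: the migration only happens on a write, so until the first write the old label (and a rollback that expects it) keeps working.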

Huh, so this is the rationale behind the function you've been curious about :-)

juditnovak commented 5 months ago

@welpaolo

In the future I would like to have more integration tests to check for backward compatibility.

I completely agree with you on that.

One important note here is the Rolling Upgrades test module that I added about half a year ago, which IS testing upgrades from an ancient version (v18, before secrets) of the lib to the current one.

These tests should probably be run with multiple "significant" old versions of the lib, including the "latest" before the current modifications. This is a simple thing to add, though I have a much better improvement in mind (based on discussions with @taurus-forever... I just never got around to scheduling my "rusting" ticket https://warthogs.atlassian.net/browse/DPE-2994 on a sprint :-( )

The constraint really is time -- integration tests are elaborate, and backwards-compatibility logic is more than "taking over an old version of the library", since the legacy behaviors addressed in this PR typically predate our use of the lib for local secrets. The legacy logic was in each individual charm (luckily conceptually equal, though).

The way I'm "cutting corners" is to use various other charms' CI to test new versions of the lib -- generally the charm that raised the issue/feature request.

Now given the volume of issues here, multiple charms were impacted (out of which I'd highlight PGBouncer, which has very good backwards compatibility tests). Combined with the non-negligible unit testing... I think we should be in a decently safe place.