Provide a way for administrators to identify "undecryptable" saved objects

jportner commented 2 years ago

If you have some encrypted saved objects in Kibana, and then you accidentally change the encryption key (without adding the current key to decryptionOnlyKeys), you'll have partial data loss. Unfortunately, today this is hard to triage, because objects can get into a mix of two different "undecryptable" states.

1) Can't decrypt on read --

When you try to read an encrypted saved object, Kibana first decrypts it. If Kibana doesn't have the right key available (either in xpack.encryptedSavedObjects.encryptionKey or xpack.encryptedSavedObjects.keyRotation.decryptionOnlyKeys), then Kibana will fail to decrypt the object. It won't change the object but it will log an error message and either (a) return the object with the attributes stripped out, or (b) throw an error, depending on what the consumer has specified.

So in this scenario, the object is unchanged -- it still has encrypted attributes that Kibana cannot read.

2) Can't decrypt on migrate --

When Kibana first starts up after an upgrade, it will attempt to migrate all saved objects to their latest versions. The thing is, not all objects are migrated in every Kibana version. So this really depends on what version of Kibana you are running before/after. However, if Kibana detects that it needs to migrate an encrypted saved object, it will attempt to do so -- if it cannot (because, again, it doesn't have the right key available), then it will strip* the encrypted attributes from the object and reindex it.

So in this scenario, the object is changed -- it no longer has any encrypted attributes.

*Note: this "stripping" behavior is defined by the consumer, and it happens for all encrypted saved object migrations today, but there could be some migrations in the future that choose not to do this.

Potential solution

This is a (hopefully) rare situation for a Kibana operator to find themselves in. Still, to aid in support, we should provide an API to identify all "undecryptable" saved objects. The current Rotate encryption key API provides much of the desired behavior for this, but 1. it doesn't currently treat undecryptable objects as "failures", 2. it doesn't attempt to decrypt objects using the primary encryption key (so you can't differentiate objects that are currently in a good state), 3. it doesn't return object IDs for successes or failures, and 4. it simply skips objects that have no encrypted attributes (because there's nothing to decrypt or re-encrypt!)

I think we should probably enhance this API for this use case, perhaps by changing the shape of the response and by adding an option to exclude the primary encryption key (but leave that option disabled by default so this API is idempotent).

elasticmachine commented 2 years ago

Pinging @elastic/kibana-security (Team:Security)

pauldotpower commented 2 years ago

The behaviour during upgrade seems like a concern if it happens frequently. If my saved object has encrypted attributes then I would expect them to be either not migrated (and an error or warning surfaced) or the encrypted attribute is migrated as is without decryption (though I can see how this would only work in simple cases)

jportner commented 2 years ago

The behaviour during upgrade seems like a concern if it happens frequently. If my saved object has encrypted attributes then I would expect them to be either not migrated (and an error or warning surfaced) or the encrypted attribute is migrated as is without decryption (though I can see how this would only work in simple cases)

In general I would agree, and that's why the consumer determines this behavior.

In this case, our only consumer is Alerting, and the Alerting rules only contain encrypted API keys. If the encrypted attribute is stripped, the rule reverts to a "Disabled" state, and we log some error messages telling the operator that they have a disabled Alerting rule. Then, the user can simply view the Alerting rule and re-enable it (which will generate a new API key and encrypt it).

I'm lacking some historical context on the decision to function like this, but my understanding is that we want to avoid failing the upgrade migration at all costs, so we chose this path as the lesser of two evils. Perhaps @gmmorris or @mikecote has something to add.

mikecote commented 2 years ago

We've recently changed alerting migrations to fail the overall migration process to avoid data loss and/or corruption via additional authentication data (AAD) (https://github.com/elastic/kibana/pull/105968). Not only do alerting rules store API keys but connector saved-objects also contain sensitive information in their encrypted configuration (secrets). If we were to allow migrations to continue, we would break most (if not all) of the user's connectors and also leave them un-migrated because we do not have access to the encrypted attributes to migrate.

elastic / kibana