Closed: pmuellr closed this issue 1 year ago
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Pinging @elastic/kibana-core (Team:Core)
Yea, atm the licensing plugin does not perform any kind of retry in case of a network error, or if ES returns an error of any kind, meaning that this kind of (potentially temporary) upstream outage results in the license$ observable emitting an error until the next (successful) license fetch.
Labelling for both Core and Alerting - seems like if we could agree on some "common" behavior here, in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced ^^^.
That's a good question. I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.
Maybe improving createLicensePoller and/or createLicenseUpdate to perform retries in case of such failures would make sense. The hardest part would probably be to identify which errors should be considered retry-able (kinda similar to what we're doing with retries in the SO migration).
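Very roughly, something like the sketch below (assuming RxJS 7's config-based retry; fetchLicense and the backoff numbers are placeholders, not the real createLicenseUpdate internals):

```ts
import { defer, retry, timer, Observable } from 'rxjs';

// Placeholder shape for whatever the licensing plugin's fetcher returns.
interface ILicense {
  isAvailable: boolean;
  type?: string;
}

const fetchLicenseWithRetry = (
  fetchLicense: () => Promise<ILicense>,
  maxRetries = 3
): Observable<ILicense> =>
  defer(() => fetchLicense()).pipe(
    retry({
      count: maxRetries,
      // Exponential backoff between attempts: 1s, 2s, 4s, ...
      delay: (_error, retryCount) => timer(1000 * 2 ** (retryCount - 1)),
    })
  );

// Once maxRetries is exhausted, the error still propagates to license$
// exactly as it does today.
```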
cc @mshustov wdyt?
I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.
yeah, it makes sense for the licensing API to prevent these cascading side effects.
The hardest part would probably be to identify which errors should be considered as retry-able (kinda similar to what we're doing with retries in the SO migration).
It might be better to extract this logic and provide it from the Elasticsearch service, maybe? All the network problems can be considered an intermittent effect. Not sure about the others from https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.ts. Also, we can limit the max number of retries, with exponential backoff, to avoid DDoSing the Elasticsearch server.
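As an illustrative sketch of such a shared helper (the status list roughly mirrors what catch_retryable_es_client_errors.ts checks today; which codes actually belong in it is exactly the open question):

```ts
import { errors as EsErrors } from '@elastic/elasticsearch';

// Which response codes count as transient is up for discussion; this list is illustrative.
const RETRYABLE_STATUSES = [401, 403, 408, 410, 503];

export const isRetryableEsError = (e: unknown): boolean => {
  // Network-level failures are treated as intermittent.
  if (
    e instanceof EsErrors.NoLivingConnectionsError ||
    e instanceof EsErrors.ConnectionError ||
    e instanceof EsErrors.TimeoutError
  ) {
    return true;
  }
  // Otherwise only retry on specific HTTP status codes.
  return (
    e instanceof EsErrors.ResponseError &&
    e.statusCode != null &&
    RETRYABLE_STATUSES.includes(e.statusCode)
  );
};
```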
410 gets returned by some proxies (including on cloud). I'm not sure about 401 / 403, but probably even if there's an authentication issue we want to retry, because someone might fix the credentials.
We could probably use migrationRetryCallCluster as a starting point. I agree exponential back-off would be good to add to it.
https://github.com/elastic/kibana/blob/test-es-deprecations/src/core/server/elasticsearch/client/retry_call_cluster.ts#L61-L92
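Very roughly, adding capped exponential backoff to a retryCallCluster-style wrapper could look like this (a sketch, not the actual migrationRetryCallCluster code; the numbers are placeholders):

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function retryEsCall<T>(
  apiCall: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  { maxRetries = 5, baseDelayMs = 1_000, maxDelayMs = 30_000 } = {}
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await apiCall();
    } catch (error) {
      attempt++;
      // Give up once retries are exhausted or the error isn't transient.
      if (attempt > maxRetries || !isRetryable(error)) {
        throw error;
      }
      // Exponential backoff, capped so retries never hammer an unhealthy cluster.
      await sleep(Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs));
    }
  }
}
```

Consumers would pass in whatever retryability predicate we end up agreeing on (e.g. the isRetryableEsError sketch above).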
Related to this, I recently stole the catch_retryable_es_client_errors implementation to add similar error handling behavior to some idempotent API calls that we make in Fleet: https://github.com/elastic/kibana/pull/118587/files#diff-33164209fde96e38b4365a9902a919b9ba5f07e8078dbf0945c4001603e3cefd
It'd be nice if we had a single source of truth for which types of errors are 'transient' and should be retried.
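One possible shape for that single source of truth, purely illustrative and not an existing core API, would be to expose the predicate from core's elasticsearch service contract:

```ts
// Hypothetical addition to the elasticsearch service setup/start contract.
export interface ElasticsearchRetryHelpers {
  /** Returns true when the failure is considered transient and safe to retry. */
  isRetryableError(error: unknown): boolean;
}

// A plugin could then reuse the shared definition instead of copying it, e.g.:
// if (core.elasticsearch.isRetryableError(e)) { /* schedule a retry */ }
```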
I found the source of why we added the 401/403 handling: https://github.com/elastic/kibana/issues/51242#issuecomment-557073980
So it's particularly relevant to migrations because they're some of the first calls to Elasticsearch and I guess it's annoying if you're setting up a cluster and Kibana keeps crashing while you're still busy setting up credentials for Elasticsearch.
But in general, I'd say if an operation fails we should rather retry it on 401/403 than to error and just drop whatever data should have been written to Elasticsearch.
Got it, that's about what I was expecting the reason to be. For our case, we're not making any of these API calls until start anyway, at which point SO migrations have already succeeded. For that reason, I chose not to retry 401s and 403s.
Thanks for doing some digging, @rudolf
Linking to https://github.com/elastic/kibana/issues/169788#issuecomment-1780772957, as the recent discussions in that linked issue are related to this one, and a potential fix there could close the current issue.
We currently do license checks when running alerting rules, to ensure the rules can be run with the given license. We track the licensing state in this code: x-pack/plugins/alerting/server/lib/license_state.ts.
We've seen errors in logs occasionally, especially when Kibana is under stress, where the license information gets set internally in our code as "unavailable". From looking at the alerting code, I think this is coming in from the license$ observable we are subscribed to. Typically these "outages" are fairly short, like 20 seconds. But for those 20 seconds, rules will fail and log errors.
Once the license info is available again, everything works normally.
It feels like a 20-second "outage" in licensing shouldn't actually cause the rules to fail. We could assume the license is the same as when it was last checked; not sure what the limit on that time frame would be. Or maybe if it goes from anything -> unavailable, we just assume it's the most recent license seen.
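If we went with the "most recent license seen" approach on the Alerting side, a minimal sketch could look like the following (the names are illustrative and this is not the current license_state.ts code):

```ts
import { Observable, Subscription } from 'rxjs';

// Stand-in for the license shape the licensing plugin emits.
interface ILicense {
  isAvailable: boolean;
  type?: string;
}

export class LastKnownLicenseState {
  private lastKnownLicense?: ILicense;
  private readonly subscription: Subscription;

  constructor(license$: Observable<ILicense>) {
    this.subscription = license$.subscribe({
      next: (license) => {
        // Only remember licenses that were actually available; when the feed
        // flips to "unavailable", keep serving the last good one.
        if (license.isAvailable) {
          this.lastKnownLicense = license;
        }
      },
      // If the observable errors, we also keep serving the last good license.
      error: () => {},
    });
  }

  public getLastKnownLicense(): ILicense | undefined {
    return this.lastKnownLicense;
  }

  public stop() {
    this.subscription.unsubscribe();
  }
}
```

A time limit could be layered on top of this, e.g. stop trusting the cached license once it's older than whatever window we agree on.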