
[licensing] intermittent "license is not available" causing alerting rules to fail to execute #117394

Closed (pmuellr closed this issue 1 year ago)

pmuellr commented 3 years ago

We currently do license checks when running alerting rules, to ensure the rules can be run with the given license. We track the licensing state in this code: x-pack/plugins/alerting/server/lib/license_state.ts.

We've seen errors in logs occasionally, especially when Kibana is under stress, where the license information gets set to "unavailable" internally in our code. From looking at the alerting code, I think this state comes in from the license observable we subscribe to. Typically these "outages" are fairly short, on the order of 20 seconds, but for those 20 seconds rules will fail with logged messages like:

Alert type siem.signals is disabled because license information is not available at this time

Once the license info is available again, everything works normally.

It feels like a 20 second "outage" in licensing shouldn't actually cause rules to fail. We could assume the license is the same as when it was last checked; I'm not sure what the limit on that time frame should be. Or maybe, whenever the state goes from anything to unavailable, we just assume the most recent license seen is still in effect.
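For illustration, a minimal sketch of what that fallback could look like. The `LicenseStateWithGracePeriod` name, the license shape, and the 60-second window are assumptions for the sake of the example, not the actual license_state.ts code:

```ts
// Sketch: fall back to the last known license during short "unavailable"
// windows. All names and the grace window below are illustrative assumptions.
interface LicenseInfo {
  isAvailable: boolean;
  type?: string;
}

const GRACE_PERIOD_MS = 60_000; // assumed limit; the issue leaves this open

class LicenseStateWithGracePeriod {
  private lastKnown?: { license: LicenseInfo; seenAt: number };

  // Called from the license$ subscription on every emission.
  update(license: LicenseInfo, now = Date.now()): LicenseInfo {
    if (license.isAvailable) {
      this.lastKnown = { license, seenAt: now };
      return license;
    }
    // License went from anything -> unavailable: reuse the most recent
    // license seen, but only within the grace period.
    if (this.lastKnown && now - this.lastKnown.seenAt < GRACE_PERIOD_MS) {
      return this.lastKnown.license;
    }
    // The outage outlasted the grace period; report unavailable as before.
    return license;
  }
}
```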

Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here, licensing would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine commented 3 years ago

Pinging @elastic/kibana-core (Team:Core)

pgayvallet commented 3 years ago

Yeah, at the moment the licensing plugin does not perform any kind of retry in case of a network error, or if ES returns an error of any kind, meaning that these (potentially temporary) upstream outages result in the license$ observable emitting an error until the next successful license fetch.

https://github.com/elastic/kibana/blob/3c8fa527a7765723d3851afe067315359a4763e7/x-pack/plugins/licensing/server/plugin.ts#L203-L213

> Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here, licensing would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

That's a good question. I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

Maybe improving createLicensePoller and/or createLicenseUpdate to perform retries on such failures would make sense. The hardest part would probably be identifying which errors should be considered retryable (similar to what we're doing with retries in the SO migration).
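As a rough illustration, a retry with exponential backoff could be layered onto the poller's fetch observable like this. `fetchLicenseFromEs`, `isRetryableError`, and the retry budget are assumed names and values, not the actual createLicensePoller internals:

```ts
import { defer, timer, throwError, Observable } from 'rxjs';
import { mergeMap, retryWhen } from 'rxjs/operators';

declare function fetchLicenseFromEs(): Promise<unknown>; // assumed fetch call
declare function isRetryableError(e: unknown): boolean;  // assumed predicate

const MAX_RETRIES = 5; // illustrative budget

const license$: Observable<unknown> = defer(() => fetchLicenseFromEs()).pipe(
  retryWhen((errors) =>
    errors.pipe(
      mergeMap((error, attempt) => {
        // Give up on non-retryable errors, or once the budget is exhausted.
        if (attempt >= MAX_RETRIES || !isRetryableError(error)) {
          return throwError(() => error);
        }
        // Exponential backoff: 1s, 2s, 4s, ... capped at 30s.
        return timer(Math.min(1000 * 2 ** attempt, 30_000));
      })
    )
  )
);
```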

cc @mshustov wdyt?

mshustov commented 3 years ago

> I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

yeah, it makes sense for the licensing API to prevent these cascading side effects.

> The hardest part would probably be identifying which errors should be considered retryable (similar to what we're doing with retries in the SO migration).

It might be better to extract this logic and provide it from the Elasticsearch service, maybe? All of the network problems can be considered intermittent; I'm not sure about the other error types handled in https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.ts. Also, we can cap the maximum number of retries and use exponential backoff, to avoid DDoSing the Elasticsearch server.
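As a sketch of what such a shared predicate might look like. Which errors and status codes to cover is exactly the open question here, so treat the set below as an assumption modeled loosely on the migration code, not a definitive list:

```ts
// Sketch of a retryable-error predicate in the spirit of
// catch_retryable_es_client_errors; the status-code set is an assumption
// (401/403 in particular are debated later in this thread).
import { errors } from '@elastic/elasticsearch';

const RETRYABLE_STATUS_CODES = new Set([401, 403, 408, 410, 503]);

export function isRetryableEsError(e: unknown): boolean {
  // Network-level failures are intermittent by nature.
  if (e instanceof errors.NoLivingConnectionsError) return true;
  if (e instanceof errors.ConnectionError) return true;
  if (e instanceof errors.TimeoutError) return true;
  // For responses, only retry on status codes assumed to be transient.
  if (e instanceof errors.ResponseError && e.statusCode != null) {
    return RETRYABLE_STATUS_CODES.has(e.statusCode);
  }
  return false;
}
```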

rudolf commented 3 years ago

410 gets returned by some proxies (including on Elastic Cloud).

I'm not sure about 401/403, but probably even if there's an authentication issue we want to retry, because someone might fix the credentials.

We could probably use migrationRetryCallCluster as a starting point; I agree exponential back-off would be a good addition: https://github.com/elastic/kibana/blob/test-es-deprecations/src/core/server/elasticsearch/client/retry_call_cluster.ts#L61-L92
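A minimal promise-based sketch of that idea, in the spirit of migrationRetryCallCluster with the suggested exponential back-off added; the retry budget and delays are illustrative assumptions:

```ts
// Reuses the isRetryableEsError predicate sketched earlier in the thread.
declare function isRetryableEsError(e: unknown): boolean;

async function retryEsCall<T>(
  call: () => Promise<T>,
  { maxRetries = 10, baseDelayMs = 1000, maxDelayMs = 30_000 } = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      // Rethrow non-retryable errors, or give up once the budget is spent.
      if (attempt >= maxRetries || !isRetryableEsError(e)) throw e;
      // Exponential backoff, capped so we don't hammer a recovering cluster.
      const delayMs = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```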

joshdover commented 2 years ago

Related to this, I recently stole the catch_retryable_es_client_errors implementation to add similar error handling behavior to some idempotent API calls that we make in Fleet: https://github.com/elastic/kibana/pull/118587/files#diff-33164209fde96e38b4365a9902a919b9ba5f07e8078dbf0945c4001603e3cefd

It'd be nice if we had a single source of truth for which types of errors are 'transient' and should be retried.

rudolf commented 2 years ago

I found the source of why we added the 401/403 handling: https://github.com/elastic/kibana/issues/51242#issuecomment-557073980

So it's particularly relevant to migrations because they make some of the first calls to Elasticsearch, and I guess it's annoying if Kibana keeps crashing while you're still busy setting up credentials for a new cluster.

But in general, I'd say if an operation fails, we should rather retry on 401/403 than error out and drop whatever data should have been written to Elasticsearch.

joshdover commented 2 years ago

Got it, that's about what I expected the reason to be. In our case, we're not making any of these API calls until start anyway, by which point SO migrations have already succeeded. For that reason, I chose not to retry 401s and 403s.

Thanks for doing some digging, @rudolf

pgayvallet commented 1 year ago

Linking to https://github.com/elastic/kibana/issues/169788#issuecomment-1780772957, as the recent discussions in that issue are related to this one, and a potential fix there could close the current issue.