elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.81k stars 8.2k forks source link

Investigate feasibility of incrementally rolling out `csp.disableUnsafeEval` #143223

Closed legrego closed 1 year ago

legrego commented 2 years ago

The cloud_experiments plugin allows for us to incrementally roll out changes to Elastic Cloud deployments. On the surface, this feels like a convenient way for us to gain more confidence in our csp.disableUnsafeEval setting.

We should investigate: 1) If it's possible to use both the existing csp.disableUnsafeEval setting alongside a dynamic setting, like the one provided by cloud_experiments. 2) If cloud_experiments is the right tool for this job, or if we should look into other ways of rolling out this change.

ping @elastic/kibana-core in case you have any thoughts or insights into question 2 above.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-security (Team:Security)

pgayvallet commented 2 years ago

in case you have any thoughts or insights into question 2 above

Don't mind if I also share my thoughts on question 1.

If it's possible to use both the existing csp.disableUnsafeEval setting alongside a dynamic setting, like the one provided by cloud_experiments.

FWIW, I don't like much having a core service depending on a non-core config property, as we will need to implement yet another hook / IOC API to allow the plugin to register to Core to communicate the value coming from its own config. IIRC disableUnsafeEval is managed strictly internally by Core's http service

If cloud_experiments is the right tool for this job, or if we should look into other ways of rolling out this change.

++. Why can't we just have the rolling be performed by changing the csp.disableUnsafeEval config value per-deployment in the first place?

Also, looking at the implementation of the cloud_experiments plugin, my observations:

  1. The HTTP/CSP config is parsed before plugins are even in their setup phase, so leveraging an IOC'ed plugin API at that moment wouldn't be a possibility. We would need somehow to refactor our CSP system to support after-init updates.

  2. There is no preboot counterpart of the plugin, which would be blocker to leverage the API in Core (as we're booting an http server during preboot too.)

  3. getVariation is async, which might be problematic given the way / moment we're parsing the CSP config

legrego commented 2 years ago

@pgayvallet, I appreciate you taking the time to give this some thought!

FWIW, I don't like much having a core service depending on a non-core config property, as we will need to implement yet another hook / IOC API to allow the plugin to register to Core to communicate the value coming from its own config. IIRC disableUnsafeEval is managed strictly internally by Core's http service

Yep that's a good point, thanks for raising this. I sometimes forget that not all of my team's code lives within x-pack plugins.

++. Why can't we just have the rolling be performed by changing the csp.disableUnsafeEval config value per-deployment in the first place?

My understanding is that the control plane can only turn this on/off on a per-version basis. So we could instruct the control plane to set this to true for all versions >= 8.6 (for example), but that's not really a rolling change, as it impacts all clusters past a certain version.

The HTTP/CSP config is parsed before plugins are even in their setup phase, so leveraging an IOC'ed plugin API at that moment wouldn't be a possibility. We would need somehow to refactor our CSP system to support after-init updates.

Yeah, other than the preboot issue you outline below, I don't think this would be too problematic to refactor, but you'd know a lot better than I would.

There is no preboot counterpart of the plugin, which would be blocker to leverage the API in Core (as we're booting an http server during preboot too.)

Bah, whose idea was it to introduce a preboot mode with its own http server? 😬 🙈. Preboot isn't really needed on ESS, so we could conveniently ignore this for the time being, and accept the fact that the preboot http server may use a different CSP than the actual http server. That said, the other issues you identified might make this a moot point.

getVariation is async, which might be problematic given the way / moment we're parsing the CSP config

Yeah I noticed that, too...

afharo commented 2 years ago

My understanding is that the control plane can only turn this on/off on a per-version basis. So we could instruct the control plane to set this to true for all versions >= 8.6 (for example), but that's not really a rolling change, as it impacts all clusters past a certain version.

FWIW, there is a LaunchDarkly service in the Elastic Cloud adminconsole codebase. I'm not 100% sure if it's planned to be used for rolling changes to kibana.yml, although, IMO, it would be a good use case :)

pgayvallet commented 2 years ago

Bah, whose idea was it to introduce a preboot mode with its own http server?

Please don't get me started on this :trollface:

Preboot isn't really needed on ESS, so we could conveniently ignore this for the time being, and accept the fact that the preboot http server may use a different CSP than the actual http server.

That's mostly a security concern here. If you're fine with it, I won't object.

I don't think this would be too problematic to refactor, but you'd know a lot better than I would.

I would need to take a closer look, but if we decide to omit the preboot server for this functionality, I don't see anything else technically strictly blocking.

The thing is, if we do implement a hook API for this new plugin to propagate this config value to Core, ideally this API should be called during the setup phase of the plugin. but given plugins lifecycles are synchronous, and because getVariation is async, I'm not sure this would be doable?

Which means the API would have to be called during start (and potentially after the http server is ready to serve requests given we'll be awaiting for getVariation), which doesn't sound ideal.

But once again, not technically strictly blocking.

WIW, there is a LaunchDarkly service in the Elastic Cloud adminconsole codebase. I'm not 100% sure if it's planned to be used for rolling changes to kibana.yml, although, IMO, it would be a good use case :)

I agree, if this is a possibility, it would be the most pragmatic and practical one to me.

afharo commented 2 years ago

The thing is, if we do implement a hook API for this new plugin to propagate this config value to Core, ideally this API should be called during the setup phase of the plugin. but given plugins lifecycles are synchronous, and because getVariation is async, I'm not sure this would be doable?

I'd like to add that, the whole point of A/B testing Feature Flags is that they can change dynamically. Based on targeting rules like (in trial or has data,) or changing the experiments, and/or bumping the rollout percentage like in this case.

If we decided to hook it to the core's config service, it should apply to the live updates observable config$ and plugins/services would need to subscribe to these updates.

Also, how would overrides work? Do Feature Flags always take preference over the settings in kibana.yml? The other way around? When designing the A/B testing service, we decided there isn't a one-solution-fits-all scenario and the complexity was not worth it. It's more explicit to let the plugins/services decide their preferred behavior by explicitly calling both APIs: the config service and Cloud Experiments. Although, we can revisit those assumptions if we find they are no longer valid.

If we want hot reloading for Feature Flags, we could extend the Cloud Experiments to provide an observable. The client allows subscribing to updates. It's a matter of exposing them in an API.

FWIW, there is a LaunchDarkly service in the Elastic Cloud adminconsole codebase. I'm not 100% sure if it's planned to be used for rolling changes to kibana.yml, although, IMO, it would be a good use case :)

I agree, if this is a possibility, it would be the most pragmatic and practical one to me.

The config value we are discussing here, csp.disableUnsafeEval, is static at the moment (loaded from the config only once and not subscribed to hot reloading updates). That implies restarting Kibana whenever it changes (unless we change our implementation). IMO, if all we want is to leverage this Feature Flag for progressive rollouts, it makes more sense to me to evaluate the rollout % when updating/deploying Kibana. WDYT?

But I guess we'd need to validate if this is the intended use for the LaunchDarkly implementation in the adminconsole. We could probably check with the Control Plane Applications or the Platform Deployment Management teams?

cc @arisonl as he's been involved in the LD discussions.

lukeelmers commented 1 year ago

If cloud_experiments is the right tool for this job, or if we should look into other ways of rolling out this change

Just stumbled across this issue. I am somewhat skeptical that LaunchDarkly is the tool for this job (or at least, that the current LD implementation in Kibana is the right way to handle this). I know it can be used for progressive rollouts, but we've mostly just been using it for A/B testing scenarios in the UI. It is not clear to me if the intent is to use it for canary deployments in the long term. It also feels a little weird that csp.disableUnsafeEval is basically already its own config-based feature flag, which we'd be looking to have LD override at runtime and dynamically change to another value.

More importantly, even if we could do it technically, I wonder if it is really worth the effort to add all of the necessary hooks into core, worry about sync/async lifecycles, inconsistency between the preboot and standard server, etc etc -- all as a workaround for handling this at the control plane level.

there is a LaunchDarkly service in the Elastic Cloud adminconsole codebase. I'm not 100% sure if it's planned to be used for rolling changes to kibana.yml, although, IMO, it would be a good use case

Agreed, if we are going to use LD, I think this approach would be preferable to handling it Kibana-side.

My understanding is that the control plane can only turn this on/off on a per-version basis

Control plane is also able to roll out changes per-region, which might be a better place to start and IMHO feels like a more natural solution than LD in Kibana or even LD in adminconsole (just my 2c)

legrego commented 1 year ago

Thanks everyone for your valuable and insightful feedback. In light of this, combined with @watson's progress on supporting inline partials in Handlebars, I no longer believe that we need an incremental rollout of this feature.

I'm going to close this issue, but will re-open if our needs change. Thanks again!