Closed gmmorris closed 4 years ago
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I've done some early investigation and it seems there are a lot of moving parts to exempting the older Alerts from RBAC all together or over their Action execution, but I think I've found a path we could explore further, I just want to run it by the team before actually trying to implement it.
First off, regarding option 3 of a migrating path, I'm not actively investigating that, as it sounds like there's no way we could rush that kind of thing for 7.10 (unless @legrego can pull a rabbit out of a hat).
Regarding options 1 & 2, it looks like either way we go, we would need to add arguments to the Actions client that could potentially be abused to bypass RBAC in other cases. I'd like to try and make option 1 work, as it means we only exempt RBAC in the minimal needed case in order for Alerts that were created by Solution users inside that solution to continue working after the upgrade.
As we won't be able to remove this exemption until 8.0.0 (which means both the potential security hole and the code debt impact will be around for a while), I'm looking for a pathway that will allow the ActionsClient itself to make the assessment as to whether a certain execution should be exempt from RBAC.
To achieve that, I don't think we can rely on the AlertsClient telling the ActionsClient that a specific execution is exempt, because that could be used by any and every other plugin in the future to get around security checks - a clear security hole. Instead, what I'm exploring is a relationship where we use the SavedObjects references to connect between an action params and a related SavedObject that is the source of the execution- in our case, the scheduling Alert.
This way, the ActionsClient can look at the source Saved Object type and ask the associated plugin (in our case, the Alerts plugin) whether the user should be allowed to execute the action without applying RBAC to the get
and the execute
operations. This model means that it'll be harder to use the security hole , as you'd need to somehow trick both the source plugin and the actions plugin into believing you're allowed to exempt this user (rather than making as easy as an argument to execute
that says no security please).
This approach has a couple of complications to consider:
alert
SO type with the Actions client (much like we've already done for the Event Log), which is more complexity to maintain long term, but also opens up future extensibility where plugins can provide custom configurations for the actions executed on their behalf. (I'm not sure if we have future requirements that could piggy back off of this, please give a shout if you know of any).meta
field which we can add attributes to in order to track properties of the alert. Such as setting rbac
to legacy
as part of the 7.10 migration. We can then set rbac
to be the current version or omit it all together when the alert is updated which will mean from that moment on it'll no longer be exempt. We will include the meta
object in AAD, which will prevent users from being able to manually hack their Alerts out of RBAC.For example imagine this Alert SavedObject:
{
"id": "40596401-0658-4f91-81ea-d55de42d8bd9",
"type": "alert",
// ...
"attributes": {
// ...
"meta": {
"rbac": "lecagy"
},
},
"references": [],
"migrationVersion": {
"alert": "7.10.0"
}
}
We wouldn't expose meta
to the outside world, it would only be visible to the Alerts plugin (on RawAlert
) and omitted from the exposed Alert
type.
If we decide not to take the path of attaching a reference from the action params back to the alert when it's scheduled, then we would need to rely on the Alert to tell us (when it schedules the action execution) whether to apply RBAC or not. The problem with this is that there's nothing preventing other plugins from doing this whenever they schedule actions for execution and that seems like a rather blatant security hole that is very easy to abuse.
If we take this approach we'll need to add an argument to ActionsClient.enqueueExecution
and to ActionsClient.get
to allow it to skip the RBAC for the execute
operation, and we'll also need to add something to the alerts params Saved Object telling the task runner to tell the actions client to skip RBAC when calling get
. Both these could be used by any plugin to totally skip RBAC when get
ing an action SO or enqueuing execution.. which I'm very uncomfortable with (but I felt I had to suggest it anyway, so we can make an informed decision).
Point number 3 above is relevant to this one too - we'll have to migrate every Alert and mark it as well.
Both the approaches I've laid out above will achieve the same end result in terms of the Alerts. Here is the table from the original issue https://github.com/elastic/kibana/issues/70851 with an additional column describing the situation after this work around:
Scenario | What Will Happen if we don't change anything? | What will happen with this work around? |
---|---|---|
A user with Custom All access to the Metrics solution created an Index Threshold alert through Alerts Management in 7.7/8 | As they are not granted any privileges to the Actions & Built-In Alerts features, their Api Key will not have the required privileges and the Alert will break when it tries to run | This will still be broken after the upgrade as we do not exempt the Alerts themselves from RBAC and the user won't have the ability to access this Alert. |
A user with Custom All access to the Metrics solution created a Metrics Inventory alert through the Metrics solution in 7.7/8/9 with an Email action on it |
The user's privileges to Metrics applies to the Metrics alerts, so they can run the Alert, but as they are not granted any privileges to the Actions feature, their Api Key will not have the required privileges to execute any actions that have been configured for the alert. The Alert itself will run, as it relies on their privileges to Metrics, but when the actions are scheduled, they will fail to execute. | This Alert will continue to work as it did before the upgrade. This is because we will exempt the Action scheduling and execution from RBAC, allowing the alert to continue to work. The user won't be able to change the alert until they're granted access to the Actions feature, but once they have that and modify the Alert, a new ApiKey will be applied and the exemption will be removed. |
A user with Base Read permissions created an Alert in 7.7/8/9 | The user will inherit Read privileges to the Actions & Built-In Alerts features, which means that their existing alerts will continue to work and their actions will continue to execute, but they will no longer be able to modify them | Same as before and that's a good thing, as this is by design. |
The bottom line is this - this remediation will ensure that any Alert created by a Solution user inside of a Solution that they have been granted privileges to, will continue to work as expected after the upgrade, but those users will have their abilities limited until they are granted access to the Actions plugin by an administrator.
The down side is we'll have additional edge cases to support going forward, and we'll have to find ways to automate the testing of "permissions-post-migrations" which is something we do not seem to currently have a way of doing. Automating this is critical to ensure we don't accidentally break this exemption unintentionally in the future (it would be enough for us to add another get
for an SO somewhere in our execution workflow to accidentally break the assumptions behind this security exemption, and we need to be very careful going forward.
This is technical dept we'll willingly incur until 8.0.0 in order to support our solutions, but it's important to recognise that it doesn't come without a cost.
use the SavedObjects references to connect between an action params and a related SavedObject that is the source of the execution- in our case, the scheduling Alert.
I think we're going to want something like this eventually anyway. We really need it for event log, so when we execute an action we can identify the "source" - which today is only alerts or http request - and add it in the action execution event document. Today, we can't trace back from an action execution to the alert that caused it, directly.
One reason I've deferred implementing this is it's not clear what the "source" could be, long term, and I'm not sure the best way to model both "an alert saved object" and "an http request" in the same piece of data.
My plan had been to pass something "extra" to the action executor, somehow, to indicate this source, which sounds exactly what you've figured out - action execution params feels like the right place (where else?!?!).
Yup, agree with all of those thoughts. :) Thanks @pmuellr
My plan had been to pass something "extra" to the action executor, somehow, to indicate this source, which sounds exactly what you've figured out - action execution params feels like the right place (where else?!?!).
The Action Params ar ethe only thing that really connects to an execution so that feels like the right place as far as I can tell.
I've been thinking more recently that we probably want to change our "alert execution and subsequent action execution" model where we queue actions to run during alert execution. Longer term, we need to be able to pipe data from one action into another (eg, create github issue, and include a link to that issue in a subsequent email or slack message). For reference, today we just queue up all the actions that are going to be run for an action group, in parallel. Now imagine trying to serialize that queued execution - we will need some kind of "step runner" to manage this. "It's complicated."
You know what would be a lot easier? Just run the actions in the same task manager task as the alert executor. This makes our task usage coarser - instead of many fine-grained tasks, we'll have fewer big hairy tasks.
Lots of potential sharp edges here, but one thing I'm wondering about is: would that help in this case, if all the action executors were run in-process with the alert executor?
One reason I've deferred implementing this is it's not clear what the "source" could be, long term, and I'm not sure the best way to model both "an alert saved object" and "an http request" in the same piece of data.
Yeah, this had occurred to me. I don't think utilising the references as they are prevents us from migrating this to another place at a later point if we need to. With that in mind, this feels like an approach that won't lock us in in the future.
Any objection?
Lots of potential sharp edges here, but one thing I'm wondering about is: would that help in this case, if all the action executors were run in-process with the alert executor?
Yes, it would help alot, but that feels like a much bigger change to rush through on a 7.10 blocker :/
I don't think utilising the references as they are prevents us from migrating this to another place at a later point if we need to. With that in mind, this feels like an approach that won't lock us in in the future.
Ya, WORKSFORME. Event log already has slots for SO's space/type/id, that are used when we have SO's to reference. I've been thinking we can come up with some kind of "extension" pattern for things that aren't SO's, using spaceIds &| types which aren't valid for SO's, or perhaps an additional field indicating additional types of references, in the nested properties that holds the SO's. Eg, when an action is executed via HTTP request, the SO references in the event would include a SO reference that somehow indicates it was via HTTP, perhaps including an HTTP request id, etc.
Lots of potential sharp edges here, but one thing I'm wondering about is: would that help in this case, if all the action executors were run in-process with the alert executor?
Yes, it would help alot, but that feels like a much bigger change to rush through on a 7.10 blocker :/
Yes, we've done literally no scoping/sizing for this sort of change - I'd guess it would greatly simplify the inner workings of alerting/actions, and allow us to remove some code / artifacts, but will also mean we have to write some new code, and more importantly figure out the hard parts like ... retries and timeouts.
Thought I'd throw it out though.
Have we considered whether it would be possible to run some kind of "fix-up script" after a migration to 7.10, to ... fix up things? Eg, something a customer would run against their own deployment. I'm not sure I remember hearing any mention of this. And the script could end up just being an http entrypoint we add to alerting, where all the "stuff" happens in the route handler.
I have no idea what this script might actually do, perhaps given the problem, it's not really possible.
It's certainly not the kindest way to deal with this issue, but if such a script were possible, it would mean not having to deal with "legacy" issues in the main code base. And I think with the view that alerting is still "beta", it's not an unreasonable way to solve the problem (if solvable this way).
Rather than an rbac
property in meta
(as proposed above), how about just a versionCreated
property, which could be generally useful for other purposes. A missing field would indicate pre 7.10.0. Slightly sad the migrationVersion
can't be re-used here, I think it only gets populated on an actual migration? Or maybe that would be good enough for this purpose?
same structure as above, but with versionCreated
instead of rbac
:
{
"id": "40596401-0658-4f91-81ea-d55de42d8bd9",
"type": "alert",
// ...
"attributes": {
// ...
"meta": {
"versionCreated": "7.10.0" // field not present for < v 7.10
},
},
"references": [],
"migrationVersion": {
"alert": "7.10.0"
}
}
Have we considered whether it would be possible to run some kind of "fix-up script" after a migration to 7.10, to ... fix up things? Eg, something a customer would run against their own deployment. I'm not sure I remember hearing any mention of this. And the script could end up just being an http entrypoint we add to alerting, where all the "stuff" happens in the route handler.
I have no idea what this script might actually do, perhaps given the problem, it's not really possible.
It's certainly not the kindest way to deal with this issue, but if such a script were possible, it would mean not having to deal with "legacy" issues in the main code base. And I think with the view that alerting is still "beta", it's not an unreasonable way to solve the problem (if solvable this way).
Not something the Alerting team could do - this would require some kind of Privileges and Features migrations and there's nothing like that on the books for @elastic/kibana-security as far as I'm aware. It's also worth noting - we don't really know who the user who created and Alert is, let alone whether they should have access to the Actions feature or not... So it's not just about migrating... it's that we don't know who to migrate. 🤷
Rather than an
rbac
property inmeta
(as proposed above), how about just aversionCreated
property, which could be generally useful for other purposes. A missing field would indicate pre 7.10.0. Slightly sad themigrationVersion
can't be re-used here, I think it only gets populated on an actual migration? Or maybe that would be good enough for this purpose?same structure as above, but with
versionCreated
instead ofrbac
:{ "id": "40596401-0658-4f91-81ea-d55de42d8bd9", "type": "alert", // ... "attributes": { // ... "meta": { "versionCreated": "7.10.0" // field not present for < v 7.10 }, }, "references": [], "migrationVersion": { "alert": "7.10.0" } }
I did investigate in that direction.
First thing to keep in mind, it's not versionCreated
but rather versionLastUpdated
- and I did consider that instead of rbac
, but realised it seems like a weird seemingly unnecessary duplication of migrationVersion
.
We can't use migrationVersion
as an both Alert created in 7.10 and alerts alert migrated as part of 7.10 will have the exact same migrationVersion
😫
The reason I'd prefer not to go with an empty value equalling exemption is that it means that the default path is to exempt from RBAC- which feels less secure and actually harder to maintain further down the line. I'd rather the exempt alerts are special cased based on additional existing data, rather than simply exempt by default whenever we forget to set the version.
That said, we can migrate the old ones to just be:
versionLastUpdated: "7.9.0"
(even thought it isn't necessarily true) or versionLastUpdated: "pre-7.10.0"
and then use that value to decide whether to exempt the user.
I don't mind going with the version rather than the rbac:legacy
approach, it's six of one and half a dozen of the other as far as I'm concerned. I'm just looking for a way of asking "should I exempt this alert from RBAC?"
we don't really know who the user who created and Alert is
In theory we do know who created and last updated alerts - but not actions - via these fields in the SO:
I doubt that gets us very far here though.
In lieu of some magic "fixer-upper" script, would some kind of script be helpful that could analyze their current alerts / actions, with some advice on how to "fix" things? I often see complex tools offer a "doctor" tool that does this kind of stuff.
I'm sensitive to helping customers who have been using alerts, to keep their alerts running when they install 7.10 (or other future versions), but at the same time don't want to have to overly complicate our code to have to deal with our beta issues for a looong time (forever? what happens to a customer who migrates from 7.8 to 8.0?), so grasping at straws here! :-)
I'm sorry to say I've been back and forth on this with Security, and I'm afraid it doesn't sound like there' something we can provide to do this... 🤷
This is going to sound strange coming from the security team, but I'm honestly not opposed to exploring The less secure option, but far simpler approach
.
If the only objection here is the ability for other plugins to bypass security controls, then I'm comfortable with that tradeoff. We have a similar mechanism in place for the saved objects client, where you can create one without the Security Wrapper, thereby removing RBAC from the equation. It takes an explicit action do to this though, and isn't something that can be done accidentally. We can help educate consumers by giving the configuration option a scary name, akin to React's dangerouslySetInnerHTML
.
I'm also not concerned about third-party plugins at this point either. If an attacker has the ability to install an arbitrary plugin, then they can do far more damage than just bypassing this specific check.
First off, regarding option 3 of a migrating path, I'm not actively investigating that, as it sounds like there's no way we could rush that kind of thing for 7.10 (unless @legrego can pull a rabbit out of a hat). Not something the Alerting team could do - this would require some kind of Privileges and Features migrations and there's nothing like that on the books for @elastic/kibana-security as far as I'm aware.
Nothing on the books, but I've been giving this some thought as time permits. I have a placeholder issue here which I need to flesh out more, but you're right that this isn't something we'd be able to pull off for 7.10: https://github.com/elastic/kibana/issues/68814
Sorry for the delay in updates but I'm now back from my travels :)
I have just opened a PR for the Alerting team to review which I hope will address this issue. The broad strokes are as follows:
The key challenge in exempting legacy alerts from RBAC is that Alerts and Actions are, by design, two independent plugins. This means that allowing Alerts to tell Actions to exempt certain operations from RBAC could open up the ability for any plugin to exempt an operation from RBAC, which opens up a severe backdoor in our Actions plugin. The alternative would be for Actions to have knowledge of legacy Alerts and have it make the decision independently. This is a far more secure approach, but it could introduce technical debt which we could slow us down in the future (as it forces us to spread the domain knowledge of Alerts into Actions, and creates a dependency between the two that we now need to manage).
Discussing the options with members of the team, the feeling was that the more secure approach is the preferable one, despite the additional cognitive load and technical debt it introduces.
To that effect, the PR introduces the following:
meta
object on the Alerts SavedObject which will hold internal data about the alert that we do not wish to expose. It will not be visible to users of the framework and cannot be updated by them, and it is included in the AAD of the SavedObjects, which means that if anyone tries to fiddle with that data it will essentially break the alert. As part of the migration to 7.10.0 we now add a field called versionApiKeyLastmodified
on the meta
object with the value: pre-7.10.0
, to designate them as Legacy. This means we will now be migrating every single Alert as part of the upgrade to 7.10.0, which is less ideal, but is the safer option (as explained here: https://github.com/elastic/kibana/issues/74858#issuecomment-677550550 )source
argument to the ActionClient.execute
and ActionClient.enqueueExecution
methods which describes the source of execution of the action. This can be either an HTTP request (KibanaRequest
) or a Saved Object reference, which is used to allow us to map an action execution back to its origin. This is something we have been discussing adding anyway to improve our event logging and can be leveraged for this in a future PR. In the case of a SO source, this is saved on the action_task_params
object as a standard SO reference. In the case of an HTTP request we do not store it in anyway, but I didn't want to leave a half implementation here where we can specify a SO source but not an HTTP request, even if it is only available up until the action_task_params
is created.action_task_params
. If this SO is of any type other than an Alert
then it continues as normal. If it is an Alert, it fetches the SavedObject using the unsecuredSavedObjectsClient
(which means, no permissions will be checked on this get
) and checks to see if it is a Legacy alert. If it is a Legacy alert, then the ActionsClient initialised by the TaskRunner is marked as shouldUseLegacyRbac = true
. This tells it to skip authorization on the execute
and get
operations - enabling legacy Actions to continue to run whether the attached ApiKey is allowed to do so.versionApiKeyLastmodified
field on meta
which means that once a Legacy alerts is updated, it will no longer be considered Legacy. This also means that only a user with the required permissions (All
for both the Alerting related plugin , which changes depending on the specific Alert, and the Actions
) will be able to update the Alert and from that moment on the alert will run under that user's permissions.My testing suggests this will allow all the alerts created prior to 7.10.0 to continue to run, as long as the user who created them is still privileged to the underlying Alert. This means that if a user somehow created a SIEM rule without having privileges to SIEM, then their Alert will not work after the upgrade. This also means that any Index Threshold alert created by a user with custom privileges will remain unable to run after the upgrade, as they would need to explicitly have privileges to the Built-In Alerts feature for them to run. This is by design and seems like a reasonable end state to be at in regards to mitigating the impact of introducing RBAC into the system.
Just a couple of thoughts for moving forward:
This is going to sound strange coming from the security team, but I'm honestly not opposed to exploring
The less secure option, but far simpler approach
.If the only objection here is the ability for other plugins to bypass security controls, then I'm comfortable with that tradeoff. We have a similar mechanism in place for the saved objects client, where you can create one without the Security Wrapper, thereby removing RBAC from the equation. It takes an explicit action do to this though, and isn't something that can be done accidentally. We can help educate consumers by giving the configuration option a scary name, akin to React's
dangerouslySetInnerHTML
.I'm also not concerned about third-party plugins at this point either. If an attacker has the ability to install an arbitrary plugin, then they can do far more damage than just bypassing this specific check.
First off, regarding option 3 of a migrating path, I'm not actively investigating that, as it sounds like there's no way we could rush that kind of thing for 7.10 (unless @legrego can pull a rabbit out of a hat). Not something the Alerting team could do - this would require some kind of Privileges and Features migrations and there's nothing like that on the books for @elastic/kibana-security as far as I'm aware.
Nothing on the books, but I've been giving this some thought as time permits. I have a placeholder issue here which I need to flesh out more, but you're right that this isn't something we'd be able to pull off for 7.10: #68814
Thanks Larry :) Just for posterity - the chosen approach (as specified in the comment above) fell somewhere in the middle between the two options.
The less secure option didn't actually prove possible in the end, as the actual execution of actions happens asynchronously to request to execute, making it impossible to tell the Actions client to skip auth without also storing that exemption in some manner. The chosen approach means that the thing we store (the source) is of value beyond exempting the legacy alerts from RBAC, as we can use it for our Event Logging and future features, and also addresses the RBAC issue.
Additionally, it feels pretty secure and equally important - it doesn't entice plugins using the Actions plugin by offering them an easy way to skip security (which TBH was my bigger concern ;) ).
Following up from the https://github.com/elastic/kibana/issues/70851 issue, where we discussed the fact that under certain scenarios existing alerts will stop working after the 7.10 upgrade, we need to investigate way to mitigate this.
A few ideas to explore: