andrewvc opened this issue 4 years ago
This is super interesting; we have been discussing this option for a while in the Metricbeat context as well.
This is a really great idea! At my last job we used nagios heavily, and it was universally despised for a number of reasons.
Setting the UID properly will be important. However, I would be wary of making whitelists/script config overcomplicated in the name of security. At my last job we had a huge and ever-growing list of custom script-based checks; if someone needs to tweak more than one or two things in order to migrate a given check, it'll be a pretty painful rollout process.
@fearful-symmetry thanks for the support!
You're right that a strict whitelist could be a stumbling block for a lot of users, so we'll have to do an in-depth threat model to determine what the right balance is here.
I do understand the flexibility it provides, the possibilities of composition, and the reusability of existing user scripts when deploying Heartbeat. But the security aspect worries me a bit, especially in the context of central management or Fleet.
Now, there are a few things on my mind.
Can I try to reframe the requirements?
Securing the execution of an arbitrary process is hard.
@ph thanks for the considered response. I need some time to think about some of the security points. WRT a sandboxed API, there isn't a way I can see of using one to cover the use cases mentioned in this issue, such as:
While it is true we could add client libraries for these tasks, the goal here is flexibility, the ability to say "Well, we don't support that directly, but you can do it yourself with a small script". For a lot of our audience, even if these problems were solvable with JS or Lua, those languages are beyond their technical ability.
I'll add here that Nagios has existed for a long time with this model, and there aren't complaints about its security, so I think it is doable.
We could also advise that when enabling this feature it be done in the context of a secure VM as well, where gaining root would have minimal impact.
> I'll add here that Nagios has existed for a long time with this model, and there aren't complaints about its security, so I think it is doable.
This is true for Nagios. Are there remote management tools for that? I am concerned about the threat model here.
> We could also advise that when enabling this feature it be done in the context of a secure VM as well, where gaining root would have minimal impact.
We can do that, yes.
Also, running in containers might reduce the risk and possible escalation.
I am going to give it more thought.
It's clear that the combination of any kind of script execution and fleet represents an attack vector. That said, I think we can come up with good defaults that are safe enough, and allow the user to override them with the proper warnings. For instance:
I think we need to consider this feature to possibly have the following impacts:
In the case of local configuration, I do not see a problem with the above: having this feature is fine when an admin with access to the machine makes the decision. We assume that each machine is secured independently. Now, with the move to remote configuration, we increase what can be gained by compromising the control plane (Kibana) or the data storage. So yes, disabling this by default on both the Agent and the control plane needs to be done.
Drop privileges and run scripts as a non-root user
This can help with: 1, 2.
Maybe we could have a subcommand on the Beats, so that you could do beats register myscripts --user X --group B, which takes more information (a signature?) about the command and allows it to be managed by the process. We can make it work, but I think we should add some kind of process even if it's more cumbersome for the user.
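To make the registration idea concrete, here is a purely hypothetical sketch of what such a locally stored registration record could capture; the subcommand, the file, and every field name are assumptions rather than existing Beats features:

```yaml
# Hypothetical output of a local `beats register` step (illustrative only).
registered_scripts:
  - name: check_postgres
    path: /etc/heartbeat/custom-checks/check_postgres.sh
    user: hb-checks          # non-root user the script would run as
    group: hb-checks
    sha256: "<digest captured at registration time>"
```

The point of recording the owner, group, and a signature locally would be that Fleet could reference a registered script by name but never define or alter these fields remotely.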
I haven't looked yet at other systems that permit this kind of behavior remotely.
There are limits to what we can secure by adding this execution model; there are other things we have to discuss, as I've mentioned previously, but I don't have all the answers:
Also, let's say that we enable some of it to be controlled by the UI. We will certainly want some granularity on users, to either permit or not permit adding "script" to a data stream.
cc @ruflin I think you need to be aware of that issue, see my points above.
@andrewkroh Perhaps you can also chime in here because if I remember correctly you were thinking about something similar a few years ago.
When we were last discussing a similar feature we were thinking at a minimum to have a security model similar to that of suEXEC (see the model at http://httpd.apache.org/docs/2.4/suexec.html). The only threat model we were considering was privilege escalation by a user that already had machine access.
In general there are more threat models to consider when pairing any Beat with Fleet, as ph mentioned. For example, even without considering custom executable checks, there may be URLs that Heartbeat should be prohibited from monitoring (via remote configuration), like link-local metadata services (which can lead to privilege escalation).
I think the idea of requiring the custom execution check to be registered locally along with some sandbox parameters could be effective. These sandbox parameters would be immutable via Fleet. For example, you would register the script along with the restrictions that should be in place (like a list of paths it can access or a BPF filter expressing the allowed network egress traffic). (BTW systemd does sandboxing extremely well for things it executes; for some examples, look up IPAddressAllow, IPEgressFilterPath, and ReadWritePaths.)
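For reference, here is a minimal sketch of the systemd primitives mentioned above, applied to a hypothetical check-runner unit; this only illustrates what systemd can enforce, not something Heartbeat ships or configures today:

```ini
[Service]
# Run the check as a dedicated non-root user.
User=hb-checks
# Mount the OS read-only for this unit; only one path stays writable.
ProtectSystem=strict
ReadWritePaths=/var/tmp/hb-checks
# Default-deny network egress, then allow only the monitored range.
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8
ExecStart=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10%
```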
@andrewkroh Would there be any issues if we favored convention over configuration? If we just had a directory of whitelisted scripts at a path like /heartbeat/config/dir/scripts, for instance.
Since that directory is only configurable from the CLI/init scripts, it would be impossible to override via Fleet AFAIK, and it would be a lot less hassle than registering scripts etc. We would also make it impossible to write to that directory via log output (which is something we'd need to do regardless).
I think if we only allow local scripts (no remote installation) and scope the choice of scripts to those in a dedicated directory, this would help prevent an RCE. Someone who gets API access cannot remotely install and execute a script, and cannot execute arbitrary tools to give themselves a remote shell.
Consul had this problem at one point (it allows script checks for services). https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/
Nagios executables are usually installed by the package manager into specific directories such as /usr/lib64/nagios/plugins. It may make sense for the whitelist setting to be an array.
If we limit execution to local scripts, sandboxed to a directory, with local whitelisting, and we don't allow Fleet to deliver the scripts to the machine, I think I will be OK with that.
I think we should also make it off by default in Fleet and make it an opt-in feature; we could also make it opt-in per policy. @ruflin I think we have found a new Beats-specific configuration.
Is your feature request related to a problem? Please describe.
People sometimes want to check a thing that is not an ICMP/TCP/HTTP endpoint. See https://github.com/elastic/uptime/issues/80 and some ERs from customers. Unfortunately, their needs are diverse: for instance, checking through SSH that a certain process is running, checking that a Postgres database is alive, or running a custom executable and checking its output.
The only way to check this today would be to write an HTTP or TCP daemon that executed that check and translated it into an HTTP status.
Describe the solution you'd like
We should support the execution of custom binaries/scripts to enable users to make anything checkable in the Uptime UI via Heartbeat. We could support two flavors of this, both simple to implement:
Approach/Phase 1: exec check
A simple check that judges up/down based on the exit status of the process, and optionally by matching a string in stdout/stderr.
Sample
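The sample snippet appears to have been lost in this rendering. A minimal sketch of what an exec monitor could look like follows; the exec type and every field name here are illustrative assumptions, not an existing Heartbeat schema:

```yaml
heartbeat.monitors:
  - type: exec                    # hypothetical monitor type
    id: postgres-alive
    schedule: '@every 30s'
    cmd: ["/etc/heartbeat/custom-checks/check_postgres.sh"]
    timeout: 10s
    check.stdout: "OK"            # optionally also require this string in stdout
```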
Approach/Phase 2: nagios check
A new check type based on the Nagios Plugin API to support the broad universe of existing plugins. We can potentially coordinate with @PhaedrusTheGreek, who maintains https://github.com/PhaedrusTheGreek/nagioscheckbeat. He's enthusiastic about the idea of adding this functionality directly to Heartbeat, and reports that there's well-tested, well-factored code we could borrow here.
heartbeat.monitors:
heartbeat.whitelist:
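The two config fragments above look truncated in this copy. As a hedged sketch of how they might fit together, with all field names being assumptions rather than real settings:

```yaml
heartbeat.monitors:
  - type: nagios                  # hypothetical monitor type speaking the Nagios Plugin API
    id: disk-usage
    schedule: '@every 1m'
    cmd: ["/usr/lib64/nagios/plugins/check_disk", "-w", "20%", "-c", "10%"]

heartbeat.whitelist:              # hypothetical: only scripts under these paths may run
  - /usr/lib64/nagios/plugins
```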
Describe alternatives you've considered
I've considered not using the Nagios plugin API, but rather a custom API. However, given the wide universe of Nagios plugins and the simplicity of the spec, it would make for a good starting point.
Security Strategy
Executing arbitrary binaries over something that may be centrally configured, such as heartbeat, comes with some security risks. We should go in-depth here before implementing, defining a thorough threat model since the potential for attack here is high.
Some initial thoughts:
We could have a directory next to heartbeat.yml, called say custom-checks, that we check for the proper perms before using. Users could simply add scripts to this folder. We'd require that the perms on the directory only allow writes from users who are not the heartbeat user.
We could exec the scripts through a wrapper (such as the sudo command) to ensure that we are running as an alternate UID.