Custom executable checks

andrewvc commented 4 years ago

Is your feature request related to a problem? Please describe.

People sometimes want to check a thing that is not an ICMP/TCP/HTTP endpoint. See https://github.com/elastic/uptime/issues/80 and some ERs from customers. Unfortunately their needs are diverse, for instance checking through SSH that a certain process is running, checking that Postgres database is alive, or running a custom executable and checking its output.

The only way to check this today would be to write an HTTP or TCP daemon that executed that check and translated it into an HTTP status.

Describe the solution you'd like

We should support the execution of custom binaries / scripts to enable users to make anything check-able via the Uptime UI via Heartbeat. We could support two flavors of this, both simple to implement:

Approach/Phase 1: exec check A simple check that judges up/down based on the exit status of the process, and optionally by matching a string in stdout/stderr.

Sample

heartbeat.monitors:
- type: exec
  command: "mycustomerchecker.sh"
  args: "--foo bar"
  schedule: "@every 1m"
  // A non-zero exit status also triggers failure
  require_output: "string I expect to be in output"

heartbeat.whitelist:
- /opt/mycustomchecker.sh

Approach/Phase 2: nagios check

A new check type based on the Nagios Plugin API to support that broad universe of plugins. We can potentially coordinate with @PhaedrusTheGreek who maintains https://github.com/PhaedrusTheGreek/nagioscheckbeat . He's enthusiastic about the idea of adding this functionality directly to heartbeat, and reports that there's well tested and factored code we could borrow here.

heartbeat.monitors:

type: nagios command: "mynagiosplugin" args: "--foo bar" schedule: "@every 1m"

heartbeat.whitelist:

/opt/mynagiosplugin

Describe alternatives you've considered

I've considered not using the Nagios plugin API, but rather a custom API. However, given the wide universe of nagios plugins, and the simplicity of the spec it would make for a good starting

Security Strategy

Executing arbitrary binaries over something that may be centrally configured, such as heartbeat, comes with some security risks. We should go in-depth here before implementing, defining a thorough threat model since the potential for attack here is high.

Some initial thoughts:

We need to whitelist what commands can be run.
1. One idea is to have a folder that is a sibling of heartbeat.yml, called say custom-checks, that we can check for the proper perms before using. Users could simply add scripts to this folder. We'd require that the perms in the directory only allow writes from users who are not the heartbeat user.
2. Having a whitelist config file, heartbeat.command-whitelist.yml, containing a list of commands heartbeat is allowed to execute as configured monitors. This should be a separate file so that users don't accidentally give users perms to write to it via fleet or central management.
Ensuring all externally executed commands run as a UID other than 0. Heartbeat is often installed as root to allow for ICMP ping. We'd need to run a wrapper before any executable (or perhaps simply invoke the sudo command), to insure that we are running as an alternate UID.

exekias commented 4 years ago

This super interesting, we have been discussed this option also for a while in the Metricbeat context

fearful-symmetry commented 4 years ago

This is a really great idea! At my last job we used nagios heavily, and it was universally despised for a number of reasons.

Setting the UID properly will be important. However, I would be wary of making whitelists/script config overcomplicated in the name of security. At my last job we had a huge and ever-growing list of custom script-based checks, if someone needs to tweak more than one or two things in order to migrate a given check, it'll be a pretty painful rollout process.

andrewvc commented 4 years ago

@fearful-symmetry thanks for the support!

You're right that a strict whitelist could be a stumbling block for a lot of users, so we'll have to do an in-depth threat model to determine what the right balance is here.

ph commented 4 years ago

I do understand the flexibility it provides, the possibilities of composition and the reusability of existing users scripts when deploying Heartbeat. But the security aspect worries me a bit, especially in the context of Central management or fleet.

Now there a few things in my mind.

Blocking that feature when the configuration is remotely managed but this effectively kill the feature in that deployment context. Which might be often the case with the move to remotely managed nodes.
"Accept List" or sandboxed folder is not really enough without having checksum/signature of the scripts we cannot enforce that what we are executing is really the "right" binaries and no tampering occurred. (even more serious with remote managing)
For macOS and windows which require us to sign binaries, this requirement might also reduce what can be executed or requiring a user to disable the security feature.
Seccomp policy will need to be changed to allow that.

Can I try to move the requirements?

Do we really want users to run arbitrary commands, scripts, binaries with diminished or root privileges?
Or we want an easier way for creating a monitor that doesn't require go?
Could we achieve that with a sandboxed API (js or lua based) (also vulnerable to other security issues)

Making secure the execution of an arbitrary process is hard.

andrewvc commented 4 years ago

@ph thanks for the considered response. I need some time to think about some of the security points. WRT a sandboxed API, there isn't a way I can see of using one to do the use cases mentioned in this issue such as:

SSHing to a remote server, then executing an arbitrary command there
Invoking a special tool and checking its result
Connecting to a remote database and checking its result using say psql

While it is true we could add client libraries for these tasks, the goal here is flexibility, the ability to say "Well, we don't support that directly, but you can do it yourself with a small script". For a lot of our audience, even if these problems were solvable with JS or Lua, those languages are beyond their technical ability.

andrewvc commented 4 years ago

I'll add here that Nagios has existed for a long time with this model, and there aren't complaints about its security, so I think it is doable.

We could also advise that when enabling this feature it be done in the context of a secure VM as well, where gaining root would have minimal impact.

ph commented 4 years ago

I'll add here that Nagios has existed for a long time with this model, and there aren't complaints about its security, so I think it is doable.

This is true for Nagios. Is there a remote management tools for that? I am concerned about the threat model here.

We could also advise that when enabling this feature it be done in the context of a secure VM as well, where gaining root would have minimal impact.

We can do that, yes.

Also, running in containers might reduce the risk and possible escalation.

I am going to give more thoughts about it.

exekias commented 4 years ago

It's clear that the combination of any kind of script execution and fleet represents an attack vector. That said, I think we can come up with good defaults that are safe enough, and allow the user to override them with the proper warnings. For instance:

Disable scripts by default when running on Fleet, require some specific setting on the client side to enable it.
Drop privileges and run scripts as a non-root user

ph commented 4 years ago

I think we need to consider this feature to possibly have the following impacts:

Privileges escalation.
Exposure of unwanted information.
DoS on a machine.
DoS on an external machine or network.

In the case of local configuration, I do not see problems with the above in having this feature when the admin with access on the machine takes the decision. We assume that each machine is secured independently. Now with the move to a remote configuration, we do increase the gain by compromising the control-plane (Kibana) or the data storage. So yes disabling that by default on both the Agent and on the Control plane needs to be done.

Drop privileges and run scripts as a non-root user

This can help with: 1, 2.

Maybe if we have a subcommand on the beats that you can do beats register myscripts --user X --group B which take more information (signature?) about the command and allow to be managed by the process. We can make it work, but I think we should add some kind of process even if its more cumbersome for the user.

I haven't look yet at other systems that permit this kind of behavior remotely.

There are limits in what we can secure by adding this execution model, other things we have to discuss as I've mentioned previously but I don't have all the answers:

How this fit with the signature of executables like in macOS catalina and Windows
How this fit with anti-malware execution mode on windows (self protection)

Also, let's say that we enable some of it to be controlled by the UI. We will certainly want some granularity on the users to either permit or not to add "script" to a data streams.

ph commented 4 years ago

cc @ruflin I think you need to be aware of that issue, see my points above.

ruflin commented 4 years ago

@andrewkroh Perhaps you can also chime in here because if I remember correctly you were thinking about something similar a few years ago.

andrewkroh commented 4 years ago

When we were last discussing a similar feature we were thinking at a minimum to have a security model similar to that of suEXEC (see the model at http://httpd.apache.org/docs/2.4/suexec.html). The only threat model we were considering was privilege escalation by a user that already had machine access.

In general there are more threat models to consider when pairing any Beat with Fleet as ph mentioned. For example, even without considering custom executable checks, there maybe URLs that Heartbeat should be prohibited from monitoring (via remote configuration), like link-local metadata services (leads to priv esc).

I think the idea of requiring the custom execution check to be registered locally along with some sandbox parameters could be effective. These sandbox parameters would be immutable via Fleet. For example you would register the script along with the restrictions that should be in place (like a list of paths it can access or a BPF filter expressing the allowed network egress traffic). (BTW systemd does sandboxing extremely well for things it executes. For some examples lookup IPAddressAllow, IPEgressFilterPath, ReadWritePaths).

andrewvc commented 4 years ago

@andrewkroh Would there be any issues if we favored convention over configuration? If we just had a folder full for whitelisted scripts at a directory that was /heartbeat/config/dir/scripts for instance.

Since that directory is only configurable from the CLI/init scripts it would be impossible to override via fleet AFAIK, and would be a lot less hassle then registering scripts etc. We would also make it impossible to write to via as log output (which is something we'd need to do regardless).

andrewkroh commented 4 years ago

I think it we only allow local scripts (no remote installation) and scope the choice of scripts to those in a dedicated directory this would help prevent an RCE. Someone that gets API access cannot remotely install and execute a script. And someone cannot execute arbitrary tools to give themselves a remote shell.

Consul had this problem at one point (it allows script checks for services). https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/

PhaedrusTheGreek commented 4 years ago

Nagios executables are usually installed by package manager into specific directories such as /usr/lib64/nagios/plugins. It make make sense for the whitelist setting to be an array.

ph commented 4 years ago

If we limit the number of scripts to local scripts, sanboxed to a directory, local whitelisting and not allow fleet to deliver the scripts to the machine. I think I will be OK with that.

I think we should also make it off by default in fleet and make it an opt-in feature, we could also make it opt-in per policy. @ruflin I think we have found a new beats specific configuration.

elastic / uptime

Custom executable checks #127