firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

[RFC] File-based seccomp filters #2209

Closed alindima closed 3 years ago

alindima commented 3 years ago

This issue is meant to be an RFC for a new Firecracker feature - file-based seccomp filter customisation.

Please leave below any comments/questions you may have. Any feedback is welcome!

Here is the issue tracking this feature: https://github.com/firecracker-microvm/firecracker/issues/1366

What is the problem?

Firecracker currently uses the Linux seccomp-bpf feature to prevent the process from making dangerous or unneeded syscalls to the underlying host OS. This is a powerful security mechanism that is put in place after the necessary Firecracker setup and just before running client code in the microVM.

Currently, the seccomp filters are hardcoded into the source code. This means that every modification to a filter requires a recompilation and, potentially, a new release. This adds up to a lot of unnecessary work.

This feature would also enable any Firecracker user to build seccomp support for their own platform and libc toolchain. This should be particularly useful for solving glibc compatibility issues, where the list of required syscalls is highly dependent on the version.

By implementing this feature, we also enable the following seccomp improvements:

Proposed solution

We will expose a way for users to optionally provide custom seccomp filters for the Firecracker process, in the form of external files. These files will contain serialized binary BPF code that will get installed in the kernel.

Changes in Firecracker’s API

Currently, the --seccomp-level parameter takes three options: 0 (no filtering), 1 (basic filtering - matches on syscall names), 2 (advanced filtering - matches on syscall names and arguments, the default).

The proposal is to remove the --seccomp-level parameter, and replace it with a combination of two other parameters:

--no-seccomp # seccomp is disabled, equivalent to level 0
--seccomp-filter "/path/to/bpf" # custom filters, at the given path

By default, when neither parameter is specified, Firecracker will use the advanced filters, just as before (equivalent to level 2).

This implies removing the option of basic filtering (level 1). The level 1 option is still achievable via custom filter files.

These are breaking API changes, but will not impact customers that are using the default filters, without passing any additional seccomp arguments to Firecracker. We decided to replace the --seccomp-level argument in order to get the best UX (re-using the level parameter would be counter-intuitive, with options 0, 2 or path).
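
The mapping from the two new flags to a seccomp mode can be sketched as follows. This is only an illustration: the `SeccompConfig` type and the `resolve_seccomp` function are hypothetical names, and the sketch assumes the two flags are mutually exclusive, which the RFC does not state explicitly.

```rust
use std::path::PathBuf;

/// How the Firecracker process should set up seccomp (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SeccompConfig {
    /// `--no-seccomp`: no filtering, equivalent to the old level 0.
    Disabled,
    /// No flag given: built-in advanced filters, equivalent to the old level 2.
    Default,
    /// `--seccomp-filter <path>`: custom compiled filter file.
    Custom(PathBuf),
}

/// Resolves the two proposed flags into a single configuration.
/// Assumption: passing both flags at once is rejected as an error.
pub fn resolve_seccomp(
    no_seccomp: bool,
    filter_path: Option<&str>,
) -> Result<SeccompConfig, String> {
    match (no_seccomp, filter_path) {
        (true, Some(_)) => {
            Err("--no-seccomp and --seccomp-filter cannot be combined".to_string())
        }
        (true, None) => Ok(SeccompConfig::Disabled),
        (false, Some(path)) => Ok(SeccompConfig::Custom(PathBuf::from(path))),
        (false, None) => Ok(SeccompConfig::Default),
    }
}
```

Collapsing the flags into one enum keeps the "0, 2 or path" ambiguity out of the rest of the code: callers match on three explicit cases instead of interpreting a level number.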

Seccompiler

We will build a new tool, called seccompiler, as a separate crate with both a binary and a library interface. We decided to implement the tool in the Firecracker repo, exporting another executable from the cargo workspace. We have the option of migrating the project to its own standalone repository in the future, since it is more of a general-purpose jailing solution.

The seccompiler binary will receive as input a JSON file, containing the filters. It will output the corresponding binary file that contains the serialized BPF for Firecracker to consume.

The seccompiler library crate will be used by Firecracker to deserialize the binary BPF file and to install the filter, via helper functions.
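
To give an idea of what deserializing a BPF program involves, here is a minimal sketch. It assumes the file is a flat array of 8-byte, native-endian instructions laid out like the kernel's `struct sock_filter`; the RFC does not define the on-disk format, so the layout and the `deserialize_bpf` name are assumptions for illustration only.

```rust
/// One classic BPF instruction, mirroring the kernel's `struct sock_filter`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SockFilter {
    pub code: u16, // opcode
    pub jt: u8,    // jump offset if the condition is true
    pub jf: u8,    // jump offset if the condition is false
    pub k: u32,    // generic operand
}

/// Size in bytes of one serialized instruction in this sketch.
const INSTRUCTION_SIZE: usize = 8;

/// Hypothetical decoder for a flat array of native-endian instructions.
/// Rejects empty input and input whose length is not a multiple of 8.
pub fn deserialize_bpf(bytes: &[u8]) -> Result<Vec<SockFilter>, String> {
    if bytes.is_empty() || bytes.len() % INSTRUCTION_SIZE != 0 {
        return Err(format!("invalid BPF blob length: {}", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(INSTRUCTION_SIZE)
        .map(|c| SockFilter {
            code: u16::from_ne_bytes([c[0], c[1]]),
            jt: c[2],
            jf: c[3],
            k: u32::from_ne_bytes([c[4], c[5], c[6], c[7]]),
        })
        .collect())
}
```

A real implementation would likely add further validation before handing the program to the kernel, since the file comes from outside the Firecracker binary.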

In order to embed the generated BPF blobs in Firecracker’s source code, we will implement an automated mechanism based on cargo build scripts. Whenever the JSON files are modified, the script will automatically embed the resulting BPF code into Firecracker. This way, we will make sure that every binary has the right, most up-to-date allowlist.

The seccompiler binary crate will be part of the top-level cargo workspace and will be compiled using a command similar to this: cargo build -p seccompiler. We can also add a devtool command similar to: devtool build-seccompiler.

Seccompiler command line parameters

./seccompiler 
    --target-arch "x86_64"  # needed for embedding the arch validation in BPF
    --input-file "x86_64_musl.json" # name of the JSON input file
    --output-file "bpf_x86_64_musl" # optional, name of the output file
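
A rough sketch of how these parameters could be parsed is shown below. It assumes that `--target-arch` and `--input-file` are required (only `--output-file` is marked optional above) and uses a placeholder default output name; `Args`, `parse_args` and `DEFAULT_OUTPUT` are hypothetical and not part of the proposal.

```rust
/// Parsed seccompiler arguments (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct Args {
    pub target_arch: String,
    pub input_file: String,
    pub output_file: String,
}

/// Placeholder default; the RFC only says `--output-file` is optional.
pub const DEFAULT_OUTPUT: &str = "seccomp_binary_filter.out";

/// Parses `--flag value` pairs, excluding the program name.
pub fn parse_args(argv: &[&str]) -> Result<Args, String> {
    let (mut arch, mut input, mut output) = (None, None, None);
    let mut it = argv.iter();
    while let Some(flag) = it.next() {
        // Every flag in this sketch takes exactly one value.
        let value = it
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        match *flag {
            "--target-arch" => arch = Some(value.to_string()),
            "--input-file" => input = Some(value.to_string()),
            "--output-file" => output = Some(value.to_string()),
            other => return Err(format!("unknown flag: {}", other)),
        }
    }
    Ok(Args {
        target_arch: arch.ok_or("--target-arch is required")?,
        input_file: input.ok_or("--input-file is required")?,
        output_file: output.unwrap_or_else(|| DEFAULT_OUTPUT.to_string()),
    })
}
```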

Releasing seccompiler

Since we are implementing seccompiler as a crate in the Firecracker repository, we will include its binary releases in Firecracker’s releases. This means having an additional release artifact, the compiler binary, for each supported target. For simplicity, its version will be the same as Firecracker’s version (the same way the jailer crate is managing versions).

We will also include the JSON filters used for building the Firecracker binary as release artifacts.

JSON file format

With regard to the file format, we decided to go with JSON, because of its popularity and expressiveness.

When designing the input file format, we strive for maximum flexibility and ease of use, while trying to keep things concise. For simplicity, the file is specific to one architecture and toolchain. This means that Firecracker needs to have a file for each supported target (determined by the supported arch-libc combinations). The default JSON files will live in the Firecracker repository.

In the file format we only use syscall names, not numbers, for usability reasons and to abstract away the architecture-specific syscall table.
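
To illustrate why names are preferable, here is a small sketch of a name-to-number lookup for two architectures. The `syscall_number` function is hypothetical and lists only a handful of entries; the numbers are taken from the Linux syscall tables for x86_64 and aarch64.

```rust
/// Looks up a syscall number by name for one architecture.
/// Only a few entries are listed; a real table would be complete.
pub fn syscall_number(arch: &str, name: &str) -> Option<i64> {
    match (arch, name) {
        ("x86_64", "SYS_open") => Some(2),
        ("x86_64", "SYS_close") => Some(3),
        ("x86_64", "SYS_exit") => Some(60),
        ("x86_64", "SYS_fcntl") => Some(72),
        ("x86_64", "SYS_futex") => Some(202),
        ("x86_64", "SYS_accept4") => Some(288),
        ("aarch64", "SYS_fcntl") => Some(25),
        ("aarch64", "SYS_close") => Some(57),
        ("aarch64", "SYS_exit") => Some(93),
        ("aarch64", "SYS_futex") => Some(98),
        ("aarch64", "SYS_accept4") => Some(242),
        // aarch64 has no `open` syscall (only `openat`), so SYS_open is absent.
        _ => None,
    }
}
```

The same name resolves to different numbers on each architecture, and some names do not exist everywhere, which is one reason a filter file is tied to a single target.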

At the top level, the file requires an object that maps thread categories (Vmm, Api & Vcpu) to seccomp filters:

{
    "Vmm": {
        "default_action": {
            "Errno": -1
        },
        "filter_action": "Allow",
        "filter": [...]
    },
    "Api": {...},
    "Vcpu": {...}
}

The associated filter is a JSON object containing the default_action, filter_action and filter. The default_action represents the action we have to execute if none of the rules in filter matches, and filter_action is what gets executed if a rule in the filter matches (e.g. "Allow" in the case of implementing an allowlist).

An action is the JSON representation of the following Rust enum:

pub enum SeccompAction {
    Allow, // Allows syscall.
    Errno(u32), // Returns from syscall with specified error number.
    Kill, // Kills calling process.
    Log, // Same as allow but logs call.
    Trace(u32), // Notifies tracing process of the caller with respective number.
    Trap, // Sends `SIGSYS` to the calling process.
}
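
For illustration, each action could be translated into the 32-bit value that a seccomp BPF program returns to the kernel, roughly as sketched below. The constants are the `SECCOMP_RET_*` values from the Linux UAPI headers; the `action_to_ret` function is a hypothetical helper, and mapping `Kill` to `SECCOMP_RET_KILL_PROCESS` (rather than the thread-level kill) is an assumption based on the comment in the enum.

```rust
pub enum SeccompAction {
    Allow,
    Errno(u32),
    Kill,
    Log,
    Trace(u32),
    Trap,
}

// Return values from the Linux seccomp UAPI header.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_TRAP: u32 = 0x0003_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_TRACE: u32 = 0x7ff0_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
// Only the low 16 bits carry data (the errno or the tracer message).
const SECCOMP_RET_DATA: u32 = 0x0000_ffff;

/// Hypothetical helper: converts an action into a BPF return value.
pub fn action_to_ret(action: &SeccompAction) -> u32 {
    match action {
        SeccompAction::Allow => SECCOMP_RET_ALLOW,
        SeccompAction::Errno(errno) => SECCOMP_RET_ERRNO | (*errno & SECCOMP_RET_DATA),
        // Assumption: "kills calling process" means the process-wide kill.
        SeccompAction::Kill => SECCOMP_RET_KILL_PROCESS,
        SeccompAction::Log => SECCOMP_RET_LOG,
        SeccompAction::Trace(data) => SECCOMP_RET_TRACE | (*data & SECCOMP_RET_DATA),
        SeccompAction::Trap => SECCOMP_RET_TRAP,
    }
}
```

Because only 16 bits of data fit in the return value, the number carried by `Errno` or `Trace` is truncated in this sketch even though the enum stores a `u32`.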

The filter property specifies the set of rules that would trigger a match. This is an array containing multiple or-bound objects (if one of them matches, the corresponding action gets triggered).

These or-bound objects can refer either to one syscall or multiple syscalls:

  1. Multiple syscalls object:

This object is used to specify an action that is triggered by any syscall from a given list. You cannot specify any parameter checks, since it doesn’t make sense to have the same checks for different syscalls. The recommended use-case for this object is for quickly specifying obvious syscalls that all have the same on-match action.

The structure of the object is as follows:

{
    "syscalls": ["SYS_open", "SYS_exit"], // mandatory, vector of syscall names
    "action": "Allow", // optional, overrides the filter_action if present
    "comment": "Allowing obvious syscalls" // optional, for adding comments
}

  2. Single syscall object:

This object is used for adding a rule to a single syscall. It has an optional args property that is used to specify a vector of and-bound conditions that the syscall arguments must satisfy in order for the rule to match. In the absence of the args property, the corresponding action will get triggered by any call that matches that name, irrespective of the argument values.

Here is the general structure of the object:

{
    "syscall": "SYS_accept4", // mandatory, the syscall name
    "action": "Allow", // optional, overrides the filter_action if present
    "comment": "Used by vsock & api thread", // optional, for adding comments
    "args": [] // optional, vector of and-bound conditions for the parameters
}

In order to allow a syscall with multiple alternatives for the same parameters, you can write multiple syscall objects at the filter-level, each with its own rules.

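A condition object is made up of the following mandatory properties, all of which appear in the examples below: arg_index (the index of the syscall argument to check), arg_type (the size of the argument, e.g. "DWORD"), op (the comparison operator, e.g. "Eq" or "MaskedEq") and val (the value to compare against).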

The JSON file does not support named parameters, only numeric constants. You may, however, add an optional comment property to each condition object. This way, you can give meaning to each numeric value, much like when using named parameters:

{
    "syscall": "SYS_accept4",
    "args": [
        {
            "arg_index": 3,
            "arg_type": "DWORD",
            "op": "Eq",
            "val": 1,
            "comment": "libc::AF_UNIX"
        }   
    ]
}
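
The matching semantics described above (or-bound rules, and-bound conditions) can be modelled in plain Rust as sketched below. This is a user-space illustration only, not how the BPF code is generated. The `Op`, `Condition`, `Rule` and `filter_matches` names are hypothetical, arg_type is left out for brevity, and MaskedEq is assumed to compare `arg & mask` with `val & mask`.

```rust
/// Comparison operators used in this sketch (a subset).
pub enum Op {
    Eq,
    MaskedEq(u64),
}

/// One condition on a single syscall argument.
pub struct Condition {
    pub arg_index: usize,
    pub op: Op,
    pub val: u64,
}

/// A single-syscall rule: the name plus and-bound conditions.
pub struct Rule {
    pub syscall: &'static str,
    pub args: Vec<Condition>,
}

impl Condition {
    /// Returns true if the argument at `arg_index` satisfies the condition.
    fn matches(&self, args: &[u64; 6]) -> bool {
        let arg = match args.get(self.arg_index) {
            Some(arg) => *arg,
            None => return false, // index out of range never matches
        };
        match &self.op {
            Op::Eq => arg == self.val,
            // Assumed semantics: compare only the bits selected by the mask.
            Op::MaskedEq(mask) => (arg & *mask) == (self.val & *mask),
        }
    }
}

/// Rules are or-bound: one matching rule is enough.
/// Conditions inside a rule are and-bound: all of them must hold.
pub fn filter_matches(filter: &[Rule], syscall: &str, args: &[u64; 6]) -> bool {
    filter
        .iter()
        .any(|rule| rule.syscall == syscall && rule.args.iter().all(|c| c.matches(args)))
}
```

A rule with an empty `args` vector matches every call to that syscall, because `all` over an empty iterator is true. Two rules for the same syscall act as alternatives, which is how multiple accepted argument combinations are expressed.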

An example filter looks like this:

"Vmm": {
    "default_action": {
        "Errno": 2
    },
    "filter_action": "Allow",
    "filter": [
        {
            "syscall": "SYS_accept4",
            "action": "Allow",
            "comment": "Used by vsock & api thread"
        },
        {
            "syscalls": [
                "SYS_open",
                "SYS_exit",
                "SYS_close"
            ],
            "comment": "Obvious"
        },
        {
            "syscall": "SYS_fcntl",
            "comment": "used by snapshotting, drive patching and rescanning",
            "args": [
                {
                    "arg_index": 1,
                    "arg_type": "DWORD",
                    "op": {
                        "MaskedEq": 33
                    },
                    "val": 999
                },
                {
                    "arg_index": 2,
                    "arg_type": "DWORD",
                    "op": "Eq",
                    "val": 999
                }
            ]
        },
        {
            "syscall": "SYS_futex",
            "args": [
                {
                    "arg_index": 1,
                    "arg_type": "DWORD",
                    "op": "Eq",
                    "val": 789
                },
                {
                    "arg_index": 1,
                    "arg_type": "DWORD",
                    "op": "Eq",
                    "val": 567
                }
            ]
        },
        {
            "syscall": "SYS_futex",
            "args": [
                {
                    "arg_index": 2,
                    "arg_type": "DWORD",
                    "op": "Eq",
                    "val": 999
                }
            ]
        }
    ]
}

alindima commented 3 years ago

cc: @sameo

alindima commented 3 years ago

Closing this RFC. The initial PR with the implementation was merged into the feature/file_based_seccomp branch.