anitsh / til

Today I Learned (til): GitHub `Issues` used as a daily learning management system for taking notes and storing resource links.
https://anitshrestha.com.np
MIT License

Linux Security Infrastructure #125

Open anitsh opened 4 years ago

anitsh commented 4 years ago

Linux Security Infrastructure

In a typical operating system (OS), applications run unmonitored by default, which makes it difficult to determine what is happening on the system.

Figure: Privilege rings for the x86 microprocessor architecture available in protected mode. Operating systems determine which processes run in each mode.

'''Enforcing security goes hand in hand with knowing what you are protecting yourself against, or at least what you are protecting. All that, however, must start with a security policy. Once the policy is formulated, the choice is easier to make. Let's consider a few cases, which may or may not reflect your use cases. But before considering them, it may be helpful to narrow down the available options to the following:

1. If you cannot modify the target program, the only option you have is MAC (Mandatory Access Control), at best, since all you need to do is provide a security policy by which a given program will be evaluated. The program and its implementation language do not matter, since the operating system's security layers take care of enforcement. Sandboxes are another option, but they fall into category 3.
2. If you are [re-]writing (or can modify) your program: good luck then, because you actually can write a security-aware program by invoking certain security primitives provided by a framework of your choice [more below]. However, you may be limited in that the framework you choose does not have any bindings in your programming language, as most of them tend to be low-level. Is it C, Java, Go, or Javascript (yeah, as if!)? You may be on your own in most cases, so, welcome to the club. But there are options. And silly ones sometimes.
3. You don't care and just want to keep things from exploding in your face (i.e. none of the above is worth your time): sandboxes then? Maybe something as dramatic as a VM, or if you are in the mood for the unknown, application containers are your best friends. However, the most security-savvy minds out there do warn that containers do not contain. But you still at least benefit from some resource isolation without having to know the internals of your programs. Unprivileged containers are strongly recommended. But otherwise, VMs are still cool.

And now, the cases:

I want to run a program in a way that is immune to malicious exploitation: since that means any attack, known or unknown, nothing is guaranteed to satisfy your requirement (every security framework should tell you this). But at least restricting your program to the fewest possible system calls reduces its attack surface, although you can still be attacked through whatever remains, even if that is only 1% of the possible ways. If your goal is to strip privileges off applications to make them less useful to an attacker, system call filtering and capability systems will be helpful. Seccomp (available on Linux), Capsicum (available on FreeBSD, and soon on Linux), and POSIX capabilities are options here.

I want to restrict my programs to known/expected behavior: easy (for simple programs). If you can define runtime behavior in terms of accessed files or kernel objects, then MAC frameworks can help you (AppArmor, SELinux, ...). However, you will also have to choose the level of abstraction at which you express your policies, so precision and flexibility will pull you in different directions (are paths precise enough, are inodes manageable for you, what about memory segments?).

I often find myself needing to run programs with elevated privileges and worry it may be too risky: we've all been there. Dropping capabilities is probably the most reasonable option. It is OS-specific (and so are all the others), but at least implementation-independent if done as part of access control on a given machine (as opposed to invoking the capabilities interface programmatically, which will depend on available bindings in your language).'''

Linux Security Modules (LSM)

Linux Security Modules (LSM) is a framework that allows the Linux kernel to support a variety of computer security models while avoiding favouritism toward any single security implementation. AppArmor, SELinux, Smack, and TOMOYO Linux are the currently accepted modules in the official kernel.

Mandatory Access Control (MAC)

Mandatory access control is a type of access control by which the operating system or database constrains the ability of a subject or initiator to access or generally perform some sort of operation on an object or target.[1] In the case of operating systems, a subject is usually a process or thread; objects are constructs such as files, directories, TCP/UDP ports, shared memory segments, IO devices, etc. Subjects and objects each have a set of security attributes. Whenever a subject attempts to access an object, an authorization rule enforced by the operating system kernel examines these security attributes and decides whether the access can take place. Any operation by any subject on any object is tested against the set of authorization rules (aka policy) to determine if the operation is allowed. A database management system, in its access control mechanism, can also apply mandatory access control; in this case, the objects are tables, views, procedures, etc.

With mandatory access control, this security policy is centrally controlled by a security policy administrator; users do not have the ability to override the policy and, for example, grant access to files that would otherwise be restricted. By contrast, discretionary access control (DAC), which also governs the ability of subjects to access objects, allows users the ability to make policy decisions and/or assign security attributes. (The traditional Unix system of users, groups, and read-write-execute permissions is an example of DAC.) MAC-enabled systems allow policy administrators to implement organization-wide security policies. Under MAC (and unlike DAC), users cannot override or modify this policy, either accidentally or intentionally. This allows security administrators to define a central policy that is guaranteed (in principle) to be enforced for all users.

Mandatory Access Controls

A MAC is a framework for defining what a program can and cannot do, on a whitelist basis. A program is represented as a subject. Anything the program wants to act on, such as a file, path, network interface, or port is represented as an object. The rules for accessing the object are called the permission, or flag. Take the AppArmor policy for the ping utility, with added comments:

#include <tunables/global>

/bin/ping {
  # use header files containing more rules
  #include <abstractions/base>
  #include <abstractions/consoles>
  #include <abstractions/nameservice>

  capability net_raw,  # allow having CAP_NET_RAW
  capability setuid,   # allow being setuid
  network inet raw,    # allow creating raw sockets

  /bin/ping mixr,      # allow mmaping, executing, and reading
  /etc/modules.conf r, # allow reading
}

With this policy in place, the ping utility, if compromised, cannot read from your home directory, execute a shell, write new files, etc. This kind of sandboxing is used for securing a server or workstation. Other than AppArmor, some popular MACs include SELinux, TOMOYO, and SMACK. These are typically implemented in the kernel as a Linux Security Module, or LSM. This is a subsystem under Linux that provides modules with hooks for various actions (like changing credentials and accessing objects) so they can enforce a security policy.

Discretionary Access Control (DAC)

Discretionary access control is a type of access control defined by the Trusted Computer System Evaluation Criteria[1] "as a means of restricting access to objects based on the identity of subjects and/or groups to which they belong. The controls are discretionary in the sense that a subject with a certain access permission is capable of passing that permission (perhaps indirectly) on to any other subject (unless restrained by mandatory access control)".

Discretionary access control is commonly discussed in contrast to mandatory access control (MAC). Occasionally a system as a whole is said to have "discretionary" or "purely discretionary" access control as a way of saying that the system lacks mandatory access control. On the other hand, systems can be said to implement both MAC and DAC simultaneously, where DAC refers to one category of access controls that subjects can transfer among each other, and MAC refers to a second category of access controls that imposes constraints upon the first.
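As a small illustration of the "discretionary" part, the sketch below assumes a Linux system with a `stat` that supports `-c` (GNU coreutils or BusyBox) and works on a throwaway temporary file. The function name `dac_demo` is made up for this note. The file's owner tightens and then relaxes the mode bits without involving any administrator, which is the kind of decision a MAC policy would not leave to the user.

```shell
# DAC sketch: the owner of a file decides who may access it.
dac_demo() (
    f=$(mktemp)          # temporary file owned by the current user
    chmod 600 "$f"       # the owner restricts access to themselves...
    stat -c '%a' "$f"    # prints the octal mode bits
    chmod 644 "$f"       # ...and can relax it again at their own discretion
    stat -c '%a' "$f"
    rm -f "$f"
)

dac_demo
```

Under a MAC system such as AppArmor or SELinux, a confined process could still be denied access to this file regardless of the mode bits chosen here.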

Kernel Security Tools

OR

Other

Chrooting bash, for example, would involve putting any executables and libraries it needs into the new directory, and running the chroot utility (which itself just calls the syscall of the same name):

host ~ # ldd /bin/bash
        linux-vdso.so.1 (0x0000036b3fb5a000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x0000036b3f6e5000)
        libncurses.so.6 => /lib64/libncurses.so.6 (0x0000036b3f47e000)
        libc.so.6 => /lib64/libc.so.6 (0x0000036b3f0bc000)
        /lib64/ld-linux-x86-64.so.2 (0x0000036b3f938000)
host ~ # ldd /bin/ls
        linux-vdso.so.1 (0x000003a093481000)
        libc.so.6 => /lib64/libc.so.6 (0x000003a092e9d000)
        /lib64/ld-linux-x86-64.so.2 (0x000003a09325f000)
host ~ # mkdir -p newroot/{lib64,bin}
host ~ # cp -aL /lib64/{libreadline,libncurses,libc}.so.6 newroot/lib64
host ~ # cp -aL /lib64/ld-linux-x86-64.so.2 newroot/lib64
host ~ # cp -a /bin/{bash,ls} newroot/bin
host ~ # pwd
/root
host ~ # chroot newroot /bin/bash
bash-4.3# pwd
/
bash-4.3# ls
bin  lib64
bash-4.3# ls /bin
bash  ls
bash-4.3# id
bash: id: command not found

Only a process with the CAP_SYS_CHROOT capability is able to enter a chroot. This is necessary to prevent a malicious program from creating its own copy of /etc/passwd in a directory it controls, and chrooting into it with a setuid program like su, tricking the binary into giving them root.
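The library list copied by hand in the transcript above can also be derived from the `ldd` output. The following is only a sketch: `parse_ldd` is a hypothetical helper name, and it assumes the `ldd` output format shown in the transcript (an absolute path appears either after `=>` or as the first field).

```shell
# Print the first absolute path on each line of ldd output.
# Lines without a path (such as linux-vdso.so.1) print nothing.
parse_ldd() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }'
}

# Sample input copied from the "ldd /bin/ls" output above.
printf '%s\n' \
    '        linux-vdso.so.1 (0x000003a093481000)' \
    '        libc.so.6 => /lib64/libc.so.6 (0x000003a092e9d000)' \
    '        /lib64/ld-linux-x86-64.so.2 (0x000003a09325f000)' | parse_ldd
```

On a real system the input would come from `ldd /bin/ls | parse_ldd`, and each printed path could then be copied into the matching directory under the new root.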

Based on what they do:

Overall, these products can be grouped into ones focused on enforcement vs auditing. Both groups define a policy that describes the allowed or disallowed behavior for a process, in terms of system calls, their arguments, and host resources accessed. Enforcement tools use the policy to change the behavior of a process by preventing system calls from succeeding, or in some cases, killing the process. Seccomp, seccomp-bpf, SELinux, and AppArmor are examples of enforcement tools. Auditing tools use the policy to monitor the behavior of a process and notify when its behavior steps outside the policy. Auditd and Falco are examples of auditing tools. (Falco does allow taking actions on alerts via its command execution notification channel, so it has limited enforcement capabilities, but it is not intended to be used as an enforcement tool).

Sandboxing

At its most basic, sandboxing is a technique to minimize the effect a program will have on the rest of the systems in the case of malice or malfunction. This can be for testing or for enhancing the security of a system. The reason one might want to use a sandbox also varies, and in some cases it is not even related to security, for example in the case of OpenBSD's systrace. The main uses of a sandbox are:

There are many sandboxing techniques, all with differing threat models. Some may just reduce attack surface area by limiting APIs that can be used, while others define access controls using formalized models similar to Bell-LaPadula or Biba.

Resource

Related: #112 #427

anitsh commented 3 years ago

Linux Capabilities

Capabilities break up root privileges into smaller units, so full root access is no longer needed. Most binaries that have the setuid flag can be changed to use capabilities instead. Capabilities are maintained by the kernel.

Security of Linux systems and applications can be greatly improved by using hardening measures. One of these measures is Linux capabilities, which the kernel has supported for quite some time. Using capabilities we can strengthen applications and containers.

Capabilities are a great way to split up root permissions and hand out some permissions to non-privileged users. Unfortunately, many binaries still have the setuid bit set when they could use capabilities instead.

Normally the root user (or any ID with UID of 0) gets special treatment when running processes. The kernel and applications are usually programmed to skip the restriction of some activities when seeing this user ID. In other words, this user is allowed to do (almost) anything.

Linux capabilities provide a subset of the available root privileges to a process. This effectively breaks up root privileges into smaller and distinct units. Each of these units can then be granted to processes independently. This way the full set of privileges is reduced, decreasing the risk of exploitation.

Capabilities can be thought of as broad classes of privileged functionality that can be selectively removed from a process or user. The specific functions that have capability checks vary from kernel version to kernel version, and there is often bickering between kernel developers over whether or not a given function should require capabilities to run. Generally, reducing capabilities from a process improves security by reducing the number of privileged actions it can perform. Note that some capabilities are considered root-equivalent, meaning that, even if you disable all other capabilities, they can, in some conditions, be used to regain full permissions.


# See the highest capability number for your kernel. The number of capabilities supported by recent Linux versions is close to 40.
 cat /proc/sys/kernel/cap_last_cap

# View the list of available Linux capabilities for the active kernel.
 capsh --print

# View the current shell's capabilities ($$ is the shell's PID). The command should return 5 capability sets as hexadecimal bitmasks.
 grep Cap /proc/$$/status

    # CapInh = Inheritable capabilities
    # CapPrm = Permitted capabilities
    # CapEff = Effective capabilities
    # CapBnd = Bounding set
    # CapAmb = Ambient capabilities set

# Decode a hexadecimal bitmask into capability names.
 capsh --decode=0000003fffffffff

# See the capabilities of a running process.
# The getpcaps tool uses the capget() system call to query the capabilities of a particular thread. The caller only needs to provide the PID.
 getpcaps PROCESS_ID

# See the capabilities of a set of related processes.
 getpcaps $(pgrep nginx)

# Drop CAP_NET_RAW from the bounding set, then try to run ping without it.
 capsh --drop=cap_net_raw --print -- -c "/bin/ping -c 1 localhost"
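To make the hexadecimal values less opaque, here is a rough sketch of what decoding a mask involves: each set bit is one capability number. `cap_bits` is a name invented for this note, it expects the mask without a `0x` prefix (the form used in /proc), and it only lists bit positions. Mapping numbers to names is left to `capsh --decode`.

```shell
# List the bit positions that are set in a hexadecimal capability mask.
# The body runs in a subshell so its variables do not leak.
cap_bits() (
    mask=$(( 0x$1 ))
    bit=0
    out=""
    while [ "$bit" -lt 64 ]; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="$out $bit"
        fi
        bit=$(( bit + 1 ))
    done
    echo "${out# }"
)

cap_bits 0000000000002000   # bit 13 is CAP_NET_RAW in <linux/capability.h>
```

Applied to the mask 0000003fffffffff used above, the function lists bits 0 through 37, which matches a kernel whose cap_last_cap is 37.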

Resources

anitsh commented 3 years ago

Namespace

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, hostnames, user IDs, file names, and some names associated with network access, and interprocess communication. Namespaces are a fundamental aspect of containers on Linux. The term "namespace" is often used for a type of namespace (e.g. process ID) as well as for a particular space of names. A Linux system starts out with a single namespace of each type, used by all processes. Processes can create additional namespaces and join different namespaces. Namespaces are created with the "unshare" command or syscall, or as new flags in a "clone" syscall. Namespaces do not restrict access to physical resources such as CPU, memory and disk. That access is metered and restricted by a kernel feature called ‘cgroups’.

Types

Since kernel version 5.6, there are 8 kinds of namespaces. Namespace functionality is the same across all kinds: each process is associated with a namespace and can only see or use the resources associated with that namespace, and descendant namespaces where applicable. This way each process (or process group thereof) can have a unique view on the resources. Which resource is isolated depends on the kind of namespace that has been created for a given process group.

The seven namespace kinds that predate the time namespace (the eighth kind, added in kernel 5.6) are:

  • Mount - isolate filesystem mount points
  • UTS - isolate hostname and domainname
  • IPC - isolate interprocess communication (IPC) resources
  • PID - isolate the PID number space
  • Network - isolate network interfaces
  • User - isolate UID/GID number spaces
  • Cgroup - isolate cgroup root directory

An example of PID namespaces using the unshare utility:

host ~ # echo $$
25688
host ~ # unshare --fork --pid
host ~ # echo $$
1
host ~ # logout
host ~ # echo $$
25688

While these can be used to augment sandboxing or even be used as an integral part of a sandbox, some of them can reduce security. User namespaces, when unprivileged (the default), expose a much greater attack surface area from the kernel. Many kernel vulnerabilities are exploitable by unprivileged processes when the user namespace is enabled. On some kernels, you can disable unprivileged user namespaces by setting kernel.unprivileged_userns_clone to 0.

Implementation Details

The kernel assigns each process a symbolic link per namespace kind in /proc/<pid>/ns/. The inode number pointed to by this symlink is the same for each process in this namespace. This uniquely identifies each namespace by the inode number pointed to by one of its symlinks.

Reading the symlink via readlink returns a string containing the namespace kind name and the inode number of the namespace.
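The string returned by readlink has the form `kind:[inode]`. As a sketch, the two helpers below split such a string. The names `ns_kind` and `ns_inode` are made up for this note, and the inode number in the example is only a sample value.

```shell
# Extract the namespace kind (the text before the colon).
ns_kind() {
    echo "${1%%:*}"
}

# Extract the inode number (the text between the square brackets).
ns_inode() (
    inode=${1#*:\[}
    echo "${inode%\]}"
)

ns_kind  'pid:[4026531836]'
ns_inode 'pid:[4026531836]'
```

On a live system the input would be something like `"$(readlink /proc/self/ns/pid)"`. Two processes are in the same namespace of a given kind exactly when these strings are equal.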

Syscalls

Three syscalls can directly manipulate namespaces:

  • clone, flags to specify which new namespace the new process should be migrated to.
  • unshare, allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads)
  • setns, enters the namespace specified by a file descriptor.

Destruction

If a namespace is no longer referenced, it will be deleted. The handling of the contained resource depends on the namespace kind. Namespaces can be referenced in three ways:

  • by a process belonging to the namespace
  • by an open file descriptor to the namespace's file (/proc/<pid>/ns/)
  • by a bind mount of the namespace's file (/proc/<pid>/ns/)

Usage

Various container software uses Linux namespaces in combination with cgroups to isolate processes, including Docker[12] and LXC. Other applications, such as Google Chrome, make use of namespaces to isolate their own processes which are at risk from attack on the internet.[13] There is also an unshare wrapper in util-linux. An example of its use is SHELL=/bin/sh unshare --fork --pid chroot "${chrootdir}" "$@"

A process can be created in Linux by the fork(), clone() or vfork() system calls. In order to support namespaces, 6 flags (CLONE_NEW*) were originally added; CLONE_NEWCGROUP and CLONE_NEWTIME followed later. These flags (or a combination of them) can be used in clone() or unshare() system calls to create a namespace.
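Combining flags simply means OR-ing their bit values. The sketch below uses the values of the six original flags from <linux/sched.h>. The function names `clone_flag` and `combine_flags` and the short kind names are invented for this note. It only computes the number that would be passed to clone() or unshare(); it does not create any namespace.

```shell
# Map a namespace kind to its CLONE_NEW* flag value (from <linux/sched.h>).
clone_flag() {
    case "$1" in
        mount) echo $(( 0x00020000 )) ;;  # CLONE_NEWNS
        uts)   echo $(( 0x04000000 )) ;;  # CLONE_NEWUTS
        ipc)   echo $(( 0x08000000 )) ;;  # CLONE_NEWIPC
        user)  echo $(( 0x10000000 )) ;;  # CLONE_NEWUSER
        pid)   echo $(( 0x20000000 )) ;;  # CLONE_NEWPID
        net)   echo $(( 0x40000000 )) ;;  # CLONE_NEWNET
        *)     echo "unknown namespace kind: $1" >&2; return 1 ;;
    esac
}

# OR together the flags for every kind given as an argument.
combine_flags() (
    flags=0
    for kind in "$@"; do
        value=$(clone_flag "$kind") || exit 1
        flags=$(( flags | value ))
    done
    printf '0x%08x\n' "$flags"
)

combine_flags pid net   # CLONE_NEWPID | CLONE_NEWNET
```

The unshare utility does the same thing behind its options: `unshare --pid --net` asks for both namespaces in a single call.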

Resource

anitsh commented 3 years ago

cgroups (abbreviated from control groups)

The cgroup namespace is a type of namespace that hides the identity of the control group of which a process is a member. A process in such a namespace, checking which control group any process is part of, would see a path that is actually relative to the control group set at creation time, hiding its true control group position and identity.

cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

A control group is a collection of processes that are bound by the same criteria and associated with a set of parameters or limits. These groups can be hierarchical, meaning that each group inherits limits from its parent group. The kernel provides access to multiple controllers (also called subsystems) through the cgroup interface;[2] for example, the "memory" controller limits memory use, "cpuacct" accounts CPU usage, etc.
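Which control groups a process belongs to can be read from /proc/<pid>/cgroup, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. The sketch below splits such a line. `cgroup_fields` is a hypothetical helper, and the two sample lines are illustrative values rather than output from a real machine.

```shell
# Split one line of /proc/<pid>/cgroup into its three fields.
# On cgroup v2 the controller list is empty, shown here as "none".
cgroup_fields() (
    line=$1
    id=${line%%:*}
    rest=${line#*:}
    controllers=${rest%%:*}
    path=${rest#*:}
    echo "hierarchy=$id controllers=${controllers:-none} path=$path"
)

cgroup_fields '4:memory:/user.slice'             # cgroup v1 style line
cgroup_fields '0::/user.slice/session-1.scope'   # cgroup v2 style line
```

The path is relative to the cgroup root that is visible to the reading process, which is where the cgroup namespace described above comes into play.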

Control groups can be used in multiple ways:

cgroups provides:

Kernfs

Kernfs is basically created by splitting off some of the sysfs logic into an independent entity, thus easing for other kernel subsystems the implementation of their own virtual file system with handling for device connect and disconnect, dynamic creation and removal, and other attributes.

Kernel memory control groups (kmemcg)

The kmemcg controller can limit the amount of memory that the kernel can utilize to manage its own internal processes.

anitsh commented 3 years ago

Seccomp

Seccomp (short for secure computing mode) is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors. Should it attempt any other system calls, the kernel will terminate the process with SIGKILL or SIGSYS. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

seccomp mode is enabled via the prctl system call using the PR_SET_SECCOMP argument, or (since Linux kernel 3.17) via the seccomp(2) system call. seccomp mode used to be enabled by writing to a file, /proc/self/seccomp, but this method was removed in favor of prctl(). In some kernel versions, seccomp disables the RDTSC x86 instruction, which returns the number of elapsed processor cycles since power-on, used for high-precision timing.[6]

seccomp-bpf is an extension to seccomp[7] that allows filtering of system calls using a configurable policy implemented using Berkeley Packet Filter rules. It is used by OpenSSH and vsftpd as well as the Google Chrome/Chromium web browsers on Chrome OS and Linux.[8] (In this regard seccomp-bpf achieves similar functionality, but with more flexibility and higher performance, to the older systrace—which seems to be no longer supported for Linux.)

Some consider seccomp comparable to OpenBSD pledge and FreeBSD capsicum.

There are two types of seccomp: mode 1 (strict) and mode 2 (filter). Mode 1 is extremely restrictive and, once enabled, only allows four syscalls. These syscalls are read(), write(), exit(), and rt_sigreturn(). A process is immediately sent the fatal SIGKILL signal from the kernel if it ever attempts to use a syscall that is not on the whitelist. This mode is the original seccomp mode and does not require generating and sending BPF bytecode to the kernel. A special syscall is made, after which mode 1 will be active for the lifetime of the process: seccomp(SECCOMP_SET_MODE_STRICT) or prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT). Once active, it cannot be turned off.
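The mode a process is currently in is exposed as the `Seccomp:` field of /proc/<pid>/status, with 0 meaning disabled, 1 strict and 2 filter. A minimal sketch for turning that number into a label follows. `seccomp_mode_name` is a name made up for this note.

```shell
# Translate the numeric Seccomp field from /proc/<pid>/status into a label.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

seccomp_mode_name 2   # the mode typically seen inside container runtimes that install a filter
```

On a live system the argument could come from `awk '/^Seccomp:/ { print $2 }' /proc/self/status`, assuming the kernel exposes that field.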

Resource

anitsh commented 3 years ago

AppArmor

AppArmor ("Application Armor") is a Linux kernel security module that allows the system administrator to restrict programs' capabilities with per-program profiles. Profiles can allow capabilities like network access, raw socket access, and the permission to read, write, or execute files on matching paths. AppArmor supplements the traditional Unix discretionary access control (DAC) model by providing mandatory access control (MAC). It has been partially included in the mainline Linux kernel since version 2.6.36 and its development has been supported by Canonical since 2009.

AppArmor gives you network application security via mandatory access control for programs, protecting against the exploitation of software flaws and compromised systems.

AppArmor consists of several different parts:

Resource

anitsh commented 3 years ago

Security-Enhanced Linux (SELinux)

SELinux can potentially control which activities a system allows each user, process, and daemon, with very precise specifications. It is used to confine daemons such as database engines or web servers that have clearly defined data access and activity rights. This limits potential harm from a confined daemon that becomes compromised.

Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies, including mandatory access controls (MAC).

SELinux is a set of kernel modifications and user-space tools that have been added to various Linux distributions. Its architecture strives to separate enforcement of security decisions from the security policy, and streamlines the amount of software involved with security policy enforcement.

SELinux features include:

Command-line utilities include: chcon, restorecon, restorecond, runcon, secon, fixfiles, setfiles, load_policy, booleans, getsebool, setsebool, togglesebool, setenforce, semodule, postfix-nochroot, check-selinux-installation, semodule_package, checkmodule, selinux-config-enforcing, selinuxenabled, and selinux-policy-upgrade.

Comparison with AppArmor

SELinux represents one of several possible approaches to the problem of restricting the actions that installed software can take. Another popular alternative is called AppArmor and is available on SUSE Linux Enterprise Server (SLES), openSUSE, and Debian-based platforms. AppArmor was developed as a component to the now-defunct Immunix Linux platform. Because AppArmor and SELinux differ radically from one another, they form distinct alternatives for software control. Whereas SELinux re-invents certain concepts to provide access to a more expressive set of policy choices, AppArmor was designed to be simple by extending the same administrative semantics used for DAC up to the mandatory access control level.

There are several key differences:

Resource

anitsh commented 3 years ago

Smack (Simplified Mandatory Access Control Kernel)

Smack (Simplified Mandatory Access Control Kernel) is a Linux kernel security module that protects data and process interaction from malicious manipulation using a set of custom mandatory access control (MAC) rules, with simplicity as its main design goal.

TOMOYO Linux

TOMOYO Linux is a Mandatory Access Control (MAC) implementation for Linux that can be used to increase the security of a system, while also being useful purely as a system analysis tool. It focuses on the behaviour of a system. Every process is created to achieve a purpose, and like an immigration officer, TOMOYO Linux allows each process to declare behaviours and resources needed to achieve their purpose. When protection is enabled, TOMOYO Linux acts like an operation watchdog, restricting each process to only the behaviours and resources allowed by the administrator.

The main features of TOMOYO Linux include:

Sysdig

Csysdig is Sysdig's new curses UI. Think of it as strace + htop + Lua, but with history, output customization, drill-down capability and incredible container support.
https://www.youtube.com/watch?v=UJ4wVrbP-Q8

Falco, the open-source cloud-native runtime security project, is the de facto Kubernetes threat detection engine. Falco detects unexpected application behavior and alerts on threats at runtime. Falco requires a driver to listen to the Linux Kernel. This driver can either be:

anitsh commented 3 years ago

pivot_root allows you to set a new root filesystem for the calling process, i.e. it allows you to change what / is. It does this by mounting the current root filesystem somewhere else while simultaneously mounting some new root filesystem on /. Once the previous root has been moved, it is then possible to umount it. Thus we have a mechanism for 'clearing' the host's mounts from inside a new Mount namespace: we simply pivot away and then umount them!

anitsh commented 3 years ago

eBPF #256

eBPF can run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.