Add audit and dtaudit support for batched system calls (e.g., closefrom(2), lio_listio(2))

rwatson commented 8 years ago

FreeBSD (and POSIX) support a small number of "batched" (compound?) system calls, which mitigate system-call overhead (or provide improved semantics) by operating on a number of kernel-implemented objects. Typically, this is done by providing a vector or range of file-descriptor numbers. As a result, these calls are not well supported by audit, which assumes that a single system call affects only a small number of objects (e.g., 2). Prime examples of this are closefrom(2), which can close a range of file descriptors, and lio_listio(2), which submits a vector of asynchronous I/O operations that are potentially on many different file descriptors. To properly capture object information, we might consider taking one of two (or more?) possible approaches:

1. Extend the audit mechanism to allow vectors of objects to be described by audit records, both internally (in struct audit_record) and in the output format (in BSM). The BSM format lends itself to this by virtue of supporting additional tokens, although there might be length and parsing concerns. The in-kernel audit record might need to support dynamic memory allocation.
2. Extend the audit mechanism to allow a series of audit records to be submitted, each describing a successful (or failed) element of the overall operation, as well as an overall audit record for the containing system call. These would need to be cross-indexed somehow -- e.g., by a shared unique identifier linking the container and the contributing operation records. This might run into the bound on outstanding audit records, and would require extensions to how audit records are allocated and annotated when a record is not the 'default' record for the system call that the AUDIT_ARG...() macros use.
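
To make the cross-indexing in the second approach concrete, here is a minimal userspace sketch in Python. It is a model only, not kernel or BSM code: `ContainerRecord`, `ElementRecord`, `audited_close_each` and the `batch_id` field are hypothetical names invented for illustration, and the record layout is an assumption. It closes a list of descriptors one by one with `os.close()` and commits one element record per descriptor, each carrying the identifier of the enclosing record and its own error status.

```python
import errno
import os
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElementRecord:
    """One constituent operation (hypothetical layout, not a BSM record)."""
    batch_id: str          # shared identifier linking back to the container
    index: int             # position of this element within the batch
    fd: int                # descriptor the element operated on
    error: Optional[int]   # errno value, or None if the element succeeded


@dataclass
class ContainerRecord:
    """The enclosing batched call (hypothetical layout, not a BSM record)."""
    batch_id: str
    syscall: str
    elements: List[ElementRecord] = field(default_factory=list)


def audited_close_each(fds: List[int]) -> ContainerRecord:
    """Close every descriptor in `fds`, recording one element per close."""
    container = ContainerRecord(batch_id=uuid.uuid4().hex, syscall="closefrom")
    for index, fd in enumerate(fds):
        try:
            os.close(fd)
            error = None
        except OSError as exc:
            error = exc.errno  # per-element failure, kept with the element
        container.elements.append(
            ElementRecord(container.batch_id, index, fd, error))
    return container


if __name__ == "__main__":
    r, w = os.pipe()
    # The second close of `w` is expected to fail, giving a failed element.
    record = audited_close_each([r, w, w])
    for e in record.elements:
        status = "ok" if e.error is None else errno.errorcode[e.error]
        print(e.batch_id, e.index, e.fd, status)
```

A consumer can group element records by `batch_id` to reassemble the compound operation, or process each element on its own as if it were an ordinary close(2) record.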

My initial leaning is to pursue the latter approach, which would lead to independent dtaudit probes firing for each element, each represented by its own record commit, which might be easier to process. This would also establish timestamps for the elements of the compound system call, independent return values (e.g., if close(2) would have returned an error for one of the open file descriptors), etc.

This issue is to track addressing this general problem. Among other aspects, a more thorough survey of system calls, both in the FreeBSD ABI and in other ABIs such as Linux, needs to be performed to see whether there may be further requirements beyond those identified here.

(This issue may be of interest to @lc525, @arunthomas.)

lc525 commented 8 years ago

With regard to separate records (solution 2) vs one large record (solution 1): there is no difference from the perspective of analysis/query tools as long as semantics are preserved, although performance considerations might come into play.

And related to performance, I'm thinking some cases might be dealt with in post-processing. For example, OPUS maintains a state machine for each process (tracking things like file descriptors, process relationships, etc.). When seeing a closefrom(int fd) in the trace, we know which file descriptors greater than or equal to fd are open in the process, and can push "virtual" close events for each of them towards the provenance database.
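
The post-processing idea can be sketched as a small per-process descriptor table. This is an illustration only and not OPUS code: `FdTracker` and its method names are hypothetical, the event tuples are an assumed format, and the sketch assumes the trace reports every event that creates or closes a descriptor. On a closefrom record it expands the call into one virtual close per tracked descriptor at or above the given number.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class FdTracker:
    """Tracks open descriptors per process from a stream of trace events."""

    def __init__(self) -> None:
        # pid -> set of descriptor numbers believed to be open
        self.open_fds: Dict[int, Set[int]] = defaultdict(set)

    def on_open(self, pid: int, fd: int) -> None:
        self.open_fds[pid].add(fd)

    def on_close(self, pid: int, fd: int) -> None:
        self.open_fds[pid].discard(fd)

    def on_closefrom(self, pid: int, lowfd: int) -> List[Tuple[str, int, int]]:
        """Expand closefrom(lowfd) into virtual close events.

        closefrom(2) closes every open descriptor >= lowfd, so each tracked
        descriptor in that range yields one ("close", pid, fd) event.
        """
        closed = sorted(fd for fd in self.open_fds[pid] if fd >= lowfd)
        self.open_fds[pid].difference_update(closed)
        return [("close", pid, fd) for fd in closed]


if __name__ == "__main__":
    tracker = FdTracker()
    for fd in (0, 1, 2, 5, 9):
        tracker.on_open(100, fd)
    for event in tracker.on_closefrom(100, 3):
        print(event)
```

This reconstruction only recovers which descriptors were affected; it cannot recover per-descriptor outcomes, which is where the failure-mode concern applies.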

However, there is the issue of failure modes: closefrom(2) cannot fail (afaik), but other batched calls might, at which point we don't have enough information about what was committed and what was not. So unless the kernel always guarantees transaction-like qualities, separate records describing the fate of each operation might be the right solution.

I'll put together a list of problematic system calls (from both the UUID perspective as well as lack of required arguments/nice-to-haves, etc) as I've gone through most of them individually for creating the CADETS->OPUS translation layer. This is currently postponed until after the demo is ready.

rwatson commented 8 years ago

Although closefrom(2) cannot fail (it has no return value), the constituent close(2) operations can return failures (e.g., on AFS write-back failure). So while we're not strictly required to audit those results, it wouldn't hurt to do so.

In the case of lio_listio, the system call can return an error even though some I/O operations have in fact been started. In this case, per-operation errors would provide more granular information about which operations were started, and which failed.

So it sounds like there probably is value in capturing constituent operation failure modes.

rwatson commented 8 years ago

(On the topic of close(2) failures, though: the defined behaviour is that, although the filesystem can report an error (...) when closing an open file descriptor, the file descriptor will still have been closed on return, so the error is advisory.)

rwatson commented 8 years ago

Arguably, system calls such as kill(2) are also affected by this, as they may act on multiple processes. Currently, we audit information on only the last process operated on, but really we should capture information on all affected processes.
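
As a rough illustration of why kill(2) belongs in this category, the sketch below expands the `pid` argument into the set of processes it addresses, following the POSIX rules for positive, zero, and negative values. It is a model under stated assumptions: `kill_targets` and the `pgid_of` table are hypothetical, permission checks are not modelled (so the real set of signalled processes may be smaller), and `pid == -1` is left unimplemented because its target set depends on the caller's credentials.

```python
from typing import Dict, List


def kill_targets(pgid_of: Dict[int, int], caller: int, pid: int) -> List[int]:
    """Return the pids addressed by kill(pid, sig) issued by `caller`.

    `pgid_of` maps each known process ID to its process group ID.
    Permission checks are deliberately ignored in this sketch.
    """
    if pid > 0:
        # A positive pid names exactly one process.
        return [pid] if pid in pgid_of else []
    if pid == -1:
        # Broadcast case: the target set depends on caller credentials.
        raise NotImplementedError("kill(-1, ...) is not modelled here")
    # pid == 0 addresses the caller's own process group;
    # pid < -1 addresses the process group whose ID is abs(pid).
    group = pgid_of[caller] if pid == 0 else -pid
    return sorted(p for p, g in pgid_of.items() if g == group)


if __name__ == "__main__":
    table = {10: 10, 11: 10, 12: 10, 20: 20}
    print(kill_targets(table, caller=11, pid=0))
    print(kill_targets(table, caller=11, pid=-20))
```

Each pid in the returned list is a candidate for its own audit record or token, in the same way as the elements of closefrom(2) or lio_listio(2).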

cadets / freebsd-old

Add audit and dtaudit support for batched system calls (e.g., closefrom(2), lio_listio(2)) #37