System-call metadata - Githubissues

trombonehero commented 7 years ago

It may be useful to allow system calls to accept and return semi-arbitrary metadata via some out-of-band mechanism à la errno.

trombonehero commented 7 years ago

Example use cases include:

call numbering (to link kernel- and user-level call traces)
UUID-based descriptor uniquifying (to eliminate FD races in traces)
retrieving other labels atomically with open
accounting

trombonehero commented 7 years ago

As a first effort, we should add:

read_uuid(2): perform a normal read(2) but store the file's UUID in a buffer provided by userspace
mmap_uuid(2): perform a normal mmap(2) but store the file's UUID in a buffer provided by userspace
write_uuid(2): accept a vector of UUIDs denoting sources alongside the actual write(2)

rwatson commented 7 years ago

Current thinking is that the additional arguments would be a struct iovec and a count, which would point at an array of UUIDs to be read or written by the kernel. It could be that read(2) and friends always return a single UUID, but write(2) will need to accept multiple UUIDs to support tracking mixes of data sources combined in the process and written out together.

rwatson commented 7 years ago

In the first pass, I plan to introduce the following new system calls. If there are other system calls you feel are important in the short term, please follow up on this issue:

read_md(2), readv_md(2), pread_md(2), preadv_md(2) - will return additional metadata
recv_md(2), recvfrom_md(2), recvmsg_md(2) - will return additional metadata
write_md(2), writev_md(2), pwrite_md(2), pwritev_md(2) - will accept additional metadata
send_md(2), sendto_md(2), sendmsg_md(2) - will accept additional metadata
sendfile_md(2) - will accept additional metadata (associated with the header data being sent from userspace)

In the longer term, we will likely want to hit more I/O functions including the aio(4) asynchronous I/O system calls. I have attempted to focus, in the above, on widely used data read and write calls, rather than filesystem or socket metadata (see below).

I wonder also if we want to be able to express additional metadata input and output arguments for the following system calls:

open(2), openat(2)
stat(2), fstat(2), ...
accept(2), bind(2), bindat(2), connect(2), connectat(2)
link(2), linkat(2), mkdir(2), mkdirat(2), symlink(2), symlinkat(2)
close(2), closefrom(2)
readdir(2)
Other operations on the directory structure..?
Other filesystem metadata operations such as chmod(2), chown(2), chflags(2), ...acl...(2), ...extattr...(2)?

(Names for new system calls are subject to revision -- suggestions here would be most welcome, as I don't really like these ones!)

rwatson commented 7 years ago

Oh, neglected the following:

mmap_md(2) - will return additional metadata

rwatson commented 7 years ago

My current leaning is to use a metadata structure like this in the first cut, with full awareness of the limitations it imposes on what can be represented, potential future binary compatibility, etc:

struct syscall_metadata {
    uint64_t sm_syscall_id;   // Uniquely identify system call in trace (read/receive only)
    uint64_t _sm_padding0;
    uint64_t _sm_padding1;
    uint64_t _sm_padding2;
    struct uuid sm_uuid[1];   // UUID that originated this data - arbitrarily 1 UUID for now
};

In the future, we would like to support:

message ID and byte-range information for read/receive I/O, to allow that information to be propagated as well, where available
message ID and byte-range information for write/send I/O, to allow that information to be propagated as well, where available
multiple UUIDs as the destination of write/send I/O, to allow "mixing" of data in user space to be captured

trombonehero commented 7 years ago

Since we have padding space, might it make sense to also include the tid in this structure?

I think it's probably fine to assume a single UUID for now, but we will definitely want to revisit this once we get past simple applications like cp. Once we start instrumenting network applications I expect we'll see a lot of "socket UUID + local file UUID" mixing.

trombonehero commented 7 years ago

Also, it's too bad that a cap_rights_t no longer fits in 64b. :)

trombonehero commented 7 years ago

Perhaps _meta would be a useful suffix? md is already a bit overloaded (machine-dependent, memory disk). For that matter, we could even use something longer like _with_metadata: the intent is not for programmers to have to type these names.

rwatson commented 7 years ago

On the tid: is the aim there just to provide convenient access to the kernel's notion of the thread ID when returning control to userspace after a read(2)/recv(2) system call, or are you also interested in propagating thread information through write(2)/send(2) so that the audit trail on the latter includes some notion of originating thread?

rwatson commented 7 years ago

(if the latter, the system-call ID is presumably sufficient as it cross-references another audit record that already contains the thread ID .. but if the former (i.e., to index a per-thread structure) then that's very easy to do!)

trombonehero commented 7 years ago

I had been thinking of the former: an additional piece of information to reconcile userspace and kernel traces. Perhaps the system-call ID will be unique across threads, but I imagine we won't want to guarantee that.

rwatson commented 7 years ago

Right now our message-ID primitive (#53) provides globally unique message identifiers by combining an 8-bit CPU ID with a 56-bit atomically incremented message ID. While 56 bits is quite a lot, I agree it's probably not large enough to uniquely capture all system calls in a system over the lifetime of the system. It's a good question how strong a uniqueness property we want to provide, and whether we can allocate/manage efficiently using solely a 64-bit value -- e.g., do we also need a larger per-CPU epoch pushing us up to 96 bits? (It's not just about the uniqueness of those values, but also about allocating those values efficiently, hence the CPU ID + counter design...) An interesting question is whether the thread ID helps with that uniqueness: if you have a long-lived application (e.g., years) with one thread per host CPU, then it's no more unique than the CPU ID...

rwatson commented 7 years ago

(And just to be explicit, since I think we're both implicitly talking in these terms anyway: we would ideally be resistant to an adversary who is eager to exhaust message IDs, system call IDs, etc..? And also work well for a cooperative application that might be tightly interwoven over multiple processes -- e.g., a Postgres/Oracle-like application?)

rssohan commented 7 years ago

Might I suggest extending existing syscalls with an "md" parameter? This:

a. Provides a consistent md interface/interpretation for all system calls.
b. Can be easily IFDEFed out if not needed.
c. Doesn't extend the syscall table with new syscalls that are specialisations of existing ones.
d. Can be implemented piecemeal, with a lower barrier to entry than new syscalls.
e. Allows tighter integration with existing syscall infrastructure.

rssohan commented 7 years ago

Another approach, which might be worthy of evaluation: in Resourceful we used a k/u shared memory page + character-device-based cplane to allow the uspace application to provide and obtain information associated with the next and just-returned syscall. This allowed it to be implemented as a kernel module with zero changes to the syscall table. We found it worked really well, with low overhead, for our use case.

rwatson commented 7 years ago

@rssohan: we had a long chat in Boston about a more generic system-call metadata mechanism avoiding explicit argument use but concluded that explicit system calls substituted by the Loom compiler/runtime where needed provided the easiest short-term demonstration path while simultaneously pursuing a longer-term approach. (And took an as-yet unpursued TODO to chase up OPUS folks to ask about the mechanics you use as well).

It turned out to be remarkably hard to specify a race-free metadata mechanism for system calls, due to signal-handling semantics. For example, signals can be delivered on the last instruction before a system call is made (potentially multiple times), or before the first instruction on system-call return (similarly potentially multiple times). Signals themselves are (pretty much) guaranteed to make at least one system call (sigreturn(2)), and may make more (despite general recommendations to be cautious about this). This makes notions of "the next system call" and "the previous system call" hard to reason about or depend on from generated user code, especially if multiple signals fire in a row, or if the timer signal is being used to implement threading-like preemption services, where the signal handler may return to a different PC than natural code flow in the interrupted code would suggest.

The future model we came up with (feedback most welcome) is to extend the system-call ABI (e.g., via an additional optional value on the stack or register -- details TBD but likely machine-architecture/ABI-specific) with a system-call sequence number that could then be used to uniquely identify data in a shared page where ambiguity might otherwise arise. However, on the basis that we'd like to prototype the Loom aspects sooner, a set of temporary system-call extensions seemed like a plausible short-term route -- especially if they are utilised only by Loom and explicit test cases, and do not undergo a more general propagation to user code. This would allow us to experiment with both user propagation and audit semantics in the interim, while continuing to ponder a "right" long-term solution that is suitably concurrency safe.

Does this seem sensible?

rssohan commented 7 years ago

Sure, this seems sensible. Sorry, I didn't know the Boston meeting context when I wrote it.

rwatson commented 7 years ago

Commit 96ec5f8cf99ddbdc6d401982872bca7c4f361443 adds test cases for various metaio(2) system calls to confirm that I/O on various file-descriptor types returns the correct underlying object UUID.

rwatson commented 7 years ago

Marking as 'feedback' to seek feedback from @trombonehero as he experiments with these APIs. To use the features, he will need to compile a kernel that contains at least options AUDIT, options KDTRACE_HOOKS, and options METAIO, and utilise the DTrace audit provider to monitor ar_arg_objuuid1/ar_arg_objuuid2/ar_ret_objuuid1/ar_ret_objuuid2 (where argument and returned object UUIDs are stored) and ar_arg_metaio (which is where user-submitted metaio state ends up). This should be sufficient to track data provenance across the sample cp_metaio(1) command.

rwatson commented 7 years ago

This simple (boring) test script, combined with the test program cp_metaio(1), should show off a combination of argument/return UUIDs and argument metaio:

#pragma D option quiet

#define ARG_UPATH1              0x0000000002000000ULL
#define ARG_UPATH2              0x0000000004000000ULL
#define ARG_OBJUUID1            0x0080000000000000ULL
#define ARG_OBJUUID2            0x0100000000000000ULL
#define ARG_METAIO              0x0400000000000000ULL
#define RET_OBJUUID1            0x0000000000000001ULL
#define RET_OBJUUID2            0x0000000000000002ULL

#define ARG_HAS_UPATH1(ar)      ((ar)->ar_valid_arg & ARG_UPATH1)
#define ARG_HAS_OBJUUID1(ar)    ((ar)->ar_valid_arg & ARG_OBJUUID1)
#define ARG_HAS_UPATH2(ar)      ((ar)->ar_valid_arg & ARG_UPATH2)
#define ARG_HAS_OBJUUID2(ar)    ((ar)->ar_valid_arg & ARG_OBJUUID2)
#define ARG_HAS_METAIO(ar)      ((ar)->ar_valid_arg & ARG_METAIO)
#define RET_HAS_OBJUUID1(ar)    ((ar)->ar_valid_ret & RET_OBJUUID1)
#define RET_HAS_OBJUUID2(ar)    ((ar)->ar_valid_ret & RET_OBJUUID2)

audit:::commit
/execname == "cp_metaio" &&
  (ARG_HAS_OBJUUID1(args[1]) || ARG_HAS_OBJUUID2(args[1]) ||
   ARG_HAS_METAIO(args[1]) || RET_HAS_OBJUUID1(args[1]) ||
   RET_HAS_OBJUUID2(args[1]))/
{
        printf("%s:%s:%s:%s:\n", probeprov, probemod, probefunc, probename);
        printf("  path1: %s\n", ARG_HAS_UPATH1(args[1]) ?
            args[1]->ar_arg_upath1 : "-");
        printf("  arg1: %s\n", ARG_HAS_OBJUUID1(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
        printf("  path2: %s\n", ARG_HAS_UPATH2(args[1]) ?
            args[1]->ar_arg_upath2 : "-");
        printf("  arg2: %s\n", ARG_HAS_OBJUUID2(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
        printf("  metaio: %s\n", ARG_HAS_METAIO(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_metaio.mio_uuid) : "-");
        printf("  ret1: %s\n", RET_HAS_OBJUUID1(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
        printf("  ret2: %s\n", RET_HAS_OBJUUID2(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
}

And should be executed using: sudo dtrace -Cs metaio.d

It should return sequences such as the following:

audit:event:aue_openat_rwtc:commit:
  path1: /usr/home/robert/foo/bar
  arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
  path2: -
  arg2: -
  metaio: -
  ret1: c88b040e-e1e7-4653-a7e1-cccd93462876
  ret2: -
audit:event:aue_openat_rwtc:commit:
  path1: /usr/home/robert/foo/bar.1
  arg1: 455848c2-8218-4859-9882-c3b4394845fb
  path2: -
  arg2: -
  metaio: -
  ret1: 455848c2-8218-4859-9882-c3b4394845fb
  ret2: -
audit:event:aue_read:commit:
  path1: -
  arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
  path2: -
  arg2: -
  metaio: -
  ret1: -
  ret2: -
audit:event:aue_write:commit:
  path1: -
  arg1: 455848c2-8218-4859-9882-c3b4394845fb
  path2: -
  arg2: -
  metaio: c88b040e-e1e7-4653-a7e1-cccd93462876
  ret1: -
  ret2: -
audit:event:aue_read:commit:
  path1: -
  arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
  path2: -
  arg2: -
  metaio: -
  ret1: -
  ret2: -
audit:event:aue_close:commit:
  path1: -
  arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
  path2: -
  arg2: -
  metaio: -
  ret1: -
  ret2: -
audit:event:aue_close:commit:
  path1: -
  arg1: 455848c2-8218-4859-9882-c3b4394845fb
  path2: -
  arg2: -
  metaio: -
  ret1: -
  ret2: -

In which the contents of the file bar are copied into the file bar.1, with cp_metaio(1) propagating input metadata, retrieved on metaio_read(2), to the output via metaio_write(2).

trombonehero commented 7 years ago

Trying again with options METAIO properly enabled...

rwatson commented 7 years ago

In principle, failing to compile in options METAIO should lead to cp_metaio(1) failing with a missing system call, as we currently return ENOSYS in the event that the option is not present. However, it could be that we also need to set the second return-value register so that the caller detects this...

trombonehero commented 7 years ago

Success:

audit:event:aue_write:commit:
  path1: -
  arg1: 2d788e23-ebe7-ae52-a7eb-213f42aef351
  path2: -
  arg2: -
  metaio: 27a824e1-f4c1-fa59-81f4-c70b59fabd28
  ret1: -
  ret2: -

trombonehero commented 7 years ago

Removing feedback label for now: the current approach with manual C munging seems to work well, and I have an approach for reproducing it with LLVM. Will tag commits as they come...

rwatson commented 7 years ago

Woohoo, etc!

trombonehero commented 7 years ago

In theory, the above commits should implement what we need to instrument applications like cp(1) that only require intraprocedural information flow tracking. I will test this theory... likely on Monday.

rwatson commented 7 years ago

There are a few bugs in the above DTrace script, which incorrectly uses argument object UUIDs instead of return object UUIDs in a few places. This is the preferred version:

#pragma D option quiet

#define ARG_UPATH1              0x0000000002000000ULL
#define ARG_UPATH2              0x0000000004000000ULL
#define ARG_OBJUUID1            0x0080000000000000ULL
#define ARG_OBJUUID2            0x0100000000000000ULL
#define ARG_METAIO              0x0400000000000000ULL
#define RET_OBJUUID1            0x0000000000000001ULL
#define RET_OBJUUID2            0x0000000000000002ULL
#define RET_METAIO              0x0000000000000040ULL

#define ARG_HAS_UPATH1(ar)      ((ar)->ar_valid_arg & ARG_UPATH1)
#define ARG_HAS_OBJUUID1(ar)    ((ar)->ar_valid_arg & ARG_OBJUUID1)
#define ARG_HAS_UPATH2(ar)      ((ar)->ar_valid_arg & ARG_UPATH2)
#define ARG_HAS_OBJUUID2(ar)    ((ar)->ar_valid_arg & ARG_OBJUUID2)
#define ARG_HAS_METAIO(ar)      ((ar)->ar_valid_arg & ARG_METAIO)
#define RET_HAS_OBJUUID1(ar)    ((ar)->ar_valid_ret & RET_OBJUUID1)
#define RET_HAS_OBJUUID2(ar)    ((ar)->ar_valid_ret & RET_OBJUUID2)
#define RET_HAS_METAIO(ar)      ((ar)->ar_valid_ret & RET_METAIO)

audit:::commit
/execname == "shmtest" &&
  (ARG_HAS_OBJUUID1(args[1]) || ARG_HAS_OBJUUID2(args[1]) ||
   ARG_HAS_METAIO(args[1]) || RET_HAS_OBJUUID1(args[1]) ||
   RET_HAS_OBJUUID2(args[1]) || RET_HAS_METAIO(args[1]))/
{
        printf("%s:%s:%s:%s:\n", probeprov, probemod, probefunc, probename);
        printf("  %x %x\n", args[1]->ar_valid_arg, args[1]->ar_valid_ret);
        printf("  path1: %s\n", ARG_HAS_UPATH1(args[1]) ?
            args[1]->ar_arg_upath1 : "-");
        printf("  arg_objuuid1: %s\n", ARG_HAS_OBJUUID1(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
        printf("  path2: %s\n", ARG_HAS_UPATH2(args[1]) ?
            args[1]->ar_arg_upath2 : "-");
        printf("  arg_objuuid2: %s\n", ARG_HAS_OBJUUID2(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
        printf("  arg_metaio: %s\n", ARG_HAS_METAIO(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_arg_metaio.mio_uuid) : "-");
        printf("  ret_objuuid1: %s\n", RET_HAS_OBJUUID1(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_ret_objuuid1) : "-");
        printf("  ret_objuuid2: %s\n", RET_HAS_OBJUUID2(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_ret_objuuid2) : "-");
        printf("  ret_metaio: %s\n", RET_HAS_METAIO(args[1]) ?
            uuidtostr((intptr_t)&args[1]->ar_ret_metaio.mio_uuid) : "-");
}

cadets / freebsd-old

System-call metadata #58