Open trombonehero opened 7 years ago
Example use cases include:
open
As a first effort, we should add:
read_uuid(2): perform a normal read(2) but store the file's UUID in a buffer provided by userspace
mmap_uuid(2): perform a normal mmap(2) but store the file's UUID in a buffer provided by userspace
write_uuid(2): accept a vector of UUIDs denoting sources alongside the actual write(2)
Current thinking is that the additional arguments would be a struct iovec and a count, which would point at an array of UUIDs to be read or written by the kernel. It could be that read(2) and friends always return a single UUID, but write(2) will need to accept multiple UUIDs to support tracking mixes of data sources combined in the process and written out together.
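To make the "struct iovec and a count" idea concrete, here is a minimal userspace sketch of how a caller might describe several source UUIDs for a single write. The `write_uuid` prototype in the comment, the `src_uuid` type, and the `uuid_iov_fill` helper are all hypothetical names chosen for illustration; nothing here is an existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Stand-in for the kernel's UUID type (assumption: 16 opaque bytes). */
struct src_uuid {
	uint8_t bytes[16];
};

/*
 * Hypothetical prototype, not an existing system call:
 *   ssize_t write_uuid(int fd, const void *buf, size_t nbytes,
 *       const struct iovec *uuidv, int uuidcnt);
 */

/*
 * Point each iovec entry at one source UUID.  Returns the number of
 * entries filled, or -1 if the counts do not fit.
 */
static int
uuid_iov_fill(struct iovec *iov, int iovcnt, struct src_uuid *uuids,
    int nuuids)
{
	assert(iov != NULL && uuids != NULL);
	if (nuuids < 0 || nuuids > iovcnt)
		return (-1);
	for (int i = 0; i < nuuids; i++) {
		iov[i].iov_base = &uuids[i];
		iov[i].iov_len = sizeof(uuids[i]);
	}
	return (nuuids);
}
```

A read-side call that always returns one UUID would pass a one-entry vector, while a write mixing socket and file data would pass one entry per source.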
In the first pass, I plan to introduce the following new system calls. If there are other system calls you feel are important in the short term, please follow up on this issue:
read_md(2), readv_md(2), pread_md(2), preadv_md(2) - will return additional metadata
recv_md(2), recvfrom_md(2), recvmsg_md(2) - will return additional metadata
write_md(2), writev_md(2), pwrite_md(2), pwritev_md(2) - will accept additional metadata
send_md(2), sendto_md(2), sendmsg_md(2) - will accept additional metadata
sendfile_md(2) - will accept additional metadata (associated with the header data being sent from userspace)
In the longer term, we will likely want to hit more I/O functions, including the aio(4) asynchronous I/O system calls. I have attempted to focus, in the above, on widely used data read and write calls, rather than filesystem or socket metadata (see below).
I wonder also if we want to be able to express additional metadata input and output arguments for the following system calls:
open(2), openat(2)
stat(2), fstat(2), ...
accept(2), bind(2), bindat(2), connect(2), connectat(2)
link(2), linkat(2), mkdir(2), mkdirat(2), symlink(2), symlinkat(2)
close(2), closefrom(2)
readdir(2)
chmod(2), chown(2), chflags(2), ...acl...(2), ...extattr...(2)?
(Names for new system calls are subject to revision -- suggestions here would be most welcome, as I don't really like these ones!)
Oh, neglected the following:
mmap_md(2) - will return additional metadata
My current leaning is to use a metadata structure like this in the first cut, with full awareness of the limitations it imposes on what can be represented, potential future binary compatibility, etc.:
struct syscall_metadata {
	uint64_t	sm_syscall_id;	// Uniquely identify system call in trace (read/receive only)
	uint64_t	_sm_padding0;
	uint64_t	_sm_padding1;
	uint64_t	_sm_padding2;
	struct uuid	sm_uuid[1];	// UUID that originated this data - arbitrarily 1 UUID for now
};
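Since binary compatibility is a stated concern, one option is to pin the layout at compile time so that later reuse of the padding words cannot silently change the size or the UUID offset. The sketch below assumes a 16-byte UUID and uses a local stand-in type (`uuid16`) rather than the real `struct uuid`; the `syscall_metadata_init` helper is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct uuid (assumption: 16 bytes). */
struct uuid16 {
	uint8_t b[16];
};

struct syscall_metadata {
	uint64_t	sm_syscall_id;
	uint64_t	_sm_padding0;
	uint64_t	_sm_padding1;
	uint64_t	_sm_padding2;
	struct uuid16	sm_uuid[1];
};

/* Four 64-bit words followed by one 16-byte UUID. */
_Static_assert(sizeof(struct syscall_metadata) == 48,
    "syscall_metadata size changed");
_Static_assert(offsetof(struct syscall_metadata, sm_uuid) == 32,
    "sm_uuid offset changed");

/* Zero the record (including padding) and optionally set the source UUID. */
static void
syscall_metadata_init(struct syscall_metadata *sm, const struct uuid16 *src)
{
	assert(sm != NULL);
	memset(sm, 0, sizeof(*sm));
	if (src != NULL)
		sm->sm_uuid[0] = *src;
}
```

Zeroing the padding on every use keeps those words available for later fields, since old callers will then always present them as zero.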
In the future, we would like to support:
Since we have padding space, might it make sense to also include the tid in this structure?
I think it's probably fine to assume a single UUID for now, but we will definitely want to revisit this once we get past simple applications like cp. Once we start instrumenting network applications I expect we'll see a lot of "socket UUID + local file UUID" mixing.
Also, it's too bad that a cap_rights_t no longer fits in 64 bits. :)
Perhaps _meta would be a useful suffix? md is already a bit overloaded (machine-dependent, memory disk). For that matter, we could even use something longer like _with_metadata: the intent is not for programmers to have to type these names.
On the tid: is the aim there just to provide convenient access to the kernel's notion of the thread ID when returning control to userspace after a read(2)/recv(2) system call, or are you also interested in propagating thread information through write(2)/send(2) so that the audit trail on the latter includes some notion of originating thread?
(If the latter, the system-call ID is presumably sufficient, as it cross-references another audit record that already contains the thread ID; but if the former (i.e., to index a per-thread structure), then that's very easy to do!)
I had been thinking of the former: an additional piece of information to reconcile userspace and kernel traces. Perhaps the system-call ID will be unique across threads, but I imagine we won't want to guarantee that.
Right now our message-ID primitive (#53) provides globally unique message identifiers by combining an 8-bit CPU ID with a 56-bit atomically incremented message ID. While 56 bits is quite a lot, I agree it's probably not large enough to uniquely capture all system calls in a system over the lifetime of the system. It's a good question how strong a uniqueness property we want to provide, and whether we can allocate/manage efficiently using solely a 64-bit value -- e.g., do we also need a larger per-CPU epoch pushing us up to 96 bits? (It's not just about the uniqueness of those values, but also about allocating them efficiently, hence the CPU ID + counter design.) An interesting question is whether the thread ID helps with that uniqueness: if you have a long-lived application (e.g., years) with one thread per host CPU, then it's no more unique than the CPU ID...
(And just to be explicit, since I think we're both implicitly talking in these terms anyway: we would ideally be resistant to an adversary who is eager to exhaust message IDs, system-call IDs, etc.? And also work well for a cooperative application that might be tightly interwoven over multiple processes -- e.g., a Postgres/Oracle-like application?)
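For reference, the CPU ID + counter scheme described above can be sketched as a simple pack/unpack pair. The bit placement (CPU ID in the top 8 bits) is an assumption made for illustration, as are all the `msgid_*` names; the atomic increment and any epoch extension are not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define	MSGID_SEQ_BITS	56
#define	MSGID_SEQ_MASK	((UINT64_C(1) << MSGID_SEQ_BITS) - 1)

/* Assumption: CPU ID in the top 8 bits, counter in the low 56 bits. */
static uint64_t
msgid_pack(uint8_t cpu, uint64_t seq)
{
	/* A counter that reaches 2^56 wraps back to zero here. */
	return (((uint64_t)cpu << MSGID_SEQ_BITS) | (seq & MSGID_SEQ_MASK));
}

/* Recover the CPU ID from a packed identifier. */
static uint8_t
msgid_cpu(uint64_t id)
{
	return ((uint8_t)(id >> MSGID_SEQ_BITS));
}

/* Recover the per-CPU counter value from a packed identifier. */
static uint64_t
msgid_seq(uint64_t id)
{
	return (id & MSGID_SEQ_MASK);
}
```

The wrap in `msgid_pack` is exactly where uniqueness is lost: once a CPU's counter passes 2^56, identifiers repeat unless something like a per-CPU epoch is carried alongside.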
Might I suggest extending existing syscalls with an "md" parameter? This:
a. Provides a consistent md interface/interpretation for all system calls.
b. Can be easily #ifdef'ed out if not needed.
c. Doesn't extend the syscall table with new syscalls that are specialisations of existing ones.
d. Can be implemented piecemeal, with a lower barrier to entry than a new syscall.
e. Gives tighter integration with existing syscall infrastructure.
Another approach, which might be worthy of evaluation: in Resourceful we used a k/u shared memory page plus a character-device-based cplane to allow the uspace application to provide and obtain information associated with the next and just-returned syscall. This allowed it to be implemented as a kernel module with zero changes to the syscall table. We found it worked really well, with low overhead, for our use case.
@rssohan: we had a long chat in Boston about a more generic system-call metadata mechanism avoiding explicit argument use but concluded that explicit system calls substituted by the Loom compiler/runtime where needed provided the easiest short-term demonstration path while simultaneously pursuing a longer-term approach. (And took an as-yet unpursued TODO to chase up OPUS folks to ask about the mechanics you use as well).
It turned out to be remarkably hard to specify a race-free metadata mechanism for system calls, due to signal-handling semantics. For example, signals can be delivered on the last instruction before a system call is made (potentially multiple times), or before the first instruction on system-call return (similarly potentially multiple times). Signal handlers themselves are (pretty much) guaranteed to make at least one system call (sigreturn(2)), and may make more (despite general recommendations to be cautious about this), making notions of "the next system call" and "the previous system call" hard to reason about or depend on from generated user code (especially if multiple signals fire in a row, or if the timer signal is being used to implement threading-like preemption services, where the signal handler may return to a different PC than natural code flow in interrupted code would suggest).
The future model we came up with (feedback most welcome) is to extend the system-call ABI (e.g., via an additional optional value on the stack or register -- details TBD but likely machine-architecture/ABI-specific) with a system-call sequence number that could then be used to uniquely identify data in a shared page where ambiguity might otherwise arise. However, on the basis that we'd like to prototype the Loom aspects sooner, a set of temporary system-call extensions seemed like a plausible short-term route -- especially if they are utilised only by Loom and explicit test cases, and do not undergo a more general propagation to user code. This would allow us to experiment with both user propagation and audit semantics in the interim, while continuing to ponder a "right" long-term solution that is suitably concurrency safe.
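As a rough illustration of why a sequence number helps, the sketch below tags each slot in a shared page with the sequence number of the call that produced it, so a reader asks for "call N" rather than "the previous call". Everything here is hypothetical (the `md_page` layout, the slot count, the helper names, and the convention that sequence numbers start at 1), and no kernel/user concurrency is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MD_SLOTS	8	/* hypothetical number of slots in the page */

struct md_slot {
	uint64_t seq;		/* sequence number of the owning call; 0 = empty */
	uint64_t payload;	/* placeholder for per-call metadata */
};

struct md_page {
	struct md_slot slots[MD_SLOTS];
};

/* Producer side: record metadata for system call number seq. */
static void
md_publish(struct md_page *p, uint64_t seq, uint64_t payload)
{
	assert(p != NULL && seq != 0);
	struct md_slot *s = &p->slots[seq % MD_SLOTS];
	s->seq = seq;
	s->payload = payload;
}

/*
 * Consumer side: fetch metadata for call seq, or NULL if the slot is
 * empty or has since been reused by a later call.
 */
static const struct md_slot *
md_lookup(const struct md_page *p, uint64_t seq)
{
	assert(p != NULL);
	const struct md_slot *s = &p->slots[seq % MD_SLOTS];
	return (s->seq == seq ? s : NULL);
}
```

If a signal handler makes its own system calls between the interrupted call and the lookup, those calls land in other slots (or overwrite this one, which the sequence check detects) instead of being mistaken for the call of interest.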
Does this seem sensible?
Sure, this seems sensible. Sorry, I didn't know the Boston meeting context when I wrote it.
Commit 96ec5f8cf99ddbdc6d401982872bca7c4f361443 adds test cases for various metaio(2) system calls to confirm that I/O on various file-descriptor types returns the correct underlying object UUID.
Marking as 'feedback' to seek feedback from @trombonehero as he experiments with these APIs. To use the features, he will need to compile a kernel that contains at least options AUDIT, options KDTRACE_HOOKS, and options METAIO, and utilise the DTrace audit provider to monitor ar_arg_objuuid1/ar_arg_objuuid2/ar_ret_objuuid1/ar_ret_objuuid2 (where argument and returned object UUIDs are stored) and ar_arg_metaio (which is where user-submitted metaio state ends up). This should be sufficient to track data provenance across the sample cp_metaio(1) command.
This simple (boring) test script, combined with the test program cp_metaio(1), should show off a combination of argument/return UUIDs and argument metaio:
#pragma D option quiet
#define ARG_UPATH1 0x0000000002000000ULL
#define ARG_UPATH2 0x0000000004000000ULL
#define ARG_OBJUUID1 0x0080000000000000ULL
#define ARG_OBJUUID2 0x0100000000000000ULL
#define ARG_METAIO 0x0400000000000000ULL
#define RET_OBJUUID1 0x0000000000000001ULL
#define RET_OBJUUID2 0x0000000000000002ULL
#define ARG_HAS_UPATH1(ar) ((ar)->ar_valid_arg & ARG_UPATH1)
#define ARG_HAS_OBJUUID1(ar) ((ar)->ar_valid_arg & ARG_OBJUUID1)
#define ARG_HAS_UPATH2(ar) ((ar)->ar_valid_arg & ARG_UPATH2)
#define ARG_HAS_OBJUUID2(ar) ((ar)->ar_valid_arg & ARG_OBJUUID2)
#define ARG_HAS_METAIO(ar) ((ar)->ar_valid_arg & ARG_METAIO)
#define RET_HAS_OBJUUID1(ar) ((ar)->ar_valid_ret & RET_OBJUUID1)
#define RET_HAS_OBJUUID2(ar) ((ar)->ar_valid_ret & RET_OBJUUID2)
audit:::commit
/execname == "cp_metaio" &&
    (ARG_HAS_OBJUUID1(args[1]) || ARG_HAS_OBJUUID2(args[1]) ||
    ARG_HAS_METAIO(args[1]) || RET_HAS_OBJUUID1(args[1]) ||
    RET_HAS_OBJUUID2(args[1]))/
{
	printf("%s:%s:%s:%s:\n", probeprov, probemod, probefunc, probename);
	printf(" path1: %s\n", ARG_HAS_UPATH1(args[1]) ?
	    args[1]->ar_arg_upath1 : "-");
	printf(" arg1: %s\n", ARG_HAS_OBJUUID1(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
	printf(" path2: %s\n", ARG_HAS_UPATH2(args[1]) ?
	    args[1]->ar_arg_upath2 : "-");
	printf(" arg2: %s\n", ARG_HAS_OBJUUID2(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
	printf(" metaio: %s\n", ARG_HAS_METAIO(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_metaio.mio_uuid) : "-");
	printf(" ret1: %s\n", RET_HAS_OBJUUID1(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
	printf(" ret2: %s\n", RET_HAS_OBJUUID2(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
}
And should be executed using: sudo dtrace -Cs metaio.d
It should return sequences such as the following:
audit:event:aue_openat_rwtc:commit:
path1: /usr/home/robert/foo/bar
arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
path2: -
arg2: -
metaio: -
ret1: c88b040e-e1e7-4653-a7e1-cccd93462876
ret2: -
audit:event:aue_openat_rwtc:commit:
path1: /usr/home/robert/foo/bar.1
arg1: 455848c2-8218-4859-9882-c3b4394845fb
path2: -
arg2: -
metaio: -
ret1: 455848c2-8218-4859-9882-c3b4394845fb
ret2: -
audit:event:aue_read:commit:
path1: -
arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
path2: -
arg2: -
metaio: -
ret1: -
ret2: -
audit:event:aue_write:commit:
path1: -
arg1: 455848c2-8218-4859-9882-c3b4394845fb
path2: -
arg2: -
metaio: c88b040e-e1e7-4653-a7e1-cccd93462876
ret1: -
ret2: -
audit:event:aue_read:commit:
path1: -
arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
path2: -
arg2: -
metaio: -
ret1: -
ret2: -
audit:event:aue_close:commit:
path1: -
arg1: c88b040e-e1e7-4653-a7e1-cccd93462876
path2: -
arg2: -
metaio: -
ret1: -
ret2: -
audit:event:aue_close:commit:
path1: -
arg1: 455848c2-8218-4859-9882-c3b4394845fb
path2: -
arg2: -
metaio: -
ret1: -
ret2: -
In which the contents of the file bar are copied into the file bar.1, with cp_metaio(1) propagating input metadata, retrieved on metaio_read(2), to the output via metaio_write(2).
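The propagation pattern in that trace (metadata obtained on each read is handed to the matching write) can be sketched in userspace as follows. The `md_read_stub` and `md_write_stub` functions are stand-ins written for this example: they wrap plain read(2) and write(2) and fake the UUID, and do not reflect the real metaio(2) signatures, which are not shown in this thread.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical metadata record carrying one source UUID. */
struct md {
	unsigned char uuid[16];
};

/* Stand-in for a metadata-returning read: fakes the source UUID. */
static ssize_t
md_read_stub(int fd, void *buf, size_t n, struct md *out)
{
	memset(out->uuid, 0x11, sizeof(out->uuid));
	return (read(fd, buf, n));
}

/* Remembers the last metadata passed in, so the example can inspect it. */
static struct md last_written_md;

/* Stand-in for a metadata-accepting write. */
static ssize_t
md_write_stub(int fd, const void *buf, size_t n, const struct md *in)
{
	last_written_md = *in;
	return (write(fd, buf, n));
}

/* Copy in to out, attaching each read's metadata to the writes it feeds. */
static int
copy_with_md(int in, int out)
{
	char buf[4096];
	struct md md;
	ssize_t n;

	while ((n = md_read_stub(in, buf, sizeof(buf), &md)) > 0) {
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = md_write_stub(out, buf + off,
			    (size_t)(n - off), &md);
			if (w <= 0)
				return (-1);
			off += w;
		}
	}
	return (n < 0 ? -1 : 0);
}
```

This matches the shape of the audit records above: the read event carries the source object UUID as an argument, and the write event carries the destination UUID as an argument plus the source UUID as submitted metaio.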
Trying again with options METAIO properly enabled...
In principle, failing to compile in options METAIO should lead to cp_metaio(1) failing with a missing system call, as we currently return ENOSYS in the event that the option is not present. However, it could be that we also need to set the second return-value register so that the caller detects this...
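From the caller's side, detecting the missing option comes down to checking for ENOSYS after the call fails. The sketch below shows one possible policy (fall back to a plain write and report that no metadata was submitted); the function-pointer types and stub functions are hypothetical stand-ins so that the example runs without the real system calls, and whether cp_metaio(1) should fall back or fail outright is a separate decision.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical signatures; md is left opaque. */
typedef ssize_t (*md_write_fn)(int, const void *, size_t, const void *);
typedef ssize_t (*plain_write_fn)(int, const void *, size_t);

/* Stub: behaves like a kernel built without the option. */
static ssize_t
stub_md_write_enosys(int fd, const void *buf, size_t n, const void *md)
{
	(void)fd; (void)buf; (void)n; (void)md;
	errno = ENOSYS;
	return (-1);
}

/* Stub: behaves like a kernel with the option compiled in. */
static ssize_t
stub_md_write_ok(int fd, const void *buf, size_t n, const void *md)
{
	(void)fd; (void)buf; (void)md;
	return ((ssize_t)n);
}

/* Stub: stands in for plain write(2). */
static ssize_t
stub_plain_write(int fd, const void *buf, size_t n)
{
	(void)fd; (void)buf;
	return ((ssize_t)n);
}

/* Try the metadata-aware call; on ENOSYS, fall back and say so. */
static ssize_t
write_with_md_fallback(md_write_fn mdw, plain_write_fn pw, int fd,
    const void *buf, size_t n, const void *md, int *used_md)
{
	assert(used_md != NULL);
	ssize_t r = mdw(fd, buf, n, md);
	if (r == -1 && errno == ENOSYS) {
		*used_md = 0;
		return (pw(fd, buf, n));
	}
	*used_md = 1;
	return (r);
}
```

The `used_md` flag matters for provenance: a silent fallback would produce audit records with no metaio field, which looks the same as a write with no known source.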
Success:
audit:event:aue_write:commit:
path1: -
arg1: 2d788e23-ebe7-ae52-a7eb-213f42aef351
path2: -
arg2: -
metaio: 27a824e1-f4c1-fa59-81f4-c70b59fabd28
ret1: -
ret2: -
Removing feedback label for now: the current approach with manual C munging seems to work well, and I have an approach for reproducing it with LLVM. Will tag commits as they come...
Woohoo, etc!
In theory, the above commits should implement what we need to instrument applications like cp(1) that only require intraprocedural information flow tracking. I will test this theory... likely on Monday.
There are a few bugs in the above DTrace script, which incorrectly uses argument instead of return object UUIDs in a few places. This is a corrected version, which also reports returned metaio state:
#pragma D option quiet
#define ARG_UPATH1 0x0000000002000000ULL
#define ARG_UPATH2 0x0000000004000000ULL
#define ARG_OBJUUID1 0x0080000000000000ULL
#define ARG_OBJUUID2 0x0100000000000000ULL
#define ARG_METAIO 0x0400000000000000ULL
#define RET_OBJUUID1 0x0000000000000001ULL
#define RET_OBJUUID2 0x0000000000000002ULL
#define RET_METAIO 0x0000000000000040ULL
#define ARG_HAS_UPATH1(ar) ((ar)->ar_valid_arg & ARG_UPATH1)
#define ARG_HAS_OBJUUID1(ar) ((ar)->ar_valid_arg & ARG_OBJUUID1)
#define ARG_HAS_UPATH2(ar) ((ar)->ar_valid_arg & ARG_UPATH2)
#define ARG_HAS_OBJUUID2(ar) ((ar)->ar_valid_arg & ARG_OBJUUID2)
#define ARG_HAS_METAIO(ar) ((ar)->ar_valid_arg & ARG_METAIO)
#define RET_HAS_OBJUUID1(ar) ((ar)->ar_valid_ret & RET_OBJUUID1)
#define RET_HAS_OBJUUID2(ar) ((ar)->ar_valid_ret & RET_OBJUUID2)
#define RET_HAS_METAIO(ar) ((ar)->ar_valid_ret & RET_METAIO)
audit:::commit
/execname == "shmtest" &&
    (ARG_HAS_OBJUUID1(args[1]) || ARG_HAS_OBJUUID2(args[1]) ||
    ARG_HAS_METAIO(args[1]) || RET_HAS_OBJUUID1(args[1]) ||
    RET_HAS_OBJUUID2(args[1]) || RET_HAS_METAIO(args[1]))/
{
	printf("%s:%s:%s:%s:\n", probeprov, probemod, probefunc, probename);
	printf(" %x %x\n", args[1]->ar_valid_arg, args[1]->ar_valid_ret);
	printf(" path1: %s\n", ARG_HAS_UPATH1(args[1]) ?
	    args[1]->ar_arg_upath1 : "-");
	printf(" arg_objuuid1: %s\n", ARG_HAS_OBJUUID1(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid1) : "-");
	printf(" path2: %s\n", ARG_HAS_UPATH2(args[1]) ?
	    args[1]->ar_arg_upath2 : "-");
	printf(" arg_objuuid2: %s\n", ARG_HAS_OBJUUID2(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_objuuid2) : "-");
	printf(" arg_metaio: %s\n", ARG_HAS_METAIO(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_arg_metaio.mio_uuid) : "-");
	printf(" ret_objuuid1: %s\n", RET_HAS_OBJUUID1(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_ret_objuuid1) : "-");
	printf(" ret_objuuid2: %s\n", RET_HAS_OBJUUID2(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_ret_objuuid2) : "-");
	printf(" ret_metaio: %s\n", RET_HAS_METAIO(args[1]) ?
	    uuidtostr((intptr_t)&args[1]->ar_ret_metaio.mio_uuid) : "-");
}
It may be useful to allow system calls to accept and return semi-arbitrary metadata via some out-of-band mechanism à la errno.