minad commented 5 years ago

vDSO calls might not go through seccomp; see http://man7.org/linux/man-pages/man2/seccomp.2.html. This probably does not increase the attack surface, since the kernel is never entered. However, it is still noteworthy, since information is exposed by the kernel. Should this be added to the documentation?

Maybe this is a non-issue, since the auxv entry pointing to the vDSO cannot be accessed? However, the vDSO is still mapped in the address space.
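
For reference, a minimal sketch of how a process locates the vDSO through the auxiliary vector. It assumes Linux with a libc that provides `getauxval()`; the helper names `vdso_base` and `vdso_is_mapped_elf` are hypothetical and not part of Solo5.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns the vDSO base address advertised in the auxiliary vector
 * (AT_SYSINFO_EHDR), or NULL if the kernel did not provide one. */
static const void *vdso_base(void)
{
    return (const void *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the memory at base starts with the ELF magic, i.e. the
 * vDSO image is mapped and readable in this process; 0 otherwise. */
static int vdso_is_mapped_elf(const void *base)
{
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

Even when code has no way to read the auxv entry, the image itself stays mapped until it is explicitly unmapped.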

I also wonder if using the vdso for clock_gettime might make sense to reduce the overhead. Right now the spt binding just uses the clock_gettime syscall.
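
To make the difference concrete, here is a small sketch comparing the two paths. It assumes Linux on x86_64, where the libc `clock_gettime()` wrapper normally resolves to the vDSO implementation; the function names are hypothetical and this is not the spt binding code.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic time in nanoseconds via the libc wrapper. On typical
 * Linux/x86_64 systems this is served by the vDSO without entering the
 * kernel, so a seccomp filter does not see the call. */
static int64_t now_ns_libc(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* The same clock read through a real system call, which does enter the
 * kernel and is therefore subject to the seccomp filter. */
static int64_t now_ns_syscall(void)
{
    struct timespec ts;
    if (syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Both functions read the same clock, so their results are directly comparable; only the second one pays the cost of a kernel entry.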

mato commented 5 years ago

Yeah, I know. My plan here is to try and see if I can get the spt tender to unmap everything from its process address space before jumping into the guest. That would include all of the (tender's) libc and vdso. This requires an asm trampoline inside spt and some caching of the eBPF filter to work correctly (since once you munmap() libc and friends you're on your own). This is also why I've not bothered with clock_gettime via vdso.

I'll look into this in the coming weeks.

minad commented 5 years ago

I would try to avoid libc in the solo5 spt tender if feasible. This would avoid the complications of unloading stuff. In my prototype in #343 I am avoiding libc. However I am doing a bit less there (no elf image loading). I am using vdso for clock_gettime and bpf directly (ebpf cannot be used for seccomp yet unfortunately).

mato commented 5 years ago

There are other reasons for keeping the tender for spt, and I don't expect that to go away any time soon. Among other things, the recently merged build changes open the door to sharing code between tenders, which is also useful.

Using libc in the tender is not complicated; it just requires careful "planning" before launching the guest.

mato commented 3 years ago

#479 opens up a path to fix this, since we now have the generated BPF seccomp filter available and load it directly. Rough sketch of how it could be done:

We need to determine how best to unmap "as much as is practical" without jumping through too many hoops. One option would be to parse /proc/self/maps or use dl_iterate_phdr(), with the goal of unmapping all shared objects. However, we can rely on the documented behaviour of munmap(addr, size), which is that it will unmap all pages from all mappings in the range (addr ... addr+size).
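
For comparison, the /proc/self/maps option could look roughly like the sketch below. The `struct map_entry` type and the function names are hypothetical; the line format assumed is the one documented in proc(5).

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/self/maps (hypothetical helper type). */
struct map_entry {
    unsigned long start;   /* first address of the mapping */
    unsigned long end;     /* one past the last address */
    char perms[5];         /* e.g. "r-xp" */
    char path[256];        /* "" for anonymous mappings */
};

/* Parses one maps line of the form
 *   start-end perms offset dev inode [path]
 * Returns 1 on success, 0 if the line is malformed. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    int n;

    e->path[0] = '\0';
    n = sscanf(line, "%lx-%lx %4s %*s %*s %*s %255[^\n]",
               &e->start, &e->end, e->perms, e->path);
    return n >= 3 && e->start < e->end;
}

/* Looks up the first mapping whose path equals name (e.g. "[vdso]").
 * Returns 1 if found, 0 if not found, -1 if the file cannot be read. */
static int find_mapping(const char *name, struct map_entry *out)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (f == NULL)
        return -1;
    while (!found && fgets(line, sizeof line, f) != NULL)
        found = parse_maps_line(line, out) && strcmp(out->path, name) == 0;
    fclose(f);
    return found;
}
```

This works, but it needs file I/O and string parsing in the tender, which is part of what the single-range munmap() approach avoids.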

spt_run() can compute a suitable page-aligned range from (&_end ... %rsp). This range covers everything from the end of the tender executable as loaded by the kernel to the current stack, which, given the internals of Linux memory layout, covers all the shared objects, vdso, and ld.so itself, at least on x86_64. It would then pass that on to spt_launch() along with the BPF filter.

spt_launch() would have to be modified to do exactly three things in very careful C with direct syscall invocation and/or hand-written assembly, aborting if any step failed:
- munmap(addr, size) - unmaps everything from the process except the tender code, data, stack and guest memory in a single call.
- seccomp(...) - loads the seccomp filter.
- transfer control to the guest.
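
The range computation described above can be sketched as follows. This is only an illustration under assumptions: `struct unmap_range` and `compute_unmap_range` are hypothetical names, the rounding policy (start rounded up, end rounded down) is one possible choice, and the sketch does not address how guest memory is kept out of the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open, page-aligned address range [addr, addr + size). */
struct unmap_range {
    uintptr_t addr;
    size_t size;
};

/* Computes the largest page-aligned range between the end of the tender
 * image (image_end, e.g. the address of _end) and the current stack
 * pointer (sp). The start is rounded up and the end is rounded down, so
 * neither the last page of the image nor the page holding sp is included.
 * Returns 0 on success, -1 if no whole page fits in between. */
static int compute_unmap_range(uintptr_t image_end, uintptr_t sp,
                               uintptr_t page_size, struct unmap_range *out)
{
    /* The masks below require a non-zero power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t lo = (image_end + page_size - 1) & ~(page_size - 1);
    uintptr_t hi = sp & ~(page_size - 1);

    if (lo < image_end || hi <= lo)   /* wrapped around, or empty range */
        return -1;
    out->addr = lo;
    out->size = (size_t)(hi - lo);
    return 0;
}

/* Intended order inside a modified spt_launch(), per the list above:
 *   1. munmap(range.addr, range.size)
 *   2. seccomp(...) with the pre-generated BPF filter
 *   3. transfer control to the guest
 * After step 1 libc is gone, so steps 2 and 3 cannot call into it. */
```

Keeping the arithmetic in a pure function means it can be checked in isolation before the irreversible munmap() call is made.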


spt - vdso calls might not be filtered #350
