cloudius-systems / osv

OSv, a new operating system for the cloud.

Add mechanism to log syscalls #1261

Closed wkozaczuk closed 1 year ago

wkozaczuk commented 1 year ago

With upcoming changes to OSv to support running statically linked executables, or dynamically linked executables bootstrapped by the Linux dynamic linker, it would be nice to have a mechanism to log which syscalls are called, with what parameters, and what result they return.

This could be as simple as what this patch fragment does:

-#define SYSCALL2(fn, __t1, __t2)            \
+#define SYSCALL2(fn, __t1, __t2, __fmt)     \
         case (__NR_##fn): do {              \
         va_list args;                       \
         __t1 arg1;                          \
@@ -252,6 +253,7 @@ long long_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t off
         arg1 = va_arg(args, __t1);          \
         arg2 = va_arg(args, __t2);          \
         va_end(args);                       \
+        tprintf("syscall", logger_debug, %s " __fmt "\n", #fn, arg1, arg2); \
         return fn(arg1, arg2);              \
         } while (0)

@@ -490,19 +492,19 @@ OSV_LIBC_API long syscall(long number, ...)

     switch (number) {
 #ifdef SYS_open
-    SYSCALL2(open, const char *, int);
+    SYSCALL2(open, const char *, int, "pathname:%s, flags:%d");
 #endif

Ideally, we should make it so that syscall performance is not affected when syscall logging is disabled. Maybe the logging code would only be compiled in based on some #ifdef variable instead of being enabled dynamically by a new kernel flag.
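
As a rough illustration of the compile-time approach (the CONF_syscall_trace flag and the LOG_SYSCALL helper below are hypothetical names, not existing OSv symbols), the logging call could be compiled out like this:

    // Hypothetical compile-time switch; CONF_syscall_trace is illustrative only.
    #ifdef CONF_syscall_trace
    #define LOG_SYSCALL(fn, __fmt, ...) \
        tprintf("syscall", logger_debug, "%s " __fmt "\n", #fn, ##__VA_ARGS__)
    #else
    // Expands to nothing, so a build without the flag pays no runtime cost.
    #define LOG_SYSCALL(fn, __fmt, ...) do {} while (0)
    #endif

    // SYSCALL2 from the patch above would then call
    //     LOG_SYSCALL(fn, __fmt, arg1, arg2);
    // instead of invoking tprintf() unconditionally.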

Another alternative would be to use tracepoints, but I am afraid adding ~150 new tracepoints (or 300 including the *_ret ones) would increase the kernel size significantly and affect performance.

nyh commented 1 year ago

Tracepoints are explicitly designed to have near-zero performance impact when not enabled (each one adds a single nop instruction). I don't remember how much they increase the kernel size - I don't think it's much, but you can try adding 150 silly tracepoint calls and measure. You can also perhaps make it just one tracepoint called from 150 different places.
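
For illustration, a per-syscall tracepoint could look roughly like the sketch below, using the TRACEPOINT macro from include/osv/trace.hh; the trace_syscall_open names and the traced_open wrapper are made up for this example:

    #include <osv/trace.hh>
    #include <fcntl.h>

    // Illustrative only: one tracepoint for the call and one for its return value.
    TRACEPOINT(trace_syscall_open, "pathname=%s flags=%d", const char*, int);
    TRACEPOINT(trace_syscall_open_ret, "ret=%ld", long);

    static long traced_open(const char* pathname, int flags)
    {
        trace_syscall_open(pathname, flags);  // a disabled tracepoint is just a nop patch site
        long ret = open(pathname, flags);
        trace_syscall_open_ret(ret);
        return ret;
    }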

wkozaczuk commented 1 year ago

Putting aside any impact on kernel size, one issue I have with tracepoints is that one cannot easily see the results they capture as things happen, which I can do with strace. One has to connect with gdb to see things after the fact, and by then some tracepoints may be gone - I understand they use some sort of ring buffer, so new tracepoints eventually roll over the old ones, no?

It would be nice to enhance tracepoints by maybe adding a thread that could print them to standard out as they happen and flush them right away.
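
Very roughly, I imagine something like the following sketch; drain_trace_buffer() is just a placeholder for whatever would read the trace buffer - no such helper exists in OSv today:

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <thread>

    // Placeholder: a real version would copy out and clear whatever tracepoint
    // records accumulated since the previous call, formatted as text lines.
    static std::string drain_trace_buffer() { return {}; }

    static void start_trace_flusher()
    {
        std::thread([] {
            for (;;) {
                auto text = drain_trace_buffer();
                if (!text.empty()) {
                    fputs(text.c_str(), stdout);
                    fflush(stdout);   // flush right away, as suggested above
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
        }).detach();
    }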

nyh commented 1 year ago

The idea with tracepoints is that they let you do three things efficiently:

  1. If a tracepointed event is fairly rare, it allows you to see if it has happened, and when and how, in that buffer.
  2. If, on the other hand, it is very common, you can count the tracepoint very efficiently, without saving the events (we even have a script to do this, and it doesn't need gdb).
  3. If the event is very common and you still need examples, you probably can't hope to see all the events (the full trace would be huge), so seeing a buffer of the last events can be useful too.

A buffer of the last events is especially useful when debugging a crash, so the last events are probably (maybe) the ones that caused the crash.

nyh commented 1 year ago

Though I admit that for fairly rare events, adding an strace-like command can be useful.

nyh commented 1 year ago

By the way, in case you haven't noticed it (I think that you did, but just in case...) the script I mentioned above is scripts/freq.py, which uses the REST API (not gdb) to connect to OSv and get the frequency of certain tracepoints. For example - get the number of context switches per second, number of system calls per second, and so on. It's fine that these numbers are huge (e.g., 1 million per second) - it's still very efficient. The script enables counting on the requested tracepoints, and then periodically reads those counts.

We also have in the REST API (see modules/httpserver-api/api-doc/listings/trace.json) the ability to enable full/random/backtraced tracing of events and retrieve the full trace buffer - so if the http server is included in the image, you don't need to use gdb. I think (?) we have in scripts/trace.py code to parse the content of this buffer (I think you yourself documented this script, in https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py).