draios / sysdig-inspect

Sysdig Inspect - A powerful opensource interface for container troubleshooting and security investigation
https://www.sysdig.org/
Other
970 stars 69 forks source link

'SYSCALLS' view doesn't show all syscalls #41

Open zoehneto opened 6 years ago

zoehneto commented 6 years ago

When viewing the syscalls from a capture file in sysdig-inspect there are many syscalls missing. For example I can see the first 6800 syscalls and then it suddenly jumps to syscall 30400. If I run the csysdig command used by sysdig-inspect in the terminal and pipe the result to a file, I can see that there are many relevant syscalls between 6800 and 30400 which are not visible in sysdig-inspect.

The executed command is:

csysdig -r <capture-file> -v dig -j -pc "evt.type != switch"
davideschiera commented 6 years ago

@zoehneto that's correct, all switch events are not exported in Sysdig Inspect (as well as in csysdig). The reason is that Sysdig Inspect shows all the system calls, and switch is not a system call. You might find more information about it at https://sysdig.com/blog/interpreting-sysdig-output/ (The Switch Event section).

To confirm, you can try running sysdig on your capture to list only switch events and verify the missing entries in Sysdig Inspect.

Does this help?

zoehneto commented 6 years ago

I'm aware that switch events aren't displayed by default, since they are filtered out by the filter expression. But it is not the switch events I'm missing, it is actual syscalls. In my case this means that all syscalls for one process are missing in sysdig-inspect, while they are included in the command line output.

davideschiera commented 6 years ago

Could you provide a sample capture file and a list of system calls you are missing?

zoehneto commented 6 years ago

I can't share my own capture files, but it is reproducible with the sample 404Error.scap from the repo:

If I open the capture file and click on 'SYSCALLS' in the lower left hand corner I get pretty much all syscalls upto and including syscall 1414, but after that the next syscall is 2002:

 ...
 1412 23:07:23.868251252 0 host (host) /usr/bin/termin (2452:2452) > recvmsg fd=4(<u>) 
 1413 23:07:23.868251589 0 host (host) /usr/bin/termin (2452:2452) < recvmsg res=-11(EAGAIN) size=4096 data= tuple=NULL 
 1414 23:07:23.868254880 0 host (host) /usr/bin/termin (2452:2452) > poll fds=5:e1 4:u1 9:p3 3:u1 18:p1 20:f1 30:f1 28:f1 32:f1 14:f1 26:f1 34:f1 24:f1 22:f1 41:f1 36:f1 39:f1 timeout=14 
 2002 23:07:23.883110597 1 host (host) sublime_text (2523:2523) < poll res=0 fds= 
 2004 23:07:23.883118625 1 host (host) sublime_text (2523:2523) > recvmsg fd=3(<u>) 
 2005 23:07:23.883119486 1 host (host) sublime_text (2523:2523) < recvmsg res=-11(EAGAIN) size=4096 data= tuple=NULL 
 ...

According to the console output of sysdig inspect, it is running the command:

spawning /usr/lib/sysdig-inspect/resources/app/ember-electron/resources/sysdig/csysdig with args: -r,/home/zoehneto/Downloads/404Error.scap,-v,dig,-j,-pc,evt.type != switch

If I then run this myself as:

/usr/lib/sysdig-inspect/resources/app/ember-electron/resources/sysdig/csysdig -r /home/zoehneto/Downloads/404Error.scap -v dig -j -pc "evt.type != switch" > test.txt

I get the following output, which includes plenty of syscall between 1414 and 2002 (newlines added by me):

"1412 23:07:23.868251252 0 host (host) /usr/bin/termin (2452:2452) > recvmsg fd=4(<u>) ",
"1413 23:07:23.868251589 0 host (host) /usr/bin/termin (2452:2452) < recvmsg res=-11(EAGAIN) size=4096 data= tuple=NULL ",
"1414 23:07:23.868254880 0 host (host) /usr/bin/termin (2452:2452) > poll fds=5:e1 4:u1 9:p3 3:u1 18:p1 20:f1 30:f1 28:f1 32:f1 14:f1 26:f1 34:f1 24:f1 22:f1 41:f1 36:f1 39:f1 timeout=14 ",
"1415 23:07:23.868256579 0 host (host) /usr/bin/termin (2452:2452) < poll res=1 fds=5:e1 ",
"1416 23:07:23.868270321 0 host (host) /usr/bin/termin (2452:2452) > recvmsg fd=4(<u>) ",
"1417 23:07:23.868270678 0 host (host) /usr/bin/termin (2452:2452) < recvmsg res=-11(EAGAIN) size=4096 data= tuple=NULL ",
"1418 23:07:23.868273405 0 host (host) /usr/bin/termin (2452:2452) > poll fds=5:e1 4:u1 9:p3 3:u1 18:p1 20:f1 30:f1 28:f1 32:f1 14:f1 26:f1 34:f1 24:f1 22:f1 41:f1 36:f1 39:f1 timeout=14 ",
"1419 23:07:23.868275143 0 host (host) /usr/bin/termin (2452:2452) < poll res=1 fds=5:e1 ",
"1420 23:07:23.868275466 0 host (host) /usr/bin/termin (2452:2452) > read fd=5(<e>) size=16 ",
davideschiera commented 6 years ago

What version of Inspect are you using? And what OS?

zoehneto commented 6 years ago

I'm using version 0.3.1 on arch linux.

davideschiera commented 6 years ago

The following links are for the latest build on dev for the upcoming 0.3.2. I could reproduce the same issue you see on Mac with version 0.3.1 but not with 0.3.2. Would you mind giving it a try and see if it is solved for you too?

https://download.sysdig.com/dev/sysdig-inspect/dev/installers/linux/sysdig-inspect_0.3.2_amd64.deb https://download.sysdig.com/dev/sysdig-inspect/dev/installers/linux/sysdig-inspect-0.3.2.x86_64.rpm

Keep in mind that these are dev builds, the final release might change. Thanks!

zoehneto commented 6 years ago

It is indeed fixed in the dev build.

The view now seems to be limited to 3000 entries (probably for performance reasons) and while I can get the next entries by moving the time slider at the bottom, it would be really helpful if there was any visual indication that there are more syscall which are currently not displayed. Maybe also add a button which would allow to switch directly to the next 3000 syscalls, since scrubbing to the exact right point is quite difficult (especially with larger files, where it can be quite slow).

davideschiera commented 6 years ago

Glad to hear that it works! Or at least the behavior is more predictable. My thinking is that the original bug was causing some entries to be replaced by following ones (e.g. 1200-1300 replaced by 2000-2100 while parsing data).

And yes, I agree, an indicator and a way to show following "pages" would be useful. That might require a little bit of work that might not go in the 0.3.2.

I'd also add that the ideal workflow of Inspect is not going through the raw list of system calls, which is where sysdig on command line shines, but rather using higher level views (processes, files, etc.) to remove "noise" and eventually go to the raw list of events when no views would give meaningful information. Doing so, I wonder if the limit of 3,000 events would still be blocking (for most of the cases, I mean). It'd be very useful to know more about your use case!

I'd keep the issue open to address the lack of visual pagination. Thanks for reporting the issue!

zoehneto commented 6 years ago

My personal usecase is that I am building a machine learning model around the data in the trace and the syscall view allows me to get at the raw data to see whats really there, in what order syscalls are executed and what their parameters are. It can then be necessary to skip through many thousands of lines when I know an event I'm interested in happens a couple of seconds into the trace, but I don't know exactly when and other processes are quite noisy.

Obviously the other views are very helpful and many things can also be achieved with filters, but sometimes it is just nicer to get directly at the raw data without aggregation, with all the information available and in the concrete time context.

Anyways, thanks very much for this tool, it makes exploring the data sysdig provides just a lot easier and faster :)

davideschiera commented 6 years ago

@zoehneto The issue should be fixed in version 0.3.3.

As mentioned, I'll keep the issue open specifically for the lack of indication that only the first 3k entries are displayed and way to see more.