draios / sysdig

Linux system exploration and troubleshooting tool with first class support for containers
http://www.sysdig.com/

Sysdig with script introspection #1421

Open kristopolous opened 5 years ago

kristopolous commented 5 years ago

I've had an idea for a long time. I know sysdig is orders of magnitude larger than when I last seriously engaged with it, but I don't see any evidence that the idea has been implemented. I'd love to build it, but if you already have it, even better!

I've thought about integrating with scripting debuggers and then interleaving their timestamps with sysdig's. Is there work on this front? Essentially it would be pdb in Python, Xdebug in PHP, inspect in Node, etc.

The problem I've always run into is that I live 99% in the interpreted world, and there's always been an impenetrable wall between system issues and the script lines behind them. Correlating the two directly has never gone beyond guesses and hunches.

So what would this look like? A few things, let's go over the 3 big visions.

the grand unified tracefile

What if I could set breakpoints and watches using some kind of sysdig logic that engages with the debugging subsystems of the languages? Or what if I wanted to do statistical correlation to find where a problem is? I'd want to record where the code is executing when an event happens and then go from there. All these processes use the same system clock, so after-the-fact analysis would be fine as long as we're careful with the timestamping.
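A minimal sketch of the timestamp-interleaving idea, in Python only. It assumes the system event arrives with a wall-clock timestamp in nanoseconds since the epoch (I believe sysdig exposes this, but treat that as an assumption), and it uses `sys.settrace` as a stand-in for a real debugger hook. The names `record` and `line_at` are hypothetical helpers, not part of sysdig or pdb.

```python
import bisect
import sys
import time

# (timestamp_ns, filename, lineno), appended in execution order
samples = []


def _tracer(frame, event, arg):
    # Stamp every executed line with the wall clock.
    if event == "line":
        samples.append((time.time_ns(), frame.f_code.co_filename, frame.f_lineno))
    return _tracer


def record(func, *args, **kwargs):
    """Run func with line tracing enabled and return its result."""
    sys.settrace(_tracer)
    try:
        return func(*args, **kwargs)
    finally:
        sys.settrace(None)


def line_at(event_ns):
    """Return the last sample recorded at or before event_ns, or None."""
    timestamps = [s[0] for s in samples]
    i = bisect.bisect_right(timestamps, event_ns) - 1
    return samples[i] if i >= 0 else None
```

Given a system event timestamp `t`, `line_at(t)` answers "which script line was running just before this event?". Tracing every line is expensive, so a real implementation would presumably sample or filter.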

A kind of deep introspection hook, built on some as-yet-undescribed hack with the scripting debugger and potentially with two-way instrumentation between sysdig and the hacked debugger, would be able to achieve this, raise alerts, or do other things.

Another use case is the very large library of dependencies in, say, a Node project. Many framework-dependent Node projects have 500 modules. One project I'm working on has a sloccount of over 1.5 million lines in its node_modules directory. Maybe I'm just lazy, but there's no way I'm going to study all that in order to infer the cause of a problem through intuition.

However, something like this would let you shave away the many underlying layers of framework hocus pocus when there's a real-world problem that needs addressing, instead of spending all your time on Stack Overflow and GitHub comment threads.

the interactive every process debugger

As long as there's an interactive REPL, it also avoids the static analysis problem that many of these dynamic languages with reflection have (a fundamental inability to do reasonable code navigation), since you'd be able to essentially replay resource allocation and utilization while stepping through the code in a unified capture file.

I'm imagining a system with different kinds of "injection modules" for languages that work similarly to Emacs GUD. They would keep rotating in-memory buffers capable of ex-post-facto snapshots, so you can get the context of an event after a trigger has hit. Someone can then analyze it at their leisure and set breaks if they want, either from sysdig or from their debugger of choice.

It can get wilder. On many systems there's a bunch of code running. Imagine using sysdig as the instrumenter to find the offending code, then using the injection interface to drop into a REPL on that code. We've moved from process debugger to system debugger.
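The rotating buffer with after-the-fact snapshots could look roughly like this. It is a sketch only: `FlightRecorder` is a hypothetical name, and the entries are whatever the injection module decides to record (for example, timestamped file and line pairs).

```python
from collections import deque


class FlightRecorder:
    """Keep the most recent entries and freeze a copy when a trigger fires."""

    def __init__(self, capacity=1000):
        # deque with maxlen silently drops the oldest entry once full
        self._buffer = deque(maxlen=capacity)
        self.snapshots = []

    def record(self, entry):
        self._buffer.append(entry)

    def trigger(self, reason):
        # Copy the buffer so later records do not alter the snapshot.
        snapshot = {"reason": reason, "context": list(self._buffer)}
        self.snapshots.append(snapshot)
        return snapshot
```

The bounded buffer keeps memory use fixed while the process runs normally, and `trigger` is the point where sysdig (or the debugger) would say "something just happened, keep what led up to it".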

the magic heatmap

Imagine a heatmap where the first column is your requests, class names, top-level functions, events, or whatever fits your design, and every subsequent column is a different metric. Step 2 is that the rows become tree-view controls: you can expand a node and look into what it invokes to find the issues.

Now you can use this system of a tree-viewed heatmap to set the triggers described above so you can do your analysis of choice relative to the problem at hand.
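One way to model the tree-viewed heatmap is a call tree where each node carries its own metrics and a row shows the total for the node plus everything it invokes. This is a sketch with made-up names (`Node`, `render`) and made-up metric keys; how the metrics get collected is left open.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    own: dict = field(default_factory=dict)       # metrics attributed directly to this node
    children: list = field(default_factory=list)  # nodes this one invokes

    def total(self, metric):
        """Own value plus the totals of everything invoked below."""
        return self.own.get(metric, 0) + sum(c.total(metric) for c in self.children)


def render(node, metrics, depth=0):
    """Flatten the tree into heatmap rows: (indented name, metric totals...)."""
    rows = [("  " * depth + node.name, *[node.total(m) for m in metrics])]
    for child in node.children:
        rows.extend(render(child, metrics, depth + 1))
    return rows
```

Collapsing a node in the UI then just means not rendering its children; the totals on the parent row stay the same, which is what lets you spot a hot row first and expand it afterwards.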

Is there an overhead cost here? Of course ... but the technology would be amazing and totally worth it.

bonus for reading this far: the spy-board

If you wanted to truly "productize" it, there may be some very clever hardware that can be added over, say, a PCI-E or M.2 slot (think ARM+NVMe or maybe something like Qualcomm's SMD) that can effectively be a "system-tap" to handle the vast majority of the overhead here. Toss an RJ-45 on it and you can even create a shadow debugging network for those cards to communicate with each other (MQTT or something).

So you have a developer sitting at her desk and she gets an alert that her heisenbug was triggered on some server somewhere. She clicks a button and is immediately dropped into a remote debug shell on the process with sysdig analysis hooks, so she can step through the code, see what happened, and fix the issue.

kristopolous commented 5 years ago

Thanks to a discussion on Slack, the core parts of these features may be covered by tracers: https://github.com/draios/sysdig/wiki/Tracers
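For reference, here is a rough sketch of emitting tracers from a script. It assumes the format described on that wiki page as I understand it: a string of the form `direction:id:tags:args:` written to `/dev/null`, with `>` for enter, `<` for exit, `t` meaning "use the thread id", dot-separated tags, and comma-separated `key=value` args. Check the wiki before relying on any of these details. The helper names (`tracer_line`, `emit`, `span`) are hypothetical.

```python
from contextlib import contextmanager


def tracer_line(direction, tracer_id, tags, args=None):
    """Build a tracer string such as '>:t:app.handler:line=42:'."""
    arg_text = ",".join(f"{k}={v}" for k, v in (args or {}).items())
    return f"{direction}:{tracer_id}:{'.'.join(tags)}:{arg_text}:"


def emit(line, sink="/dev/null"):
    # Unbuffered so the string goes out in a single write() call.
    with open(sink, "wb", buffering=0) as f:
        f.write(line.encode())


@contextmanager
def span(tags, args=None, sink="/dev/null"):
    """Emit an enter tracer, run the body, then emit the matching exit tracer."""
    emit(tracer_line(">", "t", tags, args), sink)
    try:
        yield
    finally:
        emit(tracer_line("<", "t", tags), sink)
```

Wrapping a block in `with span(["app", "handler"], {"line": 42}):` would mark its start and end in the capture. The script has to know its own position and report it, which is a long way from two-way debugger integration.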

kristopolous commented 5 years ago

After further investigation, tracers don't get me to the promised land. It's a nice tool, but it's not this.

gnosek commented 5 years ago

Another angle potentially worth investigating would be uprobes (ideally USDT-based to keep compatibility across builds without debug symbols).

Having USDT probes converted to sysdig tracers would get you close to your promised land I think :)

fntlnz commented 5 years ago

@gnosek is your suggestion that what @kristopolous described could be integrated with USDT, so that the "system debugger" is better at working out the actual path to follow when debugging a given feature?

Also, runtimes for interpreted languages, like Node, have built-in SystemTap definitions that make this kind of thing even easier: https://github.com/nodejs/node/blob/master/src/node.stp

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.