implement shadow callstack and evaluate perf

derekbruening commented 9 years ago

From bruen...@google.com on December 20, 2011 10:57:37

What is the problem to solve? Why is it important? Provide some context for those unfamiliar with the details of the system.

We would like to use faster app builds for better performance. But while we can disable inlining and FPO in cl, there's no flag to disable tailcalls, so we can still end up with missing frames, causing suppressions not to match and confusing developers. Also, while most system libraries on Windows are not built with FPO, we would like to be able to handle those that are. And ideally, if we could handle an app built with full optimizations including inlining and FPO, we'd like best-effort callstacks to be sufficient, though we're not willing to give up much perf to get that.

What are the possible approaches to solving the problem?

Xref issue #703, issue #711, issue #557. This issue covers implementing a shadow callstack and measuring the perf hit. Issue #703 suggests using a shadow callstack for malloc-intensive apps (many callstacks) and callstack walking for all other apps. We could also have runtime control (if we can't dynamically detect an app with FPO) so a user building with FPO can get callstacks (at a perf hit that might in fact outweigh the benefit of FPO, but there would still be the advantage of not needing a separate build, modulo uninit false positives from optimizations).

Which approach is being taken and why?

First, implement and measure. Then make some decisions on when to use it.

Any interesting details or challenges of the implementation?

Fairly straightforward. Implemented already in many other tools. Longjmp, SEH, etc. need to be handled, usually by storing the app sp.
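As a rough illustration of the last point, here is a minimal Python model of a shadow callstack that stores the app sp with each entry, so that a longjmp or SEH unwind can discard several frames at once. It is only a sketch: the class name, the method names, and the choice to pair every return address with the address of its stack slot are assumptions made for illustration, not how any particular tool implements it.

```python
class SpTaggedShadowStack:
    """Toy shadow callstack whose entries remember the app stack pointer."""

    def __init__(self):
        # Each entry is (app_sp, ret_addr), where app_sp is the address of the
        # app stack slot holding the return address. Outermost call is first.
        self.entries = []

    def on_call(self, app_sp, ret_addr):
        # Record the call together with where its return address lives.
        self.entries.append((app_sp, ret_addr))

    def unwind_to(self, app_sp):
        # The app stack grows down, so every entry recorded below the new
        # stack pointer belongs to a frame that no longer exists.
        while self.entries and self.entries[-1][0] < app_sp:
            self.entries.pop()

    def on_ret(self, app_sp_after_ret):
        # A normal return is just the one-frame case of unwinding.
        self.unwind_to(app_sp_after_ret)

    def callstack(self):
        # Innermost return address first.
        return [ret for _, ret in reversed(self.entries)]


stack = SpTaggedShadowStack()
stack.on_call(0xFF8, 0x401000)  # main -> f
stack.on_call(0xFD0, 0x402000)  # f -> g
stack.on_call(0xFA0, 0x403000)  # g -> h
stack.unwind_to(0xFE0)          # longjmp from h back into f's frame
print([hex(addr) for addr in stack.callstack()])
```

Comparing stack pointers rather than counting calls and returns is what lets the model resynchronize after a non-local exit, at the cost of storing two words per call instead of one.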

Original issue: http://code.google.com/p/drmemory/issues/detail?id=724

derekbruening commented 9 years ago

From rnk@google.com on December 20, 2011 09:06:24

I'd like to take ownership of this, unless anyone else has started it or spent time thinking about it.

Owner: rnk@google.com

derekbruening commented 9 years ago

From bruen...@google.com on April 04, 2012 10:18:36

xref issue #855

derekbruening commented 9 years ago

From zhao...@google.com on May 10, 2013 11:20:43

One way to implement an efficient shadow stack for callstacks:

for each thread, we allocate a shadow stack with the same size as the application stack and initialize it to all zeros.
the offset between the app stack and the shadow stack is stored in a TLS slot
on every call instruction, we put the return address onto the shadow stack, something like: mov [TLS] => r1 ; mov ret-addr => [xsp, r1, -8]
on every return, we clear the return address on the shadow stack, something like: mov [TLS] => r1 ; mov 0 => [xsp, r1]
on every callstack query, we can just scan the shadow stack to construct the callstack

The performance should be ok, as it would be 2-4 instructions for each call/return with no eflags stealing. The accuracy should be good, as the shadow stack holds only return addresses. The shadow stack can also be moved around when necessary by simply updating the [TLS]. Memory overhead might be a concern, since this doubles the stack size, but it should be fine.
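The offset-based scheme can be modeled in a few lines of Python to check the address arithmetic. This is a simulation only, not instrumentation code: the class name, the addresses, and the 8-byte slot size (a 64-bit app) are assumptions, and `offset` stands in for the value kept in TLS.

```python
WORD = 8  # bytes per stack slot; assumes a 64-bit app


class OffsetShadowStack:
    """Toy model of a shadow stack at a fixed offset from the app stack."""

    def __init__(self, app_low, size, shadow_low):
        self.shadow_low = shadow_low
        self.offset = shadow_low - app_low  # the value that would live in TLS
        self.slots = [0] * (size // WORD)   # zero-initialized shadow memory

    def _index(self, app_addr):
        # Translate an app stack address into a shadow slot index.
        shadow_addr = app_addr + self.offset
        return (shadow_addr - self.shadow_low) // WORD

    def on_call(self, xsp, ret_addr):
        # Mirrors: mov ret-addr => [xsp, r1, -8], then the app's own push.
        self.slots[self._index(xsp - WORD)] = ret_addr
        return xsp - WORD

    def on_ret(self, xsp):
        # Mirrors: mov 0 => [xsp, r1], then the app's own pop.
        self.slots[self._index(xsp)] = 0
        return xsp + WORD

    def walk(self, xsp):
        # Scan from the current stack pointer toward the stack base and keep
        # the non-zero slots. Innermost return address comes first.
        first = self._index(xsp)
        return [ret for ret in self.slots[first:] if ret != 0]


shadow = OffsetShadowStack(app_low=0x10000, size=0x1000, shadow_low=0x80000)
xsp = 0x11000                          # empty app stack
xsp = shadow.on_call(xsp, 0x401000)    # main -> f
xsp -= 32                              # f's locals
xsp = shadow.on_call(xsp, 0x402000)    # f -> g
print([hex(addr) for addr in shadow.walk(xsp)])
```

Because a query starts scanning at the current xsp, stale entries below it (for example, left behind by a longjmp) are skipped in this model without any extra bookkeeping.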

For signal or callstack, we can switch the shadow stack if necessary.

Owner: zhao...@google.com
