dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

GetFunctionEnter3Info does not return correct argument address on Unix #10951

Closed sywhang closed 4 years ago

sywhang commented 6 years ago

This is an issue I found while investigating an issue reported in this thread: https://github.com/dotnet/coreclr/issues/18977

See also: https://github.com/dotnet/docs/issues/6728

On Linux, GetFunctionEnter3Info does not return the correct argument addresses (it does return the correct numRanges and totalArgumentSize). I've verified this issue on x64 (Ubuntu), but I have not yet tried to repro it on other architectures (ARM).

The sample profilee is a very simple program with one function call.

    using System;

    class Program
    {
        static int foo(int a, int b, int c, string d)
        {
            return a + b + c + d.Length;
        }

        static void Main(string[] args)
        {
            Console.Write("foo: {0:d}\n", foo(0xaa, 0xbb, 0xcc, "bar"));
        }
    }

I put a breakpoint on ProfilingGetFunctionEnter3Info in the runtime right before it returns. On Windows, WinDbg shows the following when I inspect the memory address pointed to by pArgumentInfo->ranges[0].startAddress:

00000055`585be2b0 aa 00 00 00 00 00 00 00 bb  .........
00000055`585be2b9 00 00 00 00 00 00 00 cc 00  .........
00000055`585be2c2 00 00 00 00 00 00 80 0e 13  .........
00000055`585be2cb a7 6d 01 00 00 00 00 00 00

As shown, the memory indeed contains the arguments that were passed to the function foo.

On Linux, lldb shows the following:

(lldb) frame variable
(FunctionID) functionId = 140735268410280
(COR_PRF_ELT_INFO) eltInfo = 140737488337952
(COR_PRF_FRAME_INFO *) pFrameInfo = 0x00007fffffffb8f0
(ULONG *) pcbArgumentInfo = 0x00007fffffffb8ec
(COR_PRF_FUNCTION_ARGUMENT_INFO *) pArgumentInfo = 0x00000000013a9120
(COR_PRF_ELT_INFO_INTERNAL *) pELTInfo = 0x00007fffffffbc20
(MethodDesc *) pMethodDesc = 0x00007fff7bae53a8

(lldb) x/20x 0x00000000013a9120
0x013a9120: 0x00000004 0x00000014 0xffffbf50 0x00007fff
0x013a9130: 0x00000004 0xcccccccc 0xffffbf58 0x00007fff
0x013a9140: 0x00000004 0xcccccccc 0xffffbf60 0x00007fff
0x013a9150: 0x00000004 0xcccccccc 0xffffbf68 0x00007fff
0x013a9160: 0x00000008 0xcccccccc 0xcccccccc 0xcccccccc

(lldb) x/20x 0x00007fffffffbf50
0x7fffffffbf50: 0x85af6e00 0xffc21e02 0x00000000 0x00000000
0x7fffffffbf60: 0xf58eedb3 0x00007fff 0x54021b68 0x00007fff
0x7fffffffbf70: 0xffffe480 0x00000001 0x54021b50 0x00007fff
0x7fffffffbf80: 0x85af6e00 0xffc21e02 0xffffe060 0x00007fff
0x7fffffffbf90: 0xffffbfb0 0x00007fff 0xf5bbfc83 0x00007fff

As shown above, the reported argument address (0x00007fffffffbf50) does not contain the function arguments (0xaa, 0xbb, 0xcc, "bar").
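For readers unfamiliar with the callback shape: the profiler receives a COR_PRF_FUNCTION_ARGUMENT_INFO whose ranges it must copy out before the callee's frame goes away. A minimal sketch of that consumption loop, using a simplified stand-in for the corprof.h struct (SnapshotArguments is an invented helper name, not a runtime API):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified stand-in for the corprof.h COR_PRF_FUNCTION_ARGUMENT_RANGE:
// each range describes one contiguous block of argument memory.
struct ArgumentRange {
    uintptr_t startAddress; // where the runtime says the argument block lives
    uint32_t  length;       // size of the block in bytes
};

// Copy every reported range into one contiguous buffer, the way an ELT
// profiler snapshots arguments inside its Enter3WithInfo callback.
std::vector<uint8_t> SnapshotArguments(const ArgumentRange* ranges, uint32_t numRanges) {
    std::vector<uint8_t> buf;
    for (uint32_t i = 0; i < numRanges; i++) {
        const uint8_t* src = reinterpret_cast<const uint8_t*>(ranges[i].startAddress);
        buf.insert(buf.end(), src, src + ranges[i].length);
    }
    return buf;
}
```

The bug reported here is that on Linux ranges[i].startAddress points at stack memory the arguments were never homed to, so a snapshot like this contains garbage rather than 0xaa, 0xbb, 0xcc.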

AaronRobinsonMSFT commented 6 years ago

cc @noahfalk

sywhang commented 6 years ago

Just verified that this is broken on OSX as well, which points to the calling convention as the culprit. Will update as I investigate further.

dotnetjt commented 5 years ago

Any updates on this one?

sywhang commented 5 years ago

@dotnetjt I've identified the root cause - it has to do with ProfileArgIterator::ProfileArgIterator not respecting the calling convention difference between Windows/Unix systems (more specifically, when we pass arguments in registers instead of putting them on the stack). I am currently a little blocked with other work items in the GC space, but the fix for this will definitely make it into 3.0 preview.
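To make the Windows/Unix difference concrete: the register assignments below are the standard Win64 and SysV amd64 ABI rules for integer/pointer arguments; the helper function itself is purely an illustrative sketch, not runtime code:

```cpp
#include <string>

// Win64 passes the first 4 integer/pointer arguments in rcx, rdx, r8, r9;
// SysV amd64 passes the first 6 in rdi, rsi, rdx, rcx, r8, r9.
// (Well-known ABI facts; this lookup helper is invented for illustration.)
std::string ArgRegister(bool windows, int argIndex) {
    static const char* win[]  = {"rcx", "rdx", "r8", "r9"};
    static const char* sysv[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    int limit = windows ? 4 : 6;
    if (argIndex < limit)
        return windows ? win[argIndex] : sysv[argIndex];
    return "stack"; // remaining arguments are spilled to the caller's stack
}
```

On Windows every register argument also has a reserved "home" slot on the stack, so arguments can be spilled into one contiguous block; SysV reserves no such slots, which is why code assuming the Windows layout ends up reading unrelated stack memory on Linux.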

dotnetjt commented 5 years ago

Any updates on this one @sywhang ?

sywhang commented 5 years ago

@dotnetjt sorry for the late response - I've been blocked with other work items on my plate. I have a fix that works for most cases but fails on one of the tests I've written. I'll send out a PR once I get around to fixing my bug.

mattwarren commented 5 years ago

@sywhang as I mentioned in https://github.com/dotnet/coreclr/issues/18977#issuecomment-482041963, I'm also blocked on this issue, is there any ETA on a fix? (I see the bug when running against a local build of the CoreCLR, as-of commit 5d57b87)

janvorli commented 5 years ago

I have taken a brief look at the code and the thing that would complicate the implementation is the fact that on Windows, all the arguments are stored in memory in a sequential layout, so we can just get pointers to them and stick them into the pArgumentInfo->ranges[count].startAddress. On amd64 Unix though, structs can be passed in registers in a complex way, possibly being split between general purpose and xmm registers. So we will need to copy them to some memory location in a sequential manner and then store a pointer to that location into the pArgumentInfo->ranges[count].startAddress.
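A hedged sketch of the "sequential copy" described above: on SysV amd64 a struct classified as { INTEGER, SSE } travels half in a general purpose register and half in an xmm register, so reporting a single range means reassembling the eightbytes into one buffer first. All names below are invented for illustration; the real work would go through the runtime's ArgDestination machinery:

```cpp
#include <cstdint>
#include <cstring>

// SysV amd64 classifies each 8-byte chunk ("eightbyte") of a struct as
// INTEGER (passed in a GP register) or SSE (passed in an xmm register).
enum class EightbyteClass { Integer, Sse };

// Reassemble a register-split struct into sequential memory: walk the
// eightbytes in classification order, consuming saved GP and xmm register
// slots as appropriate. This is the reverse of copying a struct *into*
// registers for a call.
void CopyStructFromRegisters(uint8_t* dest,
                             const EightbyteClass* classes, int numEightbytes,
                             const uint64_t* gpRegs, const uint64_t* xmmRegs) {
    int gp = 0, xmm = 0;
    for (int i = 0; i < numEightbytes; i++) {
        const uint64_t* src = (classes[i] == EightbyteClass::Integer)
                                  ? &gpRegs[gp++] : &xmmRegs[xmm++];
        std::memcpy(dest + i * 8, src, 8);
    }
}
```

A struct like struct S { long A; double B; } is the canonical example: A lands in a GP register and B in an xmm register, so no single address in the transition block covers both halves.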

mattwarren commented 5 years ago

I spent a bit of time looking at the code, trying to understand what was going on, just for my own education. The most informative thing I found was this comment https://github.com/dotnet/coreclr/blob/3271f5066cafd8764b5bcf24e3bfa3bdcb351dc1/src/vm/callingconvention.h#L18-L34

Does that not indicate that callingconvention.h, which the Profiler calls into (e.g. here), handles most/all of this complexity that's present on amd64 Unix? Or am I misunderstanding it?

janvorli commented 5 years ago

The ArgIterator in callingconvention.h does handle the complexity. However structs that are passed in a combination of general purpose and xmm registers are not stored in a consecutive block of memory. For this case, the ArgIterator returns special offset (TransitionBlock::StructInRegsOffset). The ArgDestination has methods for getting location where the argument is stored. For structs passed in registers, there are two methods - GetStructFloatRegDestinationAddress and GetStructGenRegDestinationAddress. For other argument kinds, there is GetDestinationAddress. The IsStructPassedInRegs method is used to figure out the kind of argument.

The GetFunctionEnter3Info function reports a start address for each argument, so each argument needs to be stored sequentially in memory. So for structs passed in both general purpose and xmm registers, we would need to create a sequential copy somewhere in memory and pass the address of that copy in pArgumentInfo->ranges[count].startAddress.

The ArgDestination has a CopyStructToRegisters method that is used to copy a struct from sequential in-memory storage into the appropriate general purpose / xmm registers in the TransitionBlock. We would need to add a method that does the reverse to support getting a sequential copy of the arguments.

It seems we could also just modify the behavior of GetFunctionEnter3Info on Unix so that it would return up to two ranges for structs passed in registers. That would eliminate the need for creating the sequential copy. However, that may break assumptions that profiler developers have, so I am not sure if that's feasible. The MSDN doc says:

So, for a single function, you have a single COR_PRF_FUNCTION_ARGUMENT_INFO structure, which references multiple COR_PRF_FUNCTION_ARGUMENT_RANGE structures, each of which points to one or more function arguments.

The "points to one or more function arguments" seems to indicate that it is not expected to point to a "half" of an argument.

mattwarren commented 5 years ago

@janvorli thanks for the in-depth answer, it makes sense now (I didn't fully appreciate that a single struct argument could be split across registers and the stack, yeah that sounds tricky)

The ArgDestination has a CopyStructToRegisters method that is used to copy a struct from sequential in-memory storage into the appropriate general purpose / xmm registers in the TransitionBlock. We would need to add a method that does the reverse to support getting a sequential copy of the arguments.

I'd come across ArgDestination when I was looking around for other callers of GetNextOffset() (in callingconvention.h), I just hadn't appreciated that the other usages, e.g. RuntimeMethodHandle::InvokeMethod (here), were doing the opposite of what the Profiler needs to do. I assume that the Profiler is the only part of the runtime that needs to do things this way round (i.e. copying args from registers/stack -> a struct).

It seems we could also just modify the behavior of the GetFunctionEnter3Info on Unix so that it would return upto two ranges for structs passed in registers. That would eliminate the need for creating the sequential copy. However, that may break assumptions that the profiler developers have, so I am not sure if that's feasible. The MSDN doc says:

I can't speak for everyone who's using the Profiling API, but anything that makes it easier to use is better; it already requires pretty deep knowledge, i.e. 'naked' functions, writing some assembly to preserve registers, knowing about calling conventions, etc. Thank goodness for David Broman's posts and sample code!!

So I'd obviously prefer to be able to rely on the arguments being in one location in memory. Also it means that less code has to change between a Windows and Linux version.

mattwarren commented 5 years ago

BTW, just to confirm: when you have a method with a this pointer, the callback does pass the correct address in pArgumentInfo->ranges[0].startAddress, i.e. it appears that this code is doing the right thing (I guess the Unix ABI convention around that is more straightforward compared to arguments)

noahfalk commented 5 years ago

@davmason

Looks like you guys have already deciphered most of the challenges here (thanks!). I just wanted to add a couple of things:

1) In ProfileEnter we probably do have the wiggle room to copy certain arguments to the stack first so that they can have contiguous memory ranges, then we could report that to the profiler.

2) We've been hearing a bit of feedback over the years that ELT is hard to use, and the new architecture ports feel like they are exacerbating this. @davmason and I have been exploring alternatives that profilers might prefer. The primary one we are looking at is what we could do with IL rewriting. We planned to write something a little more detailed to get feedback, but this current issue seemed like a good opportunity while the topic is front of mind. We know IL rewriting is no cakewalk as-is, but it has some nice properties: it is architecture neutral, in many cases the final assembly is faster, and it is far more extensible if your needs grow beyond just ELT. We were wondering if there are things we could do to lower the difficulty curve there so that it feels like a clearly better option (more samples? docs? pre-canned higher level API abstractions?). As an ELT user we'd love to get your thoughts!

tommcdon commented 5 years ago

@davmason can you take a look?

mattwarren commented 5 years ago

@noahfalk sorry for the late reply, I've been working on other things and only just getting back to this issue now

Here are my general thoughts, having worked on both an IL rewriting profiler and an ELT hook, metadata-based one (using an approach similar to the one you described in https://github.com/dotnet/docs/issues/6728#issuecomment-409394385).

Full caveat, in both cases the decision to go with IL re-writing v. ELT Hooks & Metadata was made before I joined the projects. In addition the initial code had already been implemented, so I can't take any credit for the decision and/or initial work, I just had the job of continuing the development. Also, I've only ever worked on APM tools, so my perspective of using the Profiler API may be different from other tools.

We were wondering if there are things we could do to lower the difficulty curve there so that it feels like a clearly better option (more samples? docs? pre-canned higher level API abstractions?).

I think it's a really good initiative, I'd love to read your write-up on how you would make things easier. Working samples are one of the most helpful things I've found, for instance this one saved me lots of time when porting an existing Profiler from Windows to Linux.

However, I suspect that any attempts to get people to move over to an ELT approach (by providing 'pre-canned higher level API abstractions') are going to run into the same arguments that came up in https://github.com/dotnet/corefx/issues/24390. I.e. for existing products there will already have been a fair amount of development work, so any replacement needs to be very compelling, offering a clear advantage (i.e. efficiency, speed, etc.) to make the changeover worthwhile. For new products it's a different argument, but I don't know how often Profiling API based products are started; I would imagine it's a fairly niche/limited field, and in my experience there must be fewer than 100 existing tools that use it (maybe much less than that?)

Having said that, I've always thought that if you were able to do all of this from managed code (like the Java Agent API allows) it may open up the amount of tools that make use of it. But I don't know if you're proposing something as radical as that!!

We know IL rewriting is no cakewalk as-is but it has some nice properties of being architecture neutral, in many cases the final assembly is faster, and it is far more extensible if your needs grow beyond just ELT.

One question, is IL re-writing faster? ('in many cases the final assembly is faster') i.e. is the cost of an additional managed method call, plus packing up the function arguments (so they can be used/inspected in callback), less than the cost of pulling out 'function arguments' and 'return values' via the metadata API directly (in un-managed, C++ code)? I've not measured the 2 side-by-side and I guess it would be hard to do accurately, but I would've thought that it's not so obvious that IL-rewriting is faster?

Finally, unless I'm mistaken, you can get quite a long way with a ELT/metadata based approach. The main difference I've found is that with IL re-writing you can then call managed methods in the probe/callback, whereas the other approach only allows access to static data, i.e. function arguments, return values and their fields. So IL re-writing is useful if you want to get information that is only accessible via a Property for instance, i.e. you can't just access it directly from a field. But, when you do this you are potentially causing side-effects (by calling a method), so it can have downsides.


Hope this all makes sense? (I'm not sure if it helps clarify anything though?! 😊)

mattwarren commented 5 years ago

In ProfileEnter we probably do have the wiggle room to copy certain arguments to the stack first so that they can have contiguous memory ranges, then we could report that to the profiler.

@noahfalk / @tommcdon / @davmason

Are you able to say if this will be fixed in .NET Core 3.0 or is that not something you can comment on?

davmason commented 5 years ago

@mattwarren

Are you able to say if this will be fixed in .NET Core 3.0 or is that not something you can comment on?

Right now it's on the backlog, which means we'd like to fix it but nothing is guaranteed. Since we are relatively late in the 3.0 cycle anything on the backlog has a realistic risk of not getting fixed. If you let me know how this impacts you (i.e. without this will you have to do a workaround or will you be unable to make your profiler work at all on linux), then it gives us better information to prioritize with.

noahfalk commented 5 years ago

@mattwarren thanks a bunch for the feedback, and no worries on the timing! It's a long-lead idea just simmering at this point. And not to worry, your feedback very much makes sense: I read it primarily to say that profilers may have a big investment in a particular tech, so changing to anything else means migration cost if the new thing isn't highly compatible with the old one. Totally reasonable.

Having said that, I've always thought that if you were able to do all of this from managed code (like the Java Agent API allows) it may open up the amount of tools that make use of it. But I don't know if you're proposing something as radical as that!!

We've kicked around the idea of doing IL-rewriting via managed code in the past, it doesn't feel crazy : ) A bare bones approach could be as simple as creating a COM-interop wrapper for the ICorProfiler API that already exists (in fact any profiler writer could implement this). Where things get harder is if you want to eliminate the rule that there is only one profiler component that is allowed to call the API. If I recall Java's approach is not to enforce any barrier and if different callers can't agree on the bytecode for a method someone will win and someone will lose. The other option we've dabbled with is to offer a limited set of instrumentation options such as "Give me a callback here", "Capture this variable value" or "Wrap the function with some exception handling". Although not as free-form as IL-rewriting, all of these operations are composable so any number of instrumentation consumers could make requests at the same time. If this kind of thing feels intriguing and you'd like to explore it more we can open a new issue for it.

One question, is IL re-writing faster? ('in many cases the final assembly is faster') i.e. is the cost of an additional managed method call, plus packing up the function arguments (so they can be used/inspected in callback), less than the cost of pulling out 'function arguments' and 'return values' via the metadata API directly (in un-managed, C++ code)? I've not measured the 2 side-by-side and I guess it would be hard to do accurately, but I would've thought that it's not so obvious that IL-rewriting is faster?

I'll admit, I only have an educated opinion, which isn't as reliable as a concrete perf measurement. There are a couple of things that favor IL rewriting:

1) In the fastest/ideal case, your managed arguments aren't ever going to transit to native code. You write your instrumentation in managed code, either directly with inline IL or with a call to an instrumentation function that is also managed. The JIT is then free to apply its full set of code-gen optimizations to your instrumentation. For example, it could inline the call to the instrumentation callback so no call-frame setup ever occurred. ELT always pays for call setup/teardown.

2) In the not quite as fast case where you want the instrumentation at least partly in native code but still strongly typed, you can define a p/invoke method to receive that call with strongly typed arguments. P/invoke is a path that gets a lot of performance attention, so the JIT is going to generate pretty efficient code to marshal those arguments. The ELT callout isn't perf optimized when using the argument-providing version, so you get a callstack like this:

Your profiler ELT callback function
Runtime implemented C++ function that sets up the call to your ELT callback
Runtime implemented assembly function that preserves registers to match the C++ calling convention
Jitted code

That call sequence probably costs at least 20 instructions, and then your C++ function still has buffers of argument pointers that it must do further processing on to recover the value of the arguments.

3) Instrumentation that is weakly typed and transitions to native code - I'd expect this to be the slowest case. I'm guessing the unpacking in native code will be the same in either case and if it is interpreting metadata on the fly that is likely to be the dominant cost of the whole operation. But in a comparison of the runtime ELT assembly/C++ code packing your arguments vs. some IL packing the arguments our runtime assembly/C++ helper for this is pretty naive. As long as you generated some IL that didn't box all the value types when it stuffs them in an array I would place my bet on the jitted IL being faster.

Finally, unless I'm mistaken, you can get quite a long way with a ELT/metadata based approach.

Yep, as you pointed out the main limitation is that you can't (easily) turn around and call managed code. If everything you wanted to do was implemented in native code this is probably a non-issue, if you wanted to invoke a property getter or write the arguments into an EventSource for logging that would be harder.

bbc2 commented 5 years ago

For what it's worth, since you seem to be interested in feedback in this thread, here's some more, based on our experience building a profiler at Cryptosense.

We analyze the cryptography of applications by logging calls they make to cryptographic APIs (like javax.crypto.Cipher.init or System.Security.Cryptography.Aes.Create). In Java, we use a JVM agent, also written in Java and with the help of Javassist to instrument methods of interest.

The .NET profiling API (for ELT or IL rewriting), unlike other techniques like compile-time instrumentation, has the advantage of being usable on applications that are already compiled to bytecode and running in an unmodified environment (that's why we chose that). However, it sits at a much lower level than a typical Java agent, which notably means that you can't call getters or even other methods on objects of interest (as mentioned earlier in this thread).

We made an ELT profiler. Unlike in a Java agent, everything you do is more dangerous (reading from memory with a C API and C++ code safety in general). But it works, except on Linux, unfortunately, because we need values of arguments and return values.

We considered doing IL rewriting, notably because we are used to doing that in Java, but it seemed harder to do in .NET, based on the sample application. Furthermore, even if we consider that enter/leave hooks are not that hard to do, we would still need arguments and return values.

I'm a bit new to .NET so it's highly possible that I've missed some obvious alternative, but at least you know one more use case for instrumenting applications and why the profiling API is attractive to us.

noahfalk commented 5 years ago

Furthermore, even if we consider that enter/leave hooks are not that hard to do, we would still need arguments and return values.

Thanks for the feedback! We should probably look into making a modified sample that shows how you could do the IL rewriting to capture arguments. It is entirely possible to do with the current IL rewriting API, the sample just isn't showing it.

mattwarren commented 5 years ago

@davmason

Right now it's on the backlog, which means we'd like to fix it but nothing is guaranteed. Since we are relatively late in the 3.0 cycle anything on the backlog has a realistic risk of not getting fixed. If you let me know how this impacts you (i.e. without this will you have to do a workaround or will you be unable to make your profiler work at all on linux), then it gives us better information to prioritize with.

The profiler that I'm working on will lose a large amount of its functionality if there isn't a fix. It will work, but since it relies extensively on extracting the arguments of a method call, it will be very limited. It would be incredibly valuable (to us) for you to prioritize this fix, as otherwise we will have to spend significant effort on workarounds to cover for the lack of that functionality.

mattwarren commented 5 years ago

@noahfalk thanks for the detailed answer!

The ELT callout isn't perf optimized when using the argument providing version so you get a callstack like this:

Your profiler ELT callback function
Runtime implemented C++ function that sets up the call to your ELT callback
Runtime implemented assembly function that preserves registers to match the C++ calling convention
Jitted code

That call sequence probably costs at least 20 instructions, and then your C++ function still has buffers of argument pointers that it must do further processing on to recover the value of the arguments.

Ah, good point, I'd completely forgotten about the overhead of the ELT callback itself, thanks for the reminder!

BTW on the topic of IL re-writing, I've recently come across another way that it can be done. Instead of re-writing a particular method, you can instead re-write all the call-sites to re-direct them to a wrapper function (C#) that analyses the arguments and then calls the real method. This means that the IL re-writing is a lot simpler.

I didn't come up with this approach, I stumbled across it in this project, see the following code:

noahfalk commented 5 years ago

Instead of re-writing a particular method, you can instead re-write all the call-sites to re-direct them

Does the technique forego handling all the cases that are not direct calls (I didn't get a chance to look through the implementation, so forgive me if it's all handled in some way in there)? I'm assuming these cases aren't handled: a) Virtual calls b) Interface calls c) Delegate calls d) COM-interop calls e) Reflection calls

mattwarren commented 5 years ago

@davmason Thanks for fixing this! I'll try it out in the next few days with the Profiler I'm working on

mattwarren commented 5 years ago

Does the technique forego handling all the cases that are not direct calls (I didn't get a chance to look through the implementation, so forgive me if it's all handled in some way in there)? I'm assuming these cases aren't handled: a) Virtual calls b) Interface calls c) Delegate calls d) COM-interop calls e) Reflection calls

@noahfalk good point. It only works with CALL or CALLVIRT IL instructions (see here), so I guess a few of those scenarios are missing. Although I would imagine that you can get quite far with just direct/virtual/interface call rewriting; the other scenarios are a bit more niche when you are talking about methods that you would actually be interested in profiling.

mattwarren commented 5 years ago

@davmason Just to confirm, I tried the latest code in the master branch and in my test app I now see the arguments passed in correctly via the COR_PRF_FUNCTION_ARGUMENT_INFO structure.

Thanks again for fixing this!

davmason commented 5 years ago

@mattwarren Great, thanks for confirming!