__debug function performance

TurkeyMan commented 1 month ago

Are there any opportunities to optimise or improve the performance of these __debug functions? They are extremely slow. I have a custom string type and if a struct contains a few strings, stepping feels really sluggish, and if there's an array anywhere in view; it takes seconds to 10s of seconds each step.

Why are they so slow? It doesn't seem right that setting up the call should be that much trouble? If I can understand the performance characteristics, maybe I can make improvements on the app side...?

I have NOT enabled the "switch GC" option, since I am confident all my __debug functions are @nogc.

rainers commented 1 month ago

I shortly looked at the CPU diagnostics of displaying 100 of your Strings and the debugger mostly waits for the call to be performed in the other process. AFAICT this needs multiple inter-process-communication (devenv.exe <-> msvcmon.exe <-> debuggee.exe) so I suspect there is some inefficient waiting going on. Maybe that can optimized for arrays of objects with debug methods, but that will get messy e.g. if it is only one field of the array element that has an debugOverview method.

TurkeyMan commented 1 month ago

Hmmm, well, in its current state it's almost unusable. I think I'm going to have to turn the __debug feature off, because it's just too slow :/ Sadly, I don't think it's really reasonable to work without it either; not being able to inspect strings and arrays severely undermines the usefulness of a debug session.

If it's about ipc waiting, then finding any opportunities for batching up requests seems like the way to go... is the code structured in such a way that you can gather the requests rather than resolving them immediately, and then send them all to the debuggee in one batch? That might require a lib compiled into the debuggee offering a function which can receive a bundle of requests from the debugger, and then the debuggee might iterate the bundle and resolve them all in one big go? :/ So, rather than calling each debug function independently, call one global debug function from a lib, which would locally iterate the bundle of requests calling each debug function on each object in the bundle? If the lib is not linked, fall back on existing semantics. Should be simple to check if a global debugResolveBundle symbol is present in the binary. Might help the GC swapping too; just do it once around the bundle...

TurkeyMan commented 1 month ago

Maybe we could put a function in druntime that's present when building with symbols... at very least, I wouldn't be upset to link a lib.

rainers commented 1 month ago

The swapped GC is inside a DLL loaded into the target process, so that could contain helper functions. The calls by VS to evaluate locals or expressions are rather fine grained and probably not easily combinable (especially not delayed until more requests come in), but if an enumerator is returned to VS, you can predict that it is pretty likely that it will get enumerated to some extend, so the elements could be evaluated in larger chunks and be cached. Not a small change, though...

TurkeyMan commented 1 month ago

The swapped GC is inside a DLL loaded into the target process, so that could contain helper functions. The calls by VS to evaluate locals or expressions are rather fine grained and probably not easily combinable (especially not delayed until more requests come in),

Yeah, this was my concern. I can a Microsoft API being very event/response based... with an expectation of immediate responses :/

but if an enumerator is returned to VS, you can predict that it is pretty likely that it will get enumerated to some extend, so the elements could be evaluated in larger chunks and be cached. Not a small change, though...

I'm not sure I follow, but I'll take your word for it. Can the debugger accept and retain an enumerator object across requests, and how would that have an IPC advantage?

rainers commented 1 month ago

Yeah, this was my concern. I can a Microsoft API being very event/response based... with an expectation of immediate responses :/

On second look, the API does allow to return results asynchronously, so it could allow bundling multiple requests to the target process.

Can the debugger accept and retain an enumerator object across requests, and how would that have an IPC advantage?

For structs and arrays an enumerator is returned when asked for "children", and VS then calls something similar to GetItems(enumerator, start, count). Even if called individually for each child, these could return cached results filled from reading larger chunks.

On the other hand, calling a function in the process is set up via an abstract stack language that gets compiled into the target process. This currently happens for each evaluation. Maybe it is the compilation that takes most of the time (not the execution) and can be avoided by reusing the same compiled instruction sequence with only the this-pointer exchanged.

TurkeyMan commented 1 month ago

Okay, it sounds like there's multiple promising paths forwards.

rainers commented 3 weeks ago

FYI: I tried executing the same "compiled" instructions multiple times, but that won't help speeding up evaluation as each execution takes 40-50 ms. So that idea can be dismissed.

Then I added partial experimental support for the asynchronous interface of the debugger functions. As a result each function call executed in the target process is added to a work list provided by the debugger host. The work list gets executed in another thread, but it is still sequential and takes about the same time for each operation. My implementation isn't yet complete but I assume that the advantage of this will be that it doesn't block the UI and operations can be cancelled before completion and you can continue stepping through the program without waiting for all debug windows being updated. Fully supporting that seems realistic, but I expect it's still quite some work to do.

I guess that might help for one of the pain points, but execution will still be slow. To avoid that we could reimplement the worklist and combine requests ourselves before sending them to the target process. Not sure how feasible this actually is, though. Anyway, it still needs the same asynchronous evaluation as above, so the changes above would not be lost.

A completely different approach could be the implementation of something similar to C++'s natvis definitions, that allows expressions and some conditions on the debugger side and just has to read target process memory. That might get tricky as well especially with respect to templated structs/classes, though.

TurkeyMan commented 3 weeks ago

Where you say 'full support', you mean batching them and sending them to the debuggee all at once?

It's definitely a huge improvement if it stops locking up the UI, even if it still takes a while to populate the window when you stop. The slow stepping speed is definitely a show-stopper.

natvis does seem to be very fast, but it's certainly a weird definition language and it's really hard to implement some data structures. Your idea with __debug functions is definitely a far superior idea, if it can be made to work efficiently. Implementing something like natvis likely requires some sort of DSL to describe the custom evaluations, and probably a huge implementation effort?

rainers commented 3 weeks ago

Where you say 'full support', you mean batching them and sending them to the debuggee all at once?

I meant full support for the asynchronous interface. My partial implementation is currently limited to displaying arrays of structs with __debugOverview definitions, other enumerations (struct members, locals, etc.) are still using synchronous calls. It seems feasible and worthwhile to adapt these similarly, but still a bit of work. I'll try that.

Implementing something like natvis requires a larger effort, indeed. The expression evaluation part is there, though. I can imagine evaluating a static __debugOverviewExpr struct member and evaluating that as if it was a watch expression, but it get's more elaborate if control flow with statements is needed.

TurkeyMan commented 3 weeks ago

If you're spending some time with VisualD at the moment, do you also have any theories why go-to definition only works around 50% of the time? I guess the frontend is bailing out and much of the code has no semantic markup? Is there a way we can get a clear error output in cases where the semantic bails out? auto-complete and goto definition are really flakey now... I guess it's from the change to DMD-FE?

TurkeyMan commented 3 weeks ago

Actually, just to add to that something I hadn't noticed earlier; I was just playing around trying to identify patterns why completion suggestions weren't working... I half-typed a symbol (an enum key in this case), and then ctrl-space which should auto-complete or pop-up the completion suggestions, but it was doing nothing like usual... and then I got distracted for a moment, and then a good few seconds later the completion suggestions popped up. So, the semantic did know the proper suggestions, and it did eventually appear, but it took quite a long time for it to appear, and I just thought it wasn't working at all. When typing quickly, you expect ctrl-space to perform a completion INSTANTLY, same with the suggestion pop-up; it needs to be instantaneous to match the flow of natural typing, so even though it's working, it's just very slow and I always did some operation like move the cursor or click something that disrupted the waiting period.

So, that's encouraging, but what could possibly disrupt the completion from working quickly? It doesn't feel like something that should be involving IPC or any of those latent operations. Any thoughts?

rainers commented 2 weeks ago

While you edit the code you change the semantics at the current caret location, so the new source has to be evaluated first. At least the current file has to be analyzed semantically again. The dmdserver tries to minimize analyzing imported modules, but not sure how well this works.

In my experience, browsing the source with goto definition works quite ok as you don't edit it in that situation, but completion often is rather slow (I also found a few regressions regarding class members and fields). IIRC it is slower than when actually compiling, not sure what's causing that, I'll have to profile that.

I have prepared a version of the dmdserver that can log Visual D's requests and responses, maybe that can shed some insight. I want to complete the async debugger interface before the next release, though.

rainers commented 2 weeks ago

You can try the new build of MagoNatCC.dll from https://ci.appveyor.com/project/rainers/mago/builds/50930366/artifacts. This uses the async evaluation for __debugOverview calls, but doesn't work correctly in combination with switching the GC. It seems pretty responsive in my tests, but if I failed to handle completion or cancellation requests, it usually freezes the debugging session at some point, though.

TurkeyMan commented 2 weeks ago

Where do I place this DLL? My current instal doesn't seem to have those files to replace:

TurkeyMan commented 2 weeks ago

They're all under Program Files (x86), does that mean I should take the 32bit dll's?

rainers commented 2 weeks ago

Just like the last time, only this one needs to be replaced: "c:\Program Files\Microsoft Visual Studio\2022\Preview\Common7\Packages\Debugger\MagoNatCC.dll"

TurkeyMan commented 2 weeks ago

I'm seeing a lot of this sort of thing now:

rainers commented 2 weeks ago

Too bad. Is this one of the places where you would expect one of your custom strings to be displayed? If something else, could you specify more details about its type?

rainers commented 2 weeks ago

Ah, classes instead of structs seem broken. I'll take a look...

rainers commented 2 weeks ago

There is a new build here: https://ci.appveyor.com/project/rainers/mago/builds/50957984/artifacts that fixes class display and a few other issues.

rainers commented 1 week ago

I fixed several more issues caused by the conversion to the async interface. It is now released as part of https://github.com/dlang/visuald/releases/tag/v1.4.0-rc3

dlang / visuald

__debug function performance #282