dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

Avoid reparsing on solution open #60014

Closed panopticoncentral closed 2 years ago

panopticoncentral commented 4 years ago

From @jasonmalinowski

This is a follow-up to the idea we had to have the project system pass in an approximation of the options that impact parsing once we've completed evaluation; this will mean we don't end up reparsing everything again when things change.

There are only a handful of properties that impact parsing. You can see them on the ParseOptions type at https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.parseoptions, and the derived types which also have the language version. So the ones that would probably matter for getting this right (or close enough) would be to pass in:

• /define (here's how it's processed: https://github.com/dotnet/roslyn/blob/7714ab4a89129ccc1885f961c0d3a12adce2add5/src/Compilers/Core/MSBuildTask/Csc.cs#L398)
• /doc (just passed in directly from DocumentationFile)
• /features (here's how it's processed: https://github.com/dotnet/roslyn/blob/7714ab4a89129ccc1885f961c0d3a12adce2add5/src/Compilers/Core/MSBuildTask/ManagedCompiler.cs#L834-L848)
• /langver (just passed in directly from LangVersion)

The handling for /define mostly covers strange or error cases; simply treating it as a string.Split/Join is probably a good enough approximation for 99% of projects. The handling for /doc is a bit more interesting: we translate whether the path exists into the parsing mode on the ParseOptions type, which is why you don't see the string there, but it's something we'd need to deal with.
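A minimal C# sketch of that approximation (the helper name is invented, and the exact DocumentationMode mapping is an assumption based on the description above, not the actual implementation):

```csharp
// Hypothetical sketch: approximating /define and /doc handling from raw
// MSBuild property values, ignoring the compiler's legacy error handling.
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class ParseOptionApproximation
{
    public static CSharpParseOptions Approximate(
        string defineConstants,   // e.g. "TRACE;DEBUG;NET6_0"
        string documentationFile) // e.g. "bin\Debug\MyLib.xml", or empty
    {
        // /define: a plain string.Split is "good enough" for most projects.
        var symbols = (defineConstants ?? "")
            .Split(new[] { ';', ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0);

        // /doc: only whether a path is present matters for parsing; the
        // specific mode values here are an assumption for illustration.
        var docMode = string.IsNullOrEmpty(documentationFile)
            ? DocumentationMode.None
            : DocumentationMode.Diagnose;

        return CSharpParseOptions.Default
            .WithPreprocessorSymbols(symbols)
            .WithDocumentationMode(docMode);
    }
}
```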

I think it should also be possible to avoid reparsing trees that don't contain any preprocessor directives when only the /define switch changes; that would help speed up Debug/Release transitions if the project system weren't destroying and re-creating contexts.
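That optimization could look something like this sketch (hypothetical helper; assumes the only difference between the old and new options is the set of preprocessor symbols):

```csharp
// Hypothetical sketch: when only /define changes, a tree with no
// preprocessor directives parses identically, so the old tree can be reused.
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class TreeReuse
{
    public static SyntaxTree ReparseIfNeeded(
        SyntaxTree oldTree, CSharpParseOptions newOptions)
    {
        // ContainsDirectives is true if the file has any #if/#define/#region
        // etc.; without directives, no symbol change can affect the parse.
        if (!oldTree.GetRoot().ContainsDirectives)
            return oldTree;

        return CSharpSyntaxTree.ParseText(
            oldTree.GetText(), newOptions, oldTree.FilePath);
    }
}
```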

jasonmalinowski commented 3 years ago

So I dug into this a bit more and used some of the telemetry that @davidwengier added back in https://github.com/dotnet/roslyn/pull/48215 to see how much this would help.

In one simple test of opening Roslyn, we threw away around 9,000 syntax trees due to a parse option change after we had previously parsed the tree. It's hard to give a precise estimate of the amount of throwaway work without actually implementing this, but 3 seconds of CPU time were spent parsing before the first throwaway of any set of trees. Some other back-of-the-napkin math: there are 15 seconds of parsing total in the one trace I did, ignoring parsing happening in the ServiceHub process. If Roslyn has around 40,000 trees to parse (roughly 20,000 files with two targets each), then throwing away 9,000 trees means about a quarter of the work is throwaway work, which is also around 3-4 seconds.

@drewnoakes: how hard would it be to have IWorkspaceProjectContext.SetOptions() called with an approximation of the command line after evaluation? https://github.com/dotnet/project-system/pull/7216 was a pretty small change, and unlike that one we wouldn't even have to add a new API. Would calling the method here be OK if we got the data over to it?

@jaredpar: this would potentially need a small bit of tweaking in our Roslyn targets, so we'd need your buy-in (even if we do the work). To summarize the goal from the Roslyn perspective: the project system runs a design-time build, including the Csc task in the "give me the command line string only" mode, as the way to get the command line options. Before it does that, though, it gives us a list of source file paths from the MSBuild evaluation so the IDE at least knows some information while the full design-time build happens asynchronously. The side effect of this is that we parse a bunch of files in the IDE, and once we get a real set of options we toss it all out and parse again.

My proposal for the targets would be to stick some extra property/item group in the compiler targets which produces an "approximately correct" command line string containing just what we need for parsing to be correct. The project system gives us that and we start parsing trees with it. When we get the "real" command line string, we'll use that for the rest of the session, so the "approximately correct" string is only a heuristic. I say "approximately correct" because any logic has to be in MSBuild, and I wouldn't want to replicate some legacy logic around #define parsing; the goal is to get the common cases right, and if it's wrong we're no worse off than today. Thoughts?
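As a sketch of what that could look like (the property name here is invented purely for illustration; this is not an actual Roslyn target):

```xml
<!-- Hypothetical sketch: an evaluation-time property approximating only the
     parsing-relevant switches from the properties discussed above. -->
<PropertyGroup>
  <ApproximateParseOptionsCommandLine>/define:$(DefineConstants) /langversion:$(LangVersion) /doc:$(DocumentationFile) /features:$(Features)</ApproximateParseOptionsCommandLine>
</PropertyGroup>
```

The project system would read this property after evaluation and hand it to the workspace, which would run it through the existing command line parsing to build the initial ParseOptions.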

davkean commented 3 years ago

I like that idea, @jasonmalinowski. It means the project system has no idea what makes up the command line and just passes it through.

drewnoakes commented 3 years ago

I like the idea too. It sounds good in terms of cost / benefit.

Hopefully MSBuild is expressive enough to produce the approximate command line during evaluation. If not, we could do something via a new interface, defined on our side and exported by the IDE. Getting it directly from evaluation is simpler, though.

I'm happy to pick this up. What timeframe are you thinking?

jasonmalinowski commented 3 years ago

@drewnoakes I think it should be more than expressive enough, as most of the transformations are just string splitting. As for the project system side, is following that other PR good enough if I wanted to try getting the end-to-end working, or would it need something more complicated?

drewnoakes commented 3 years ago

@jasonmalinowski yes I believe that should suffice. You would need to plumb the property through the rule, dataflow, and into the data object used in that method.

jaredpar commented 3 years ago

@chsienki, @RikkiGibson

chsienki commented 3 years ago

Seems like a good idea. How does the project system actually get the properties to pass to the IDE? Does it evaluate the values of the CSC task and pass those in, or does the actual CSC build task get executed?

jasonmalinowski commented 3 years ago

@chsienki @RikkiGibson The actual Csc build task is executed, but with the switch that prevents it from actually invoking csc.exe or the compiler server. Essentially the loading is broken down into two steps:

  1. The MSBuild evaluation pass, where we're only evaluating properties and item groups.
  2. The MSBuild execution pass where we're actually running tasks/targets.

The project system starts telling Roslyn about information after the first pass, but since we don't have any parse options we're parsing with default options. Once the second pass happens we now get options, and potentially start parsing stuff over again.

chsienki commented 3 years ago

@jasonmalinowski Ok, my suggestion was going to be that we could do the command line 'construction' in the MSBuild task itself and return an out property, rather than having to do it via MSBuild evaluation, but it seems like this needs to happen before that point, so that wouldn't work.

This is only really needed for the C# stuff in the IDE right? Xaml builds and friends won't need to use it? We already have various copies of the targets in different places that are slightly out of sync and I want to make sure we aren't going to make that worse.

jasonmalinowski commented 3 years ago

@chsienki Indeed, that's how it already works, but not until the execution pass. Because the execution pass is slower than the evaluation pass alone, it's run asynchronously in the IDE.

> This is only really needed for the C# stuff in the IDE right? Xaml builds and friends won't need to use it? We already have various copies of the targets in different places that are slightly out of sync and I want to make sure we aren't going to make that worse.

Hmm, where are these other copies? That's news to me.

chsienki commented 3 years ago

There's one in MSBuild that's part of XamlPreCompile: https://github.com/dotnet/msbuild/blob/eac68aa8b922bb1d4c661c14138019f064565f7d/src/Tasks/Microsoft.CSharp.CurrentVersion.targets#L168

There are also razor components / compilations https://github.com/dotnet/sdk/blob/cc2a9c45df8bbfae6cc55a146b14dec93139b224/src/RazorSdk/Targets/Microsoft.NET.Sdk.Razor.Component.targets https://github.com/dotnet/sdk/blob/a30e465a2e2ea4e2550f319a2dc088daaafe5649/src/RazorSdk/Targets/Microsoft.NET.Sdk.Razor.Compilation.targets

And I think at least WinUI is doing the same thing.

jasonmalinowski commented 3 years ago

@chsienki So if those are copies of the task/target that's fine. This would only be consumed by the project system, and produced in the main targets file. Those other ones don't need any adjustment.

jasonmalinowski commented 2 years ago

@drewnoakes OK if I move this bug to Roslyn, since at this point I think we're planning on doing most of the work?

drewnoakes commented 2 years ago

@jasonmalinowski no concerns with that, thanks. Feel free to file a new issue here with anything needed from our side and assign it to me.

jasonmalinowski commented 2 years ago

So I took a shot at implementing this. The project system changes are trivial, but the MSBuild target changes are trickier than I expected. The problem is that for any modern SDK project, the preprocessor symbols are generated as part of targets, not during the evaluation pass anymore. It looks like /langver is still set during evaluation so we can get that, and combined with https://github.com/dotnet/roslyn/pull/59406 that might mean we still get a bit of a win here.

@sfoslund @rainersigwald I see when https://github.com/dotnet/sdk/pull/11635 was implemented, the computation of the implicit directives was put in the execution pass -- I'm guessing that was needed because some construct there isn't good for evaluation only? The context here: we're playing with an experiment to get the language service more accurate options up front, as otherwise we spend a lot of time parsing C# files only to potentially throw out the parses once we get the full design time build results back.
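For context, here is a rough illustration of the kind of symbols that execution-pass computation produces. This is heavily simplified and hypothetical: the real targets loop over MSBuild items and also handle NETCOREAPP/NETSTANDARD symbols and minor versions.

```csharp
// Hypothetical illustration only: generating the "_OR_GREATER"
// preprocessor symbols for a .NET 5+ target framework, which the SDK
// targets compute by looping over the known framework versions.
using System;
using System.Collections.Generic;

static class ImplicitDefines
{
    public static IEnumerable<string> ForNet(Version target)
    {
        // e.g. net6.0 -> NET6_0, NET5_0_OR_GREATER, NET6_0_OR_GREATER
        yield return $"NET{target.Major}_{target.Minor}";
        for (int major = 5; major <= target.Major; major++)
            yield return $"NET{major}_0_OR_GREATER";
    }
}
```

Because the list of versions to loop over lives in items populated by targets, the result isn't available from evaluation alone, which is the obstacle described above.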

@drewnoakes How does CPS then run design time builds during the execution pass? If we wanted to try running just the generation of those directives earlier, I'm guessing that would be expensive?

drewnoakes commented 2 years ago

@jasonmalinowski today we try to run a single design-time build to gather data from all targets. I don't think running multiple passes would be a net win for the product.

In the case that the project system does have cached data on project load, we should be able to provide the correct arguments from caches during workspace construction, so long as the caches are up to date.

CyrusNajmabadi commented 2 years ago

@drewnoakes in this case, though, this causes Roslyn to go through and reparse everything :-/. It's definitely not ideal.

That said, @jasonmalinowski I have some ideas here.

Specifically, around storing whether a file contains any #if directives.

davkean commented 2 years ago

We should attempt to move everything we need for parsing back to evaluation so we don't need a design-time build to gather it. Targets also break the property editing experience.

rainersigwald commented 2 years ago

> @rainersigwald I see when dotnet/sdk#11635 was implemented, the computation of the implicit directives was put in the execution pass -- I'm guessing that was needed because some construct there isn't good for evaluation only? The context here: we're playing with an experiment to get the language service more accurate options up front, as otherwise we spend a lot of time parsing C# files only to potentially throw out the parses once we get the full design time build results back.

I don't remember details any more, but I think the main thing is the looping-over-items implementation. I mentioned (https://github.com/dotnet/sdk/pull/11635/files#r428619651) a possibility to shift the computation to SDK-construction time, and that still seems feasible. @marcpopMSFT may have further thoughts.

rainersigwald commented 2 years ago

(though note that we'll have to be resilient to execution-time changes even if we get them off the mainline path)

marcpopMSFT commented 2 years ago

> I don't remember details any more, but I think the main thing is the looping-over-items implementation. I mentioned (https://github.com/dotnet/sdk/pull/11635/files#r428619651) a possibility to shift the computation to SDK-construction time, and that still seems feasible. @marcpopMSFT may have further thoughts.

I don't recall either but I assume it was to ensure it automatically updated each new release so we wouldn't have to update it every year (and likely forget). Seems worth it though for the perf improvement goals.

jasonmalinowski commented 2 years ago

> @jasonmalinowski today we try and run a single design-time build to gather data from all targets. I don't think that running multiple passes would be a net win for the product.

@drewnoakes: I'd guess the targets being run here would already be run during the full design-time build pass. I would have assumed the cost of all the MSBuild execution for a project is just the cost of the underlying targets, but is that assumption wrong? Do multiple builds of different targets take longer than running them as one batch?

> In the case that the project system does have cached data on project load, we should be able to provide the correct arguments from caches during workspace construction, so long as the caches are up to date.

@drewnoakes: I know you had mentioned batching in an internal email conversation at one point; if the caches are present today how soon do we get a command line?

jasonmalinowski commented 2 years ago

> I don't remember details any more, but I think the main thing is the looping-over-items implementation.

@rainersigwald/@marcpopMSFT: is the general problem that an ItemGroup can't use other item groups during evaluation? And is there already a general mechanism for generating that some other way during SDK construction, or would one have to be invented? Happy to give it a shot if it's not too much pain.

davkean commented 2 years ago

It was going to be added for NuGet central package version support for legacy project system but I don't know what happened to that.

> @drewnoakes: I guess the targets being run would already be run during the full design-time build pass. I would have assumed the cost of all the MSBuild execution for a project is just the cost of the underlying targets, but is that assumption wrong?

Yes, that assumption is wrong; this is why we don't use MSBuild for the fast up-to-date check.

drewnoakes commented 2 years ago

> I know you had mentioned batching in an internal email conversation at one point; if the caches are present today how soon do we get a command line?

I haven't gathered numbers. Perhaps @arkalyanms knows. I would expect the delay involved in waiting for the DTB cache to load to be low enough that a user shouldn't notice the difference. Whether that's a good trade-off comes down to whether we avoid enough redundant work to speed up other areas. But we can't make a call here until we have some real-world numbers.

jasonmalinowski commented 2 years ago

We've completed this work.