Reactive approach to FSAC design - discussion

I’ve written a document describing some of the more interesting ideas for improving general FSAC/language server/IDE backend.

https://github.com/Krzysztof-Cieslak/notes/blob/master/FSAC%20design/Reactive_approach_to_FSAC_design.md

CC: @dsyme @cartermp @7sharp9 @forki @tihan @ninofloris @auduchinok

It's funny, I was thinking actor model about halfway through the entry, and then you mention it at the end. Actors would give a nice layer over supervision/restarts (as we see with managing FSAC restarts in ionide today), and stateful actors (with persistence) would be an interesting way to model something like the symbol cache, where the serialized state is (a subset of) the actor state

To start, I think it's worth contextualizing where we are now. F# (at least within VS) is about where Roslyn was when this issue was filed in 2015: https://github.com/dotnet/roslyn/issues/7082

To summarize that long thread:

There were some memory leaks that were tricky to uncover, and these affected people
Everything was running in the same 32-bit VS process, and even with memory leaks fixed, there was simply too much stuff going on
"Roslyn 2.0" (quotes since that was never an actual name), set for VS 2017, would have a tweaked architecture compared with VS 2015 to address the second point

Ionide and Rider use an out of process model for hosting FCS, but some of the fundamentals aren't much different. The working set can get huge in both IDEs, especially for larger solutions, and it's not uncommon to hear about 10+ GB of RAM being used. Many operations can be slow, and the fact that they're out of process doesn't make a difference.

As you're likely aware, we made a lot of changes to the compiler that aligned with VS 2019 GA after careful measurement of traces from users and our own dogfooding. Although there's still two issues left - boxing of ranges any time hash is called on it and LOH allocations with the LexBuffer - the big problems were resolved.

After the past 1.5 weeks of analyzing memory dumps and traces from a big internal solution under various loads, we've found two memory leaks - one of which is VS-specific and fixed, and the other we'll queue up a fix for (it also requires a change in the TPSDK, since that thing reflects into the internals of the compiler...). We also found a data structure that could be turned into a struct, saving up to 30MB of the working set for a given process using FCS. But barring any further findings, we're pretty much at the point where we need to consider a re-architecture for the F# tools in VS, and this will certainly affect Ionide and other tools as it will likely involve FCS to some degree.

Roslyn adopted an approach similar to what you write in your notes:

Various features areas are now out of process components (example: https://github.com/dotnet/roslyn/issues/6846). Analyzers were a big one to be out of process, because there are some that do full solution analysis, which is quite heavy.
Not every individual feature has its own process. It's based on architecture. As per the linked issue, IDE diagnostics are in the same process; there isn't a single process for each analyzer.
"64-bit all the things!" ... not really. Since 64-bit means every pointer is double the size, that matters for people using machines like laptops with less RAM. Just for reference, we measured a ~1.5x increase in memory usage with a prototype we have internally, so 64-bit <> better; such a thing should be based on measurements and not "because 64 bit is better".

Not all is rosy in this world, though. There was a tremendously long tail of bugs that the Roslyn team had dealt with; nearly a year's worth of them. As it turns out, communicating information between N processes is challenging, and there are additional perf considerations regarding IPC. If any one person thinks they could do this themselves, just know that a team where people each at 5+ years of experience building these sorts of things dealt with issues for nearly a year. It's a huge undertaking. Frankly, it's a hell of a lot easier to have one monolithic .dll be responsible for everything if you can get away with it.

Improving FCS

To avoid a similar tail of bugs, I think it's best to focus on improving FCS itself before multiple processes are considered.

One of the reasons why it worked out fairly well for Roslyn, despite the tail of bugs, is that Roslyn itself operates with an immutable snapshot model. When you invoke a feature, the known universe is quickly calculated with a compilation snapshot and everything is based entirely on that. If you take another snapshot and it's different than the previous one, the previous is discarded and the new snapshot is the known universe. If they're the same, then barely any work is done. FCS attempts to cache things that have been already figured out (see IncrementalBuilder.fs for more detail, though good luck in trying to understand its contents). However, in practice it is nowhere near as good as Roslyn's model due to the fact that files are re-checked all of the time.

Additionally, this model allows Roslyn to speculate on if certain work even needs to be done. For data flowing between cross-project references, Roslyn will speculate if things have changed based on the semantic meaning behind operations you perform. FCS has no ability to do this, so if you have a relatively deep chain of project-to-project references, changes will incur a typecheck of all dependent projects resulting in a lot of redundant work. This makes IntelliSense seem unreliably time-consuming when you make changes across multiple files.

Beyond what you mentioned about having a priority queue, FCS itself may need to adopt a snapshot model similar to Roslyn. Perhaps there are other alternatives, but the model has kept things sane in their enormous codebase, and because it allows certain speculation that's rather cheap to perform, IDE features can be pretty quick and consistent regardless of what you did prior to a particular action.

Additionally, a snapshot model like what Roslyn uses makes managing N processes significantly simpler. If all you do is work on the known universe that was computed (and compare against the last known universe efficiently), there isn't much complication in keeping things synchronized. The universe is what it is, and that's all there is to it. Failing to have a simplifying mechanism like this would spaghettify code and make N processes horribly difficult to manage properly.

Type Providers

Type Providers are a big problem. For those who aren't ware, ProvidedTypes.fs is an ad-hoc copy-paste of pieces of the F# compiler that has been further modified to accommodate the needs of type providers. You can think of every type provider library as being packaged up with its own fork of the compiler, and that fork does things subtly differently than whatever compiler a user is using.

This has serious performance ramifications:

Expensive operations like reading IL are done twice. The F# compiler does it, but so does the type provider. We've observed many dumps and traces where IL reading was a significant contributor to memory and CPU usage, and doubling that is simply bad for performance
Information that the F# compiler has already calculated is re-calculated redundantly. For example, determining where mscorlib is located, or what version of FSharp.Core is being compiled against. The F# compiler already knows this information, but it's re-calculated all of the time.
The architecture of FCS, which does enormous amounts of re-checking files when you modify them, means that the previous two points come into play again and again and again. There have been some efforts to attempt to cache things, but one of the trial solutions that was merged is actually a memory leak. So as of this moment, any type provider that uses an up to date TPSDK has a memory leak in it.

Furthermore, changing things in the F# compiler can cause downstream breaking changes for any consumer of type providers. This is because ProvidedTypes.fs also reflects into the internals of the F# compiler that is hosting the type provider. We realized this when we were working on a fix for a known memory leak. Fixing it would break any type provider loaded with a newer compiler.

What this means is to that fixes to the F# compiler have the potential to inadvertently break anyone using a type provider. Although this isn't going to happen very often, it can happen, and it will require all type provider authors to update their dependencies any time it will happen. This will be the case when we fix the known memory leak we found.

This is not a sustainable model at all, so there are two options:

Encourage people to not use type providers except for the purposes of data exploration and scripting
Change the architecture of the TPSDK and F# compiler such that the TPSDK is merely an API into the F# compiler, rather than an API into the fork of the F# compiler that resides inside the TPSDK

The first is untenable, but the latter also means that any time we ship an updated compiler, all type provider authors must update their dependencies. It also means that any consumer of type providers who does not update their dependencies could be broken just by updating their tool set.

An additional wrinkle here is that the TPSDK is neither owned, maintained, nor contributed to by Microsoft. We have no authority over it, yet it affects people using F# (and VS). We'll contribute back to it if we can, but if we must make a change before a release and doing so would break Type Providers, we will make that change and break Type Providers. Although they are part of the F# spec, none of their implementation is actually in our codebase. And we have no support obligation for any component that reflects into the internals of our systems. So, an additional approach we could take from the Microsoft perspective is to make the changes we need to make F# and FCS great, and say, "that's just too bad" if type providers break. I'd prefer not to have that happen, but it's being considered.

The symbol parts of the FCS API were supposed to be snapshots and not require a lock on the reactor etc, I dont think this was ever realised though.

@cartermp Your statement on the TPSDK is strange, it open source, MS is free to commit fixes just like other companies have when issues have been discovered.

From my point of view when things go bad in ionide the reactor count starts to go through the roof (1000+ in a minute or so then completions unavailable for at least 3 or 4 mins) which indicates that cancellations and queued messages are not behaving correctly. I think I would also make parse requests cancellable early too as well as just the type requests.

@cartermp The comments about the TPSDK aren't really accurate.

The technical issues are real - though not insurmountable. However the procedural issues are not accurate:

Microsoft took a decision in 2013 to make some of the components needed to write type providers part of a "starter pack" that would be internalized in each type provider and released under Apache 2.0. This was a team decision made by Visual F#.
Microsoft have always had commit and maintenance rights on these components as they moved from codeplex to "fsprojects" (a reasonable place for it to land in ~2016). I renamed this component to the Type Provider SDK about 18 months ago, made sure cross-compilation worked (the cause of increase in resource usage and complexity), and made sure the templates worked.
I am co-maintainer on that repo. You are welcome to be as well. It's perfectly possible us to co-develop it with the F# community.
ProvidedTypes.fs does not contain a copy of the F# compiler. It contains an encapsulated internalize IL binary reader and writer, which is a much smaller thing. These could conceivably be replaced by any other IL binary reader and writer (eg Mono.Cecil), though we have always been wary of have type provider components have binary dependencies, because they are loaded in-process. Fetching the information from the F# compiler is also a possibility, though comes with massive potential downsides.

There's absolutely nothing stopping us solving the issues identified together. We shouldn't just say "hey, we need to re-do everything or break compat" just because it's the first time Microsoft have worked with a technical problem in the area - that's not a sustainable path forward either. This has been a committed part of the surface area of the F# compiler and toolchain for many years and it is perfectly possible to make incremental improvements here.

We've long known the TP API used into the F# compiler has long had problems with cross-compilation, notably

The incorrect or incomplete reporting of target assemblies by the F# compiler. This is why internal reflection is used. We should take the opportunity to update the F# TP API to report this information correctly.
The costs associated with having TPs re-read assemblies and map from the "authoring" model to the target assemblies.It would be great to solve these problems in the API, without making it unsustainable from a versioning perspective.

@cartermp Have submitted a partial fix for the memory leak you mentioned here: https://github.com/fsprojects/FSharp.TypeProviders.SDK/pull/295. There's no "ideal fix" in this area (except re-using the readers from the F# compiler, though the problem there is the size of the compiler API exposed).

Unfortunately, my thoughts remain unchanged despite your points @dsyme. Failing hard evidence that the TPSDK is not as I characterized it - ad-hoc copy-paste of IL reading/writing that duplicates work and is thus horrible for performance, built in such a way that unrelated fixes to the compiler can break all users - I'm just not convinced.

Keep in mind that my statements aren't how we feel about it, this is based on two weeks of analyzing traces, dumps, and the TPSDK codebase, understanding why the diagnostic data is the way it is, and trying fixes for issues we identified. If you or others have hard data to suggest otherwise, I'm absolutely happy to consider things differently.

I also suggest we table a discussion about the TPSDK here and move it elsewhere. There's plenty to discuss about FCS itself, which certainly doesn't have a shortage of addressable issues.

I'm just not convinced.

Both Kevin and I have long known about the issues you identify (Kevin in passing, me in detail). However, they must be seen in context. Essentially, we had to make compromises in 2015-18 and I'll stand by the them - indeed I'll stand by every line in ProvidedTypes.fs, though am by no means proud of it all. I understand you've spent two weeks discovering that compromises were made and there are lingering issues. What you see is essentially the lingering cost of the transition to .NET Core/Standard. We should now make progress on the specific remaining issues.

Put simply, in 2015-18 we were focusing on the the critical need to for type-providers to both execute in different runtime contexts (because of .NET Core) and cross-compile to different runtime targets (because of .NET Standard). Addressing this was by far our highest priority. At the time, type providers were assuming both .NET Framework design-time tooling and .NET Framework targets, including Reflection.Emit at design-time for generative type providers. We had to fix that. So we did.

Further, at the time our TP-to-compiler API was more or less "frozen" and couldn't even communicate the set of assemblies correctly. This was largely because we had no functioning cycle of iterative improvement to either FSharp.Core nor the compiler nor FCS. These delivery issues have now been fixed, which is why we can now make progress.

Further, we had to avoid the situation where type provider components had forced, unnecessary binary dependencies. Each TP had to be potentially stand-alone, since they are loaded into design-time tooling where conflicts can happen.

For this reason, I chose to do the following (this was after long discussions with Kevin where he didn't necessarily agree but eventually gave in since, well, we really had no other choice):

We removed any TP design-time dependency on reflection emit, by internalizing an optional binary writer in each generative TP. Remember, at the time Reflection Emit didn't even exist for .NET Core. Further the use of Reflection Emit was a source of chronic bugs where wrong code was being generated for wrong target binaries.

Today, there are no known bugs nor resource problems in that binary writer - the issue here is code duplication - but that is an issue internalized to the TPSDK, we can make progress on that, I've recorded an issue for that.

My judgment is we made the right call here, and that there is no urgency in changing this, since there are no known bugs with the approach.
We removed any TP design-time dependency on reflection-only-load to read the context DLLs for the target compilation, by internalizing a binary reader.

The original approach of using reflection-only-load was as buggy as all hell for cross-targeting type providers. In contrast, the current use of the binary reader has zero known correctness bugs. Yes, the duplication of metadata is the cause of our resource usage problems and I've recorded a tracking issue for this here.. However the duplication is not catastrophic, though is problematic.

My judgment is that we made the right call here, but now need to make progress on the resource usage. The code duplication is not something I care about as long as it is internalized. I'd love to get rid of both the code duplication and the resource duplication (though exactly how we do that with out massively complexifying the API of the F# compiler I'm not sure)
The TPSDK uses a terrible reflection hack to determine the list of assemblies in the target context.

This can be remove because https://github.com/Microsoft/visualfsharp/pull/1037 is now sufficiently widespread (included in VS2017)

Finally, there is one other major sin in the TPSDK to achieve cross-targeting, notably

The TPSDK uses a set of reflection hacks into quotations.fs to create quotations for target contexts, because the runtime checks performed by quotations.fs are overly stringent, assuming that all quotations are being created for the executing design-time context.

This was a compromise but it was done knowingly as enabling something that FSharp.Core should always have supported. We can make progress on this, see the tracking bug

The last thing we did was this:

We adjusted the type provider loading mechanism to load TPDTC components from typeproviders directory in the package. This allows type providers to have localized dependencies (though each type provider is not in an isolated load context - the dependencies can still interfere)

My assessment is that this mechanism is stable and functioning.

Together these compromises allowed us to deliver cross-executing, cross-targeting type providers in 2017-18. Mission accomplished.

So yes I would say the "adhoc" characterization is wrong. For the TPSDK, the use of internalized binary reader/writers was to reduce dependencies and was deliberate. As a result I can justify (though am not proud of) every line of code in ProvidedTypes.fs, given the priorities we had 2015-18. The priority was cross-targeting type providers without extensive dependencies on reflection emit and reflection-only-load. For that, we needed a binary reader/writer that we could internalize and we used the one we had available.

Yes, there is duplication of resources and code. But we can now reduce and solve that resource-duplication problem. I made compromises (knowingly), you've discovered these and some of their current ramifications, and now we should now make progress on the issues above.

The symbol parts of the FCS API were supposed to be snapshots and not require a lock on the reactor etc, I dont think this was ever realised though.

The use of members on the symbols do not require or take locks.

@dsyme Thanks, we'll engage on the TPSDK issues.

@7sharp9

I think I would also make parse requests cancellable early too as well as just the type requests.

Parsing has been free-threaded for quite some time (thanks @auduchinok and @dsyme); this used to muck up the reactor queue but not anymore. In general it's made the IDE experience much more responsive.

Based on the perf analysis work we've done, there are incremental areas to improve and architectural changes that we may need to make:

Incremental

Pickling F# metadata allocates 200MB on the LOH, which over time leads to long GC pauses: https://github.com/Microsoft/visualfsharp/issues/5933 (in the case of this trace, 3.5GB of LOH allocations)
Excessive tuple allocations in typechecking https://github.com/Microsoft/visualfsharp/issues/5938 (struct tuple didn't help)
Ranges are still boxed when they're hashed: https://github.com/Microsoft/visualfsharp/issues/6052

Architectural

We do excessive re-checking of files in normal usage when it's barely needed, exacerbating the above issues
We don't have a speculative design about if we need to recompute things for changes in dependent projects, making things like changes in one project take a long time to show in completion for another

Krzysztof-Cieslak / notes

Reactive approach to FSAC design - discussion #1