dotnet / roslyn

[ML] [Performance] Caching compiler data #27108

Closed: jamesqo closed this issue 1 year ago

jamesqo commented 6 years ago

/cc @heejaechang This is from an email you sent me a couple of months ago.

One thing we would be interested in improving is caching. Right now, Roslyn uses various caches, but they are either very simple MRU/LRU caches or code that explicitly enables or disables caching for certain parts of the data.

It would be nice if we could use machine learning to figure out what portion of the code/data a particular user is consuming most and cache more of that data in memory. Basically, Roslyn deals with a comparably big graph, and with multiple versions of that big graph. If we can better predict what part of that graph users are most likely to consume in the near/mid-term future, it will be a big win for users.

If the problem of choosing what data to cache is still a significant perf issue for the compiler, I'd be happy to help. I looked over the IProjectCacheService file you linked to and this is the gist of what I got:

@heejaechang, I think you were saying before that you think a machine learning algorithm could help out by caching only the parts of compilations / syntax trees (symbols / nodes respectively) that were likely to be used later, rather than the whole tree. This would improve compiler memory usage & potentially make it beneficial for a command-line scenario. Is that correct?

heejaechang commented 6 years ago

@jamesqo yes. Improving that will help us get better perf (CPU- and memory-wise). And this cache is only for the IDE, not the command line, since the command line deals with only 1 project at a time and holds everything in memory.

The IDE deals with the whole solution (not just 1 project) and doesn't want to hold everything in memory at the same time.

heejaechang commented 6 years ago

tagging @sharwell

jamesqo commented 6 years ago

@heejaechang Ok, thank you for clarifying. I want to make sure I'm understanding everything correctly-- so the goal is to cache common subtrees / individual symbols that are likely to be used instead of the entire compilation / syntax tree, correct? For example, if we see a particular expression a lot like if (arg == null) { throw new ArgumentNullException(nameof(arg)); }, then we would want to reuse the SyntaxNode rooted at the if statement, rather than re-instantiating a bunch of nodes each time?

I'm also confused about how the MRU cache works-- looking here, the values from _nodes are never actually exposed publicly. Is the intent to keep the instances alive (if caching is enabled) by storing them in _nodes, and then return weakrefs to them afterwards?

heejaechang commented 6 years ago

so the goal is to cache common subtrees / individual symbols that are likely to be used instead of the entire compilation / syntax tree, correct?

No, that's not what I meant. A tree can't be split into subtrees, and a symbol can't live outside of its compilation.

..

Let me put it this way: for a small solution, the cache doesn't matter. We can keep everything in memory and it won't cause much of a problem, and we are not looking for micro-optimizations that save something like 1K-10K bytes of memory.

What we are looking for is saving 100+ MB of memory or GB+ of allocations.

For a solution like Roslyn, which has 150+ projects and 10,000+ files, holding everything in memory means 150 compilations, 10,000 trees, 10,000 texts, etc. (that can easily go over 1 GB). Since we use a snapshot model, that can be multiplied by however many snapshots are in use at any given time (we do share a lot of the same data to reduce objects in memory, but in the worst case there could be no data shared between snapshots).

Currently we use a very simple algorithm to decide what data to drop and what data to hold in memory.
Obviously, the more data we drop, the more memory we save, but we need to spend IO/CPU/allocations to re-create that data if someone needs it. The more data we hold, the more memory we use, but we save IO/CPU/allocations since we don't need to re-create it.

If possible, we would like to improve that algorithm to monitor the resource usage pattern (what tree is being used, what symbol/compilation is being used, what text is being used, etc.) and decide what to drop and what to hold in memory, which hopefully in return makes Roslyn faster (since we get a higher cache hit rate) while using less memory (since we hold less unnecessary data).

heejaechang commented 6 years ago

In case you wonder what unit we use for this cache: mainly 3 types, SyntaxTree (nodes), Compilation (symbols), and Text.

We usually put texts that are bigger than 4K in a memory-mapped file and read them in when needed, except for open files; for open files, the editor buffer acts as the cache. Texts smaller than 4K are always in memory.

For SyntaxTree (or the root SyntaxNode), we have caches such as the active-project cache that hold onto every root there regardless of whether we need them or not. I forget whether we hold root nodes alive for dependent projects as well.

For Compilation, I believe we cache the 2 most recent ones + the active project + its dependencies...

sharwell commented 6 years ago

We usually put texts that are bigger than 4K in a memory-mapped file and read them in when needed, except for open files; for open files, the editor buffer acts as the cache. Texts smaller than 4K are always in memory.

Put another way, we cache the syntax trees that result from parsing files. Source files less than 4K are held in memory, while the remaining ones are only held by a WeakReference that allows them to be released. We've observed various profiling data indicating that this is imperfect:

Some projects have a few very large files, where restoring one of the files from the cache back into memory results in many other weak references getting released. In the worst cases, an algorithm is attempting to operate on a cycle of related data which constantly releases references, and the performance rapidly degrades into a long serialization operation.

The baseline case simply holds weak references to the parsed syntax tree. An improved algorithm would replace the current "4KiB threshold" algorithm with one that more aggressively holds "relevant data" in memory and makes room for it by more aggressively avoiding references to the small files getting held today.
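
As a very rough illustrative sketch (hypothetical type and members, not the actual ProjectCacheService code), the current policy is essentially:

using System.Collections.Concurrent;
using Microsoft.CodeAnalysis.Text;

// Hypothetical sketch: texts under the threshold are rooted strongly, larger ones
// only via WeakReference, so the GC may reclaim them and they must be re-read/re-parsed on demand.
sealed class ThresholdTextCache
{
    private const int StrongHoldThreshold = 4 * 1024; // the "4KiB threshold"

    private readonly ConcurrentDictionary<string, SourceText> _strong =
        new ConcurrentDictionary<string, SourceText>();
    private readonly ConcurrentDictionary<string, WeakReference<SourceText>> _weak =
        new ConcurrentDictionary<string, WeakReference<SourceText>>();

    public void Add(string path, SourceText text)
    {
        if (text.Length < StrongHoldThreshold)
            _strong[path] = text;                              // small file: always in memory
        else
            _weak[path] = new WeakReference<SourceText>(text); // large file: GC may drop it
    }

    public bool TryGet(string path, out SourceText text)
    {
        if (_strong.TryGetValue(path, out text))
            return true;

        return _weak.TryGetValue(path, out var weak) && weak.TryGetTarget(out text);
    }
}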

jamesqo commented 6 years ago

Thanks @sharwell / @heejaechang for clarifying. I've thought about this for a bit.

Let's take a step back and see if we can improve on the algorithm without introducing machine learning just yet. One thing I think we can take advantage of is that people tend to stay on the same directory level while opening files. We can assign each file a cache priority between 0 (low) and 1 (high) based on its directory level:

Files inside the folder with the lowest cache priority will be the first to go in order to make room for incoming files.

The priorities can be calculated like this:

// A file in folder A is initially opened
double p_A = 1.0;
double p_B = 0.0;

// Each time another file is opened, decay every folder's priority...
p_A *= 0.9;
p_B *= 0.9;

// ...then boost the folder the newly opened file belongs to
// (newFileOpenedInFolderA: whether the newly opened file lives in folder A).
if (newFileOpenedInFolderA)
{
    p_A += 0.1;
}
else
{
    p_B += 0.1;
}

// If the user opens lots of files in folder A, p_A will stay close to 1.
// If the user stops opening files in folder A, p_A will decay by (0.9)^n.

This can replace the current system of caching everything in the current project... thoughts?

jamesqo commented 6 years ago

@heejaechang

For SyntaxTree (or the root SyntaxNode), we have caches such as the active-project cache that hold onto every root there regardless of whether we need them or not. I forget whether we hold root nodes alive for dependent projects as well.

For Compilation, I believe we cache the 2 most recent ones + the active project + its dependencies...

Is there a reason syntax trees / compilation would have different heuristics for caching?

@sharwell

Some projects have a few very large files, where restoring one of the files from the cache back into memory results in many other weak references getting released. In the worst cases, an algorithm is attempting to operate on a cycle of related data which constantly releases references, and the performance rapidly degrades into a long serialization operation.

I think I get the gist of what you're saying. So the weakly referenced values aren't in the cache but still in memory because the GC hasn't kicked in yet, but large files being read in from disk will cause them to be collected. (I assume you meant restoring one of the large files from disk, since the cache is already in memory.) And the objective would be to keep those small, relevant files in the cache.

heejaechang commented 6 years ago

@jamesqo I believe we have tried many heuristic-based approaches and, in the end, dropped them and settled on this simple logic, since all the heuristic ones added complexity but didn't give consistent results. So there is no point in maintaining such complex logic when the simple one is just as good.

That's why I was interested in ML: I want to move away from heuristic-based approaches, which don't actually use the actual usage pattern of the current user, with the current solution, at this very moment.

...

For example, your heuristic might work for some solutions but not for others; some people have all their files in 1 flat folder containing several thousand files. From the language's point of view, a folder has no meaning.

heejaechang commented 6 years ago

@jamesqo I actually didn't want to get into too much detail about how our implementation works, since I didn't want this cache logic to leak into all of our complex implementation.

What I was thinking of was a black box (the ML logic): whenever a feature wants to access a syntax tree (root node), a compilation (symbols), or a text, I let the ML know about it. The cache basically asks the ML "should I cache it?", and the ML simply says yes/no based on the last 5 minutes of the data's usage pattern.

Each cached item has an expiration timeout (say 5 minutes); whenever an item expires, the cache asks the ML "should I keep it in the cache?", and the ML simply answers yes/no based on the last 5 minutes of that item's usage pattern.

This will let us cleanly decouple the cache logic (and the ML) from the rest of our implementation.

The ML should not care how I cache the data, when it gets kicked out, or how it gets persisted or re-created; that's not the ML's job. The ML's job is simply to predict, from the last 5 minutes of usage data, whether the data looks likely to be re-used soon. (We can add weights later, like how expensive the data is to re-create, but for the first version I think we can ignore that.)
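
To sketch the shape of that boundary (the names below are hypothetical, just to illustrate the decoupling; this is not an actual Roslyn API):

// Hypothetical advisor interface: the cache reports accesses and asks yes/no questions.
// The advisor never caches, evicts, persists, or re-creates anything itself.
interface ICacheAdvisor
{
    // Called whenever a feature requests a syntax tree (root node), a compilation (symbols), or a text.
    void OnAccessed(CacheItemKind kind, object key);

    // Asked when an item is about to be cached: based on the recent usage pattern,
    // should we hold it in memory at all?
    bool ShouldCache(CacheItemKind kind, object key);

    // Asked when an item's expiration timeout (say 5 minutes) fires:
    // does it look likely to be re-used soon, so we should keep it?
    bool ShouldKeep(CacheItemKind kind, object key);
}

enum CacheItemKind { SyntaxTree, Compilation, Text }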

jamesqo commented 6 years ago

For example, your heuristic might work for some solutions but not for others; some people have all their files in 1 flat folder containing several thousand files.

Well... do people do that in big projects? o.o I feel sorry for their maintainers. Anyhow, in this case the heuristic I've proposed would be no worse than caching everything in the project, which is what we're doing now.

From the language's point of view, a folder has no meaning.

Yes, but it would still help predict usage patterns.

@heejaechang In your second comment, what exactly do you mean when you say "last 5 min usage pattern of the data"? What context would the ML algorithm have access to in order to base its predictions off of?

heejaechang commented 6 years ago

@jamesqo No, we don't cache everything in memory. We cache data around the active project in memory, but not the other ones. By "active", I mean the project containing the type that has focus in the VS editor.

Anyway, I don't want to get into how our caching works in detail. If you want, I can point you to the code and explain a bit, but I don't think that's relevant here since we are not looking to incrementally improve what we have now.

...

Yes, but it would still help predict usage patterns.

Sure, it could, but I would think something more semantic is better for such predictions, like whether a file/type belongs to the same project, whether 2 projects have dependencies on each other, etc.

Anyway, I was expecting a kind of variation of the smart IntelliSense we released at Build: if a user is working on a type A in the editor, it knows that type B and type C are usually used along with type A, so it makes sure the text and syntax trees for types B and C are in the cache. ...

What context would the ML algorithm have access to in order to base its predictions off of?

We are Roslyn, so we can give the ML training whatever information it wants.

We can let it know what syntax tree is requested, what symbol is requested, and what text is requested. That could include the content, what project it belongs to, what architecture it is targeting, what parsing/compilation options are in use, etc.

Whatever info the ML needs to make its predictions better.

...

In the end, I hope that rather than caching every tree in the active project, it would only cache the subset of the active project's trees that are currently being actively used. And instead of keeping unused trees in memory just because they belong to the active project, it would be nice if trees that belong to dependent projects of the active project were in the cache instead, because the ML knows those trees are usually used along with the trees that are actively being used now.

something like that..

CyrusNajmabadi commented 6 years ago

I'm curious why we couldn't just use an LRU/MRU system for this sort of thing. Just keep around the stuff that people are touching, and when we need to evict, evict the very oldest...

jamesqo commented 6 years ago

@CyrusNajmabadi I think that's the system in place right now. @sharwell had a good explanation above on why it's not so great.

jamesqo commented 6 years ago

@heejaechang I thought about it some more. I still feel like machine learning could be overkill, but we should let the algorithm speak for itself. If it finds that there are other really good predictors besides whether 2 files are in the same folder, then maybe introducing ML won't be such a bad choice.

I think I have a plan for the appropriate ML algorithm to be used here. Every time a new document is opened we would grab info from the document and use the ML algorithm (softmax regression) to compare it with the info of other documents in the cache. That algorithm would output the probabilities for each document in the cache being the next one the user is going to open.
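
As a rough sketch of that step (the features and weights below are hypothetical placeholders, just to show the shape of the computation, not a real implementation):

using System;
using System.Linq;

static class NextDocumentPredictor
{
    // Example relational features for (newly opened document, cached document) pairs.
    public static double[] ExtractFeatures(bool sameFolder, bool sameProject, double minutesSinceLastUse) =>
        new[] { sameFolder ? 1.0 : 0.0, sameProject ? 1.0 : 0.0, 1.0 / (1.0 + minutesSinceLastUse) };

    // Softmax regression: a linear score per cached document, normalized into
    // probabilities that sum to 1 ("probability this is the next document opened").
    public static double[] Predict(double[][] featuresPerCachedDocument, double[] learnedWeights)
    {
        var scores = featuresPerCachedDocument
            .Select(f => f.Zip(learnedWeights, (x, w) => x * w).Sum())
            .ToArray();

        var max = scores.Max();                                     // for numerical stability
        var exps = scores.Select(s => Math.Exp(s - max)).ToArray();
        var sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}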

In order to implement it, I would need to have some code run in ProjectCacheService every time a new document is opened in VS. Do you know how I can do that?

Also (more importantly), I would need to collect telemetry from other VS users in order to feed training data to this algorithm. No matter how good an ML algorithm is, it needs lots of training data before it can perform well. Would this be OK? If so, how can I start recording telemetry events to run each time a user switches a document?

heejaechang commented 6 years ago

I was hoping to train the ML on just 1 user's data (like what we do here - http://source.roslyn.io/#Microsoft.CodeAnalysis.Remote.Workspaces/Diagnostics/IPerformanceTrackerService.cs,e7216ad2a0832085,references). I don't think we can use other people's data here. If it requires other people's data, it would be a no-go.

And I guess predicting which document will be opened next is useful, but what I was looking for was something like this:

The user is working on type A. Type A has BaseA as its base type and implements InterfaceA, B, C.

Type A has methods M1, M2, M3, M4 which return types MR1, MR2, MR3, MR4, etc., and BaseA has methods BM1, BM2, BM3, BM4 which return types ...

So this becomes a graph, and each method has references to symbols from the projects the containing project references, and from the project itself.

Among those symbols, by looking at which symbols are most likely used together, we put that info in the cache.

...

If you think it won't provide much benefit, that's fine; it's not like we must do this. I can ask our data-scientist people whether they ever do the smart IntelliSense stuff per solution (basically analyzing the current solution to predict what the user is most likely to consume next); we could also use that for caching.

heejaechang commented 6 years ago

I'm curious why we couldn't just use an LRU/MRU system for this sort of thing. Just keep around the stuff that people are touching, and when we need to evict, evict the very oldest...

We had that, and we got rid of it, since it turns out blindly doing that uses almost the same amount of memory as just caching everything for the active project and its references.

We need a better algorithm than that to actually get any improvement.

jamesqo commented 6 years ago

I was hoping to train the ML on just 1 user's data (like what we do here - http://source.roslyn.io/#Microsoft.CodeAnalysis.Remote.Workspaces/Diagnostics/IPerformanceTrackerService.cs,e7216ad2a0832085,references). I don't think we can use other people's data here. If it requires other people's data, it would be a no-go.

Well, any ML algorithm needs a huge training set if it is to perform well. If we don't have access to other people's data, it's also kind of a no go. For example, the IntelliSense system that was released last month is trained on thousands of open-source codebases-- one solution isn't going to cut it.

So this becomes a graph, and each method has references to symbols from the projects the containing project references, and from the project itself.

Among those symbols, by looking at which symbols are most likely used together, we put that info in the cache.

I don't fully understand what you're asking here?

heejaechang commented 6 years ago

What I was trying to say is: for a small solution, we can just cache everything in memory or use very simple cache logic; there isn't much of a perf issue.

We want this for big solutions that contain hundreds of projects, tens of thousands of files, millions of lines of code, etc., since that is where a smarter cache will show its benefits. And for such a solution, what I was trying to say is that the solution itself might contain enough code/graph (data) to train the ML.

It also doesn't need to be ML only. It can be a simple MRU/LRU + ML, where the MRU/LRU handles the obvious items (the ones getting hit repeatedly right now) and the ML handles usage patterns (items not being used repeatedly right now, but which we know have a close relationship with each other).

svick commented 6 years ago

How would the ML algorithm be tuned and evaluated? Is there some data set that can be used for that? Or a way to build such a data set?

jamesqo commented 6 years ago

@heejaechang I think I get more of what you're saying now, correct me if I got anything wrong: based on whatever users are currently working on, we would predict which symbols/trees (corresponding to types, methods, etc) currently in the cache are likely to be needed next, and assign a higher priority to those in the cache. This would only be done for very large solutions to give the ML algorithm enough training data.

heejaechang commented 6 years ago

@jamesqo exactly.

jamesqo commented 6 years ago

@heejaechang If that is the case, I'm still not sure it would be feasible without access to other people's data. If the goal is to predict which symbols/trees are most likely to be used next, we will need lots of data on which symbols/trees the user actually requests; without that usage context, the data is meaningless, because there's no metric the model can be evaluated against.

heejaechang commented 6 years ago

@jamesqo @svick The solution has millions of items (projects, documents, symbols, trees, texts) in it, full of strong relationships (pointers and dependencies) between them.

And all the executable code (millions of lines of code) has weak relationships (which symbols/trees/texts/APIs are used together) between those items. Are you saying that is not enough to train the model?

And this data gets updated as the user codes in the IDE.

..

I am 100% sure that data (graph) will be bigger than all the NuGet data you could gather from every NuGet package ever created.

svick commented 6 years ago

@heejaechang

Are you saying that is not enough to train the model?

I don't know, I don't have enough experience with ML to answer that.

What I do know is that there are many ML algorithms, several of which might be suitable for this task, and each will likely have some hyperparameters.

My question is: How do we decide which algorithm to use, and what values of its hyperparameters to use? Or how do we even decide that a particular ML algorithm with particular values of its hyperparameters is better than the current approach?

heejaechang commented 6 years ago

@svick If we decide this is possible, then we can create a way to measure effectiveness. Previously, when we worked on the cache, we did this:

Pick some scenarios such as typing, Find All References, Navigate To, etc.

Run those scenarios with the current VS and record perf data such as the time taken, memory used, UI responsiveness, allocations, CPU time, cache hit rate, cache size, etc.

Change the cache to the new one.

Run the same scenarios and see whether any perf numbers improved (did the feature run faster? use less memory? have better UI responsiveness? fewer allocations? less CPU time? a better cache hit rate? a smaller cache? etc.).

And choose the one that is generally better than the others.

jamesqo commented 6 years ago

@heejaechang

The solution has millions of items (projects, documents, symbols, trees, texts) in it, full of strong relationships (pointers and dependencies) between them.

And all the executable code (millions of lines of code) has weak relationships (which symbols/trees/texts/APIs are used together) between those items. Are you saying that is not enough to train the model?

Nope, it's not; let me explain. Machine learning isn't magic-- it's basically automated trial and error (or at least in this scenario). In order to find the best possible solution, the model needs a measure of effectiveness for the predictions it makes. In this case, we are trying to predict which symbols we've cached are likely to be used next, so what we need is not millions of files, we need millions of datasets that record what symbols had to be loaded during a session, because that's how the model can tell if it made good predictions or not (compare its predictions against the actual symbols that had to be loaded).

This is why we probably need data from other users-- a single user won't have millions of IDE sessions, so we'll have very little to base our predictions on.

heejaechang commented 6 years ago

@jamesqo But you can see live whether the cache got hit or not, so it can know right away, live, whether its prediction was right?

jamesqo commented 6 years ago

@heejaechang Yes, but it needs to do that a couple of million times before it starts making good predictions. Without any prior training, the model will make terrible predictions and fail all the time. So it's necessary to gather data from other users and train the model against that before it can make good predictions for the current user.

heejaechang commented 6 years ago

@jamesqo If that is a problem, we can mix the ML with another approach like what we have now, and only switch to the ML once it is confident that it has enough training data. And that data can be saved to disk, so next time, for the same solution, it has that info right away.

jamesqo commented 6 years ago

@heejaechang I still dunno-- unless the user is doing a superhuman amount of coding, there won't be a whole lot of data to train on I think. Collecting other people's data would be the only way around this.

jamesqo commented 6 years ago

@heejaechang Why is collecting telemetry a no go? Doesn't VS already do this to some extent?

heejaechang commented 6 years ago

@jamesqo Getting user data is a privacy issue; not sure whether you've heard of GDPR. We might be able to, but that would require a lot of hoops to jump through, if it's possible at all. And what I was looking for was not a generic cache but one specific to this solution, since solutions tend to have a lot of their own symbols that don't appear in other solutions, so I'm not 100% sure what we'd get by training on random users' random solutions. We already hold onto all metadata in memory, so symbols from, say, the .NET Framework are not a concern here; all the caching we are talking about is for the source of a solution.

there won't be a whole lot of data to train on I think.

Can you give me a reason why you think it is not enough? When something like errors are calculated, or a completion set is calculated, or signature help is computed, it will touch the cache hundreds of times. A simple action like the user typing 1 character in the IDE will produce hundreds of data points. Also, analyzing the solution itself will give you a lot of data, assuming there are millions of lines of code.

So what I was thinking was: you have a lot of raw data from the code, you can extract common relationships or patterns from the code itself, and then, while the user is typing, you can adjust the predictions based on cache hits/misses. We also know the context, since you know what type/symbol/text is requested and why.

Just set a breakpoint on something like GetSyntaxAsync or GetTextAsync or GetSemanticModel or a Symbol constructor, type a character in the IDE, and see how many times it gets hit.

jamesqo commented 6 years ago

Getting user data is a privacy issue; not sure whether you've heard of GDPR.

Haha I'm sure everyone with a functioning email address has heard of it :wink: But still, doesn't Visual Studio already send usage data to Microsoft in some form or another?

and what I was looking for was not a generic cache but one specific to this solution, since solutions tend to have a lot of their own symbols that don't appear in other solutions.

ML algorithms don't need to be tied to specific data. We can ask questions that describe how a symbol relates to other symbols in the cache, instead of anything specific about the symbol itself. The ML algorithm can make its decision based on that.

it will touch the cache hundreds of times. A simple action like the user typing 1 character in the IDE will produce hundreds of data points.

I didn't realize that! In that case, it's much more feasible to use ML here.

jamesqo commented 6 years ago

@heejaechang Please help me out here. I got around to setting up Roslyn on my machine and debugging it in order to better understand how exactly the cache is being used. I made a new console app like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        [|] // caret position where a character is typed
    }
}

When I set a breakpoint on the Symbol constructor and type a letter like 'e', I do see that it gets hit a lot of times. However, when I set a breakpoint on SimpleMRUCache.Touch(), it's not hit at all. I don't see caching logic in the call stack where the Symbol constructor is called, either.

heejaechang commented 6 years ago

That is usually for closed files, i.e. things not handled by the active cache (which usually covers open files): http://source.roslyn.io/#Microsoft.CodeAnalysis.Features/Workspace/ProjectCacheService.cs,28

For open files, see this: http://source.roslyn.io/#Microsoft.CodeAnalysis.Features/Workspace/ProjectCacheService.cs,26

heejaechang commented 6 years ago

@jamesqo So are you planning to implement this (http://www.jaleels.org/ajaleel/publications/isca2010-rrip.pdf) for Roslyn?

That would be awesome.

jamesqo commented 6 years ago

@heejaechang

So are you planning to implement this?

Of course, I think it's really interesting. I'm just trying to work through the code right now and understand things a bit better before I go delving in.

jamesqo commented 6 years ago

@heejaechang Looked into the code. As I understand it, right now we are using weak references to do caching, so the GC decides when an object is evicted. In order for RRIP to work, we ourselves have to decide when to evict the object, not the GC.

So in other words, I think we would have to somehow signal to the GC "collect this weakref before this one because it's less likely to hold relevant data." I don't know whether it's possible to do this in .NET.

jamesqo commented 6 years ago

So in other words, I think we would have to somehow signal to the GC "collect this weakref before this one because it's less likely to hold relevant data." I don't know whether it's possible to do this in .NET.

/cc @jkotas

CyrusNajmabadi commented 6 years ago

Java has the concept of a 'SoftReference' which they document as: Soft references are most often used to implement memory-sensitive caches.

The idea of a SoftReference is that, unlike a WeakReference, it's not necessarily going to go away during a GC if it is only soft-referenced. In other words, the GC will run, and only collect the viable soft references, if it still thinks there's a strong need to release memory to the system. Otherwise, if there is no such pressing need, it will allow the soft-referenced values to stay around in case the program may possibly need them in the future.

Does .NET have any sort of equivalent?

heejaechang commented 6 years ago

@jamesqo If possible, I want this ML logic not to be tightly coupled with our actual implementation, as I wrote above.

We want to consult this ML box about whether we should cache something or not, or evict it or not, but we don't want it to actually cache or evict anything.

heejaechang commented 6 years ago

Here is the comment I wrote above (https://github.com/dotnet/roslyn/issues/27108#issuecomment-391954400).

jamesqo commented 6 years ago

@heejaechang Ah, but RRIP deals with the actual implementation. It assumes a fixed-size cache (e.g. limited to N entries) and assigns an integer to each cache entry that represents its priority for eviction. That won't work here, since the GC decides in what order things are evicted, not us, because we're using weakrefs.
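
For reference, here is a minimal sketch of what the paper's SRRIP policy looks like (a hypothetical, self-contained cache, not Roslyn code); note that it only makes sense when the cache itself picks the victim:

using System.Collections.Generic;
using System.Linq;

// Minimal SRRIP sketch: fixed capacity, each entry carries a 2-bit
// re-reference prediction value (RRPV); 0 = "re-used soon", 3 = "distant".
sealed class RripCache<TKey, TValue>
{
    private const int MaxRrpv = 3;

    private readonly int _capacity;
    private readonly Dictionary<TKey, (TValue Value, int Rrpv)> _entries =
        new Dictionary<TKey, (TValue Value, int Rrpv)>();

    public RripCache(int capacity) => _capacity = capacity;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            _entries[key] = (entry.Value, 0); // hit: predict near-immediate re-use
            value = entry.Value;
            return true;
        }

        value = default(TValue);
        return false;
    }

    public void Add(TKey key, TValue value)
    {
        if (!_entries.ContainsKey(key) && _entries.Count >= _capacity)
            Evict();

        _entries[key] = (value, MaxRrpv - 1); // insert with a "long" re-reference prediction
    }

    private void Evict()
    {
        while (true)
        {
            // Evict an entry predicted to be re-referenced in the distant future.
            foreach (var pair in _entries)
            {
                if (pair.Value.Rrpv == MaxRrpv)
                {
                    _entries.Remove(pair.Key);
                    return;
                }
            }

            // No such entry yet: age everything (increment RRPV) and retry.
            foreach (var key in _entries.Keys.ToList())
                _entries[key] = (_entries[key].Value, _entries[key].Rrpv + 1);
        }
    }
}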

jamesqo commented 6 years ago

@heejaechang I don't know if it's going to be possible to get through this without touching the caching implementation. Currently, weakrefs are being used to implement the cache. With weakrefs, you cannot tell whether it was a good idea to keep them around, i.e. you cannot detect whether or not their Target property was accessed before they are GC'd. As a result, the ML algorithm has no way to tell whether the decision it made (cache / don't cache) resulted in more cache hits or misses.
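
To even observe that, the cache would need something like an instrumented wrapper (hypothetical, not existing Roslyn code), which already means changing the caching implementation:

using System;

// Hypothetical instrumented wrapper: records whether the target was ever resolved,
// so a predictor could later score "was keeping this entry alive actually useful?".
sealed class TrackedWeakReference<T> where T : class
{
    private readonly WeakReference<T> _weak;

    public int HitCount { get; private set; }

    public TrackedWeakReference(T target) => _weak = new WeakReference<T>(target);

    public bool TryGetTarget(out T target)
    {
        if (_weak.TryGetTarget(out target))
        {
            HitCount++; // the cached value was actually used
            return true;
        }

        return false;
    }

    // True once the value has been collected without ever being used:
    // the kind of feedback signal the ML would need for its cache/don't-cache decision.
    public bool WasCollectedUnused => HitCount == 0 && !_weak.TryGetTarget(out _);
}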

jamesqo commented 6 years ago

@CyrusNajmabadi Looks like there isn't one in C#. https://stackoverflow.com/a/7102321/4077294
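
The closest approximation I can think of (purely an illustrative sketch with a hypothetical type, not a real or proposed API, and it has no view of actual memory pressure) would be to hold a strong reference for a while and then demote to a weak one:

using System;

// Hypothetical sketch: approximate soft-reference behavior by keeping a strong
// reference until a deadline, then demoting so only a WeakReference keeps it reachable.
sealed class SoftCacheEntry<T> where T : class
{
    private readonly WeakReference<T> _weak;
    private readonly DateTime _demoteAfterUtc;
    private T _strong; // keeps the value alive while the entry is "hot"

    public SoftCacheEntry(T value, TimeSpan holdFor)
    {
        _strong = value;
        _weak = new WeakReference<T>(value);
        _demoteAfterUtc = DateTime.UtcNow + holdFor;
    }

    public bool TryGetValue(out T value)
    {
        if (_strong != null && DateTime.UtcNow >= _demoteAfterUtc)
            _strong = null; // demote: from here on, only the GC decides

        if (_strong != null)
        {
            value = _strong;
            return true;
        }

        return _weak.TryGetTarget(out value);
    }
}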

heejaechang commented 6 years ago

We can give something concrete to the ML box (which it can think of as the actual entry and hold onto), and when it says "I want to evict this entry," we can convert that to the actual tree/compilation/text and evict it from our implementation.

I want this kind of abstraction so that we can mix this with other caches as well, like MRU or LRU, or any other cache we implement explicitly for a certain feature because we know exactly what is needed.
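
Something along these lines, maybe (hypothetical names, just to pin down the abstraction being described):

using System;

// Hypothetical abstraction: the box holds opaque entries and decides what should go;
// the host maps an entry back to the actual tree/compilation/text and performs the eviction.
interface ICacheEntry
{
    object Key { get; }          // opaque identity of the tree/compilation/text
    int ApproximateSize { get; } // optional weight for later (size, cost to re-create, ...)
}

interface ICacheBox
{
    void OnAdded(ICacheEntry entry);
    void OnAccessed(ICacheEntry entry);

    // Raised when the box decides an entry should go; the host does the actual eviction.
    event Action<ICacheEntry> EvictionRequested;
}

Behind that interface the box could be an MRU/LRU, the ML predictor, or a mix of both, without the rest of the implementation caring.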

jkotas commented 6 years ago

Java has the concept of a 'SoftReference' ... Does .NET have any sort of equivalent?

It does not today. @Maoni0 has been looking recently into whether we should have an API that makes it easy to implement memory-sensitive caches.

jamesqo commented 6 years ago

@heejaechang

We can give something concrete to the ML box (which it can think of as the actual entry and hold onto), and when it says "I want to evict this entry," we can convert that to the actual tree/compilation/text and evict it from our implementation.

Is this something along the lines of what you'd want? https://gist.github.com/jamesqo/ad87331e90a6b8a4ea744f9700e070e2 If not, could you provide a code example of the abstraction you'd want to define?

CyrusNajmabadi commented 1 year ago

Closing out. This does not seem actionable. These systems have also been removed from Roslyn.