Closed GSPP closed 4 years ago
@GSPP Tiering is a constant topic in planning conversations. My impression is that it's a matter of when, not if, if that provides any solace. As to why it's not already there, I think it's because, historically, the perceived potential gains didn't justify the additional development resources necessary to manage the increased complexity and risk of multiple codegen modes. I should really let the experts speak to this, though, so I'll add them.
/cc @dotnet/jit-contrib @russellhadley
Somehow I doubt that this is still relevant in a world of crossgen/ngen, Ready to Run and corert.
None of these deliver high steady state throughput right now which is what's important for most web apps. If they ever do, I'm happy with that since I personally don't care about startup time.
But so far all code generators for .NET have tried to make an impossible balancing act between the two goals, fulfilling neither very well. Let's get rid of that balancing act so that we can turn optimizations to 11.
> But so far all code generators for .NET have tried to make an impossible balancing act between the two goals, fulfilling neither very well. Let's get rid of that balancing act so that we can turn optimizations to 11.
I agree but fixing this doesn't require things like an interpreter. Just a good crossgen compiler, be it a better RyuJIT or LLILC.
I think the biggest advantage is for applications that need to generate code at runtime. These include dynamic languages and server containers.
It's true that dynamically generated code is one motivation - but it is also true that a static compiler will never have access to all of the information available at runtime. Not only that, even when it speculates (e.g. based on profile information), it is much more difficult for a static compiler to do so in the presence of modal or external context-dependent behavior.
Web apps should not need any ngen-style processing. It does not fit well into the deployment pipeline. It takes a lot of time to ngen big binary (even if almost all code is dynamically dead or cold).
Also, when debugging and testing a web app you can't rely on ngen to give you realistic performance.
Further, I second Carol's point about using dynamic information. The interpretation tier can profile code (branches, loop trip counts, dynamic dispatch targets). It's a perfect match! First collect the profile, then optimize.
Tiering solves everything in every scenario forever. Approximately speaking :) This can actually get us to the promise of JITs: Achieve performance beyond what a C compiler can do.
Current implementation of RyuJIT as it is now is good enough for a Tier 1... The question is: Would it make sense to have a Tier 2 extreme optimization JIT for hot paths that can run after the fact? Essentially when we detect or have enough runtime information to know that something is hot or when asked to use that instead from the start.
RyuJIT is by far good enough to be the tier 1. Problem with that is that an interpreter would have far faster startup time (in my estimation). Second problem is in order to advance to tier 2 the local state of executing tier 1 code must be transferable to the new tier 2 code (OSR). That requires RyuJIT changes. Adding an interpreter would be, I think, a cheaper path with better startup latency at the same time.
An even cheaper variant would be to not replace running code with tier 2 code. Instead, wait until the tier 1 code naturally returns. This can be a problem if the code enters into a long running hot loop. It will never arrive at tier 2 performance that way.
I think that would not be too bad and could be used as a v1 strategy. Mitigating ideas are available such as an attribute marking a method as hot (this should exist anyway even with the current JIT strategy).
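To make the "wait until the tier 1 code naturally returns" variant concrete, here is a minimal Python model of it. It is only a sketch of the idea discussed above, not how any real runtime does it: the class `TieredMethod`, the two `sum_*` functions, and the threshold of 3 are all hypothetical. The point it shows is that a version is picked once at method entry, so a single call containing a long loop never leaves tier 1.

```python
# Illustrative model only: promotion happens at method entry, never mid-call.
class TieredMethod:
    def __init__(self, tier1, tier2, threshold):
        self.tier1 = tier1          # quickly produced, unoptimized version
        self.tier2 = tier2          # slower to produce, optimized version
        self.threshold = threshold  # calls before the swap (hypothetical value)
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        # The version is chosen once, on entry. A call that is already
        # running keeps using whatever version it started with (no OSR).
        impl = self.tier2 if self.calls > self.threshold else self.tier1
        return impl(*args)


def sum_tier1(n):
    total = 0
    for i in range(n):      # a long loop stays in tier 1 for the whole call
        total += i
    return ("tier1", total)


def sum_tier2(n):
    return ("tier2", n * (n - 1) // 2)   # stand-in for optimized code


hot_loop = TieredMethod(sum_tier1, sum_tier2, threshold=3)
single_long_call = hot_loop(1_000_000)[0]               # one call, one long loop
many_short_calls = [hot_loop(10)[0] for _ in range(5)]  # repeated entry allows the swap
```

In this model the long call reports "tier1" for its entire duration, while the repeated short calls switch to "tier2" once the counter passes the threshold. Replacing code in the middle of the loop would need the state transfer (OSR) described earlier.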
@GSPP That is true, but that doesn't mean you wouldn't know that on the next run. If jitted code and instrumentation become persistent, then on the second execution you will still get Tier 2 code (at the expense of some startup time), which for once I personally don't care about, as I write mostly server code.
Writing an interpreter seems cheap compared to a JIT.
Instead of writing a brand new interpreter, could it make sense to run RyuJIT with optimizations disabled? Would that improve startup time enough?
> A high quality code generator must be created. This could be VC
Are you talking about C2, the Visual C++ backend? That's not cross-platform and not open source. I doubt that fixing both would happen anytime soon.
Good idea with disabling optimizations. The OSR problem remains, though. Not sure how difficult it is to generate code that allows the runtime to derive the IL architectural state (locals and stack) at runtime at a safe point, copy that into tier 2 jitted code and resume tier 2 execution mid-function. The JVM does it but who knows how much time it took to implement that.
Yes, I was talking about C2. I think I remember that at least one of the Desktop JITs is based on C2 code. Probably does not work for CoreCLR but maybe for Desktop. I'm sure Microsoft is interested in having aligned code bases so that's probably out indeed. LLVM seems to be a great choice. I believe multiple languages are currently interested in making LLVM work with GCs and with managed runtimes in general.
> LLVM seems to be a great choice. I believe multiple languages are currently interested in making LLVM work with GCs and with managed runtimes in general.
An interesting article on this topic: Apple recently moved the final tier of their JavaScript JIT away from LLVM: https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/ . We would likely encounter similar issues to what they encountered: slow compile times and LLVM's lack of knowledge of the source language.
10x slower than RyuJIT would be totally acceptable for a 2nd tier.
I don't think that the lack of knowledge of the source language (which is a true concern) is inherent in LLVM's architecture. I believe multiple teams are busy moving LLVM into a state where source language knowledge can be utilized more easily. All non-C high-level languages have this problem when compiling on LLVM.
The WebKIT FTL/B3 project is in a harder position to succeed than .NET because they must excel when running code that in total consumes a few hundred milliseconds of time and then exits. This is the nature of JavaScript workloads driving web pages. .NET is not in that spot.
@GSPP I'm sure you probably know about LLILC. If not, take a look.
We have been working for a while on LLVM support for CLR concepts and have invested in both EH and GC improvements. Still quite a bit more to do on both. Beyond that, there's an unknown amount of work to get optimizations working properly in the presence of GC.
LLILC seems to be stalled. Is it?
@drbo - LLILC is on the back burner for the moment - the MS team has been focusing on getting more targets brought up in RyuJIT as well as fixing issues that come up as CoreCLR drives to release and that's taken pretty much all our time. It's on my TODO list (in my copious free time) to write up a lessons learned post based on how far we've (currently) gotten with LLILC, but I haven't gotten to it yet.
On the tiering, this topic has generated lots of discussion over the years. I think that given some of the new workloads, as well as the new addition of versionable ready to run images, we'll be taking a fresh look at how and where to tier.
@russellhadley did you have the free time to write the post?
I hypothesize there is something about non-promoted stack slots and gcroots breaking the optimizations, plus slow jitting time... I had better have a look at the project's code.
I also wonder if it's possible and profitable to directly jump into SelectionDAG and perform part of LLVM backend. At least some peephole and copy propagation... if e.g. the gcroot promotion to the registers is supported in LLILC
I am curious on the status of LLILC including current bottlenecks and how it fares against RyuJIT. LLVM being full-fledged "industrial-strength" compiler should have a great wealth of optimizations available to OSS. There have been some talks on more efficient, faster serialization/deserialization of bitcode format on the mailing list; I am wondering if this is a useful thing for LLILC.
Have there been any more thoughts on this? @russellhadley CoreCLR has been released and RyuJIT has been ported to (at least) x86. What is next on the roadmap?
See dotnet/coreclr#10478 for the beginnings of work on this.
Also dotnet/coreclr#12193
@noahfalk, could you please provide a way to tell the runtime to force a tier 2 compilation right away from the managed code itself? Tiered compilation is a very good idea for most use cases, but I'm working on a project where startup time is irrelevant but throughput and a stable latency are essential.
Off the top of my head, this could either be:
- a new setting in the config file, a switch like <gcServer enabled="true" /> to force the JIT to always skip tier 1
- or something like RuntimeHelpers.PrepareMethod, which would be called by the code on all methods that are parts of the hot path (we're using this to pre-JIT our code on startup). This has the advantage of giving a greater degree of freedom to the developer who should know what the hot path is. An additional overload of this method would be just fine.

Granted, few projects would benefit from this, but I'm kind of worried by the JIT skipping optimizations by default, and me not being able to tell it I'd rather have it optimize my code heavily instead.
I'm aware you wrote the following in the design doc:
> Add new build pipeline stage accessible from managed code APIs to do self-modifying code.

Which sounds very interesting, but I'm not quite sure it covers what I'm asking here.
Also a related question: when would the second JIT pass kick in? When a method is going to be called for the nth time? Will the JIT happen on the thread the method was supposed to run on? If so, that would introduce a delay before the method call. If you implement more aggressive optimizations this delay would be longer than the current JIT time, which may become an issue.
It should happen when the method is called enough times, or if a loop executes enough iterations (on-stack replacement). It should happen asynchronously on a background thread.
@ltrzesniewski - Thanks for the feedback! Certainly I hope tiered compilation is useful for the vast majority of projects but the tradeoffs may not be ideal for every project. I've been speculating we would leave an environment variable in place to disable tiered jitting, in which case you keep the runtime behavior you have now with higher quality (but slower to generate) jitting up front. Is setting an environment variable something reasonable for your app to do? Other options are also possible, I just gravitate to the environment variable because it is one of the simplest configuration options we can use.
> Also a related question: when would the second JIT pass kick in?
This is a policy that is very likely to evolve over time. The current prototype implementation uses a simplistic policy: "Has the method been called >= 30 times" https://github.com/dotnet/coreclr/blob/master/src/vm/tieredcompilation.cpp#L89 https://github.com/dotnet/coreclr/blob/master/src/vm/tieredcompilation.cpp#L122
Conveniently this very simple policy suggests a nice perf improvement on my machine, even if it is just a guess. In order to create better policies we need to get some real world usage feedback, and getting that feedback will require that the core mechanics are reasonably robust in a variety of scenarios. So my plan is to improve robustness/compat first and then do more exploration for tuning policy.
@DemiMarie - We don't have anything that tracks loop iterations as part of the policy now, but it's an interesting prospect for the future.
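For readers who want a feel for the ">= 30 calls" policy mentioned above, the following Python sketch models it together with the earlier suggestion that promotion should run on a background thread. It is an illustration under stated assumptions, not the code in tieredcompilation.cpp: `TieredMethod`, `background_compiler`, and the queue-based hand-off are hypothetical names and design choices; only the threshold of 30 comes from the thread.

```python
import queue
import threading

CALL_COUNT_THRESHOLD = 30  # the ">= 30 calls" prototype policy quoted above


class TieredMethod:
    def __init__(self, tier1, tier2, promote_queue):
        self.code = tier1            # currently published version
        self._tier2 = tier2
        self._calls = 0
        self._queued = False
        self._lock = threading.Lock()
        self._promote_queue = promote_queue

    def __call__(self, *args):
        with self._lock:
            self._calls += 1
            if self._calls >= CALL_COUNT_THRESHOLD and not self._queued:
                self._queued = True
                self._promote_queue.put(self)   # request promotion, do not wait
            code = self.code
        return code(*args)                      # caller keeps running right away

    def publish_tier2(self):
        with self._lock:
            self.code = self._tier2


def background_compiler(promote_queue):
    # Stand-in for a background JIT thread: takes requests and publishes tier 2.
    while True:
        method = promote_queue.get()
        if method is None:       # sentinel: shut down
            break
        method.publish_tier2()


requests = queue.Queue()
worker = threading.Thread(target=background_compiler, args=(requests,))
worker.start()

square = TieredMethod(lambda x: ("tier1", x * x), lambda x: ("tier2", x * x), requests)
first_29 = {square(3)[0] for _ in range(29)}   # below the threshold: tier 1 only
square(3)                                      # 30th call queues the promotion
requests.put(None)
worker.join()                                  # wait so the outcome is deterministic
after_promotion = square(3)
```

The calling thread never blocks on the "compile": it only enqueues a request and carries on with the tier 1 version until the worker publishes the replacement.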
Have there been any thoughts on profiling, speculative optimization, and deoptimization? The JVM does all of these.
@noahfalk An environment variable is definitely not a solution that would allow you to control this application by application. For server/service apps you usually don't care how much time it takes for the application to start up (I know we don't, if the alternative costs performance). Developing a database engine, I can tell you first hand that we need it to work as fast as it can from the get-go, even on unexceptional paths or in benchmarks done by potential new clients.
On the other hand, given that in typical environments uptime is measured in weeks at a time, we don't care if startup takes even 30 seconds; what we do care about is that forcing the user to flip a general (all-or-nothing) switch, or even making the user care about it at all (for example, having it set by default in the config files), is 10 steps backwards.
Don't get me wrong, I am more than looking forward to a Tiered JIT because it opens the path to a high-performance, take-as-much-time-as-you-need code path for optimization at the JIT level. I even suggested that myself a long time ago in informal talks with some of the JIT engineers, and you already had it on the radar. But a way to customize the behavior application-wide (not system-wide) is (at least for us) a critical quality indicator for this particular feature.
EDIT: Some style issues.
@redknightlois - Thanks for the follow up
> An environmental variable definitely not a solution that would allow you to control this application by application.
A little confused on this part... environment variables have per-process rather than per-system granularity, at least in the platforms I was aware of. For example today to turn on tiered compilation for testing in just one application I run:
set COMPLUS_EXPERIMENTAL_TieredCompilation=1
MyApp.exe
set COMPLUS_EXPERIMENTAL_TieredCompilation=0
> what we do care is that [we don't] forcing the user ... to care about it
I take it that you'd like a configuration setting that can be specified by the app developer, not by the person running the app? One possibility with the env var is making the app the user launches a trivial wrapper (like a batch script) that launches the coreclr app though I admit it seems a bit inelegant. I'm open to alternatives and not set on the env var. Just to set expectations this is not an area I'll be spending active design effort in the very near future, but I agree that having appropriate configuration is important to get to.
Also a heads up - assuming we continue down the tiered compilation path a decent ways, I could easily imagine we reach a point where enabling tiered compilation is not only the fastest startup, but also beats the current steady state performance. Right now startup perf is my target, but it's not the limit of what we can do with it : )
> Have there been any thoughts on profiling, speculative optimization, and deoptimization?
@DemiMarie - They've certainly come up in conversations and I think many folks are excited that tiered compilation opens up these possibilities. Speaking just for myself I'm trying to stay focused on delivering the foundational tiered compilation capabilities before setting my sights higher. Other folks in our community are probably already front-running me on other applications.
@noahfalk Yes, being inelegant also means that the usual process to run it can (and very likely will) become error-prone, and that's essentially the issue (the only way to be completely sure no one will mess up is to do it system-wide). An alternative that we know works: in the same way you can configure whether to use the server GC with an entry in the app.config
you can do the same with tiered compilation (at least until tiered compilation can consistently beat the steady state performance). Being the JIT, you can also do that per assembly using the assembly.config
which would give a degree of control that currently doesn't exist, if other knobs can be selected in that way too.
Environment variables are often set per-user or per-system, which has the potential negative effect of affecting all such processes, across multiple versions of the runtime. A per-app config file seems like a much better solution (even if per-user/per-system is also available) -- something like the desktop config values that could be set in app.config, but also use env vars or registry.
I think we should implement the most common path, which is per-application. System-wide settings may also be useful, but I don't think we have to think about them before the feature gets implemented.
Please note we haven't worked out in any detail what the second-tier jit should do for optimization, though we have some ideas. It might just do what the jit does today, but quite likely it will do more.
So let me point out some potential complications....
It is possible that the second-tier jit will bootstrap itself on top of observations made on the behavior of the code created by the first-tier jit. So bypassing the first-tier jit and asking for the second-tier jit directly may not work at all, or may not work as well, as just letting tiering run its course. Possibly a "tiering bypass" option, however implemented, would end up giving code like the code the jit produces by default today, not the code a second-tier jit could produce.
The second-tier jit may be tuned in such a way that running it on a large set of methods causes relatively slow jit times (since our expectation is that relatively few methods will end up being jitted with the second-tier jit, and we expect the second-tier jit will do more thorough optimization). We don't know the right tradeoffs here yet.
That being said...
I think an "aggressive optimization" method attribute makes sense -- one asking the jit to behave somewhat like the second-tier jit might behave for specific methods, and perhaps skipping over these methods during prejitting (since prejitted code runs slower than jitted code, especially for R2R). But applying this notion to an entire assembly or to all assemblies in an application doesn't seem as appealing.
If you take what happens in native compilers as a suitable analogy, performance vs compile time/code size tradeoffs can get pretty bad at higher optimization levels, eg 10x longer compiles for an aggregate 1-2% improvement in performance. The key to the puzzle is knowing which methods matter, and the only way to do that is for either the programmers to know or for the system to figure it out for itself.
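The tradeoff above can be put into rough numbers. The following back-of-the-envelope Python helper is only a sketch: `break_even_runtime` is a hypothetical name, the 1 ms baseline compile time is an invented figure, and only the "10x longer for 1-2%" ratio comes from the comment. It computes how long a method must run (measured in baseline run time) before the extra compile time is repaid.

```python
def break_even_runtime(extra_compile_seconds, speedup_fraction):
    """Baseline run time a method needs before extra compile time pays off.

    speedup_fraction is the share of run time saved, e.g. 0.02 for 2%.
    """
    if not 0 < speedup_fraction < 1:
        raise ValueError("speedup_fraction must be between 0 and 1")
    return extra_compile_seconds / speedup_fraction


# Hypothetical numbers: a 1 ms baseline compile that becomes 10x slower
# costs 9 ms extra; the payoff is the 1-2% range from the comment above.
extra = 0.010 - 0.001
at_two_percent = break_even_runtime(extra, 0.02)   # about 0.45 s of run time
at_one_percent = break_even_runtime(extra, 0.01)   # about 0.9 s of run time
```

Under these assumed figures a method has to accumulate roughly half a second to a second of execution before heavier optimization wins, which is why knowing which methods matter is the key to the puzzle.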
@AndyAyersMS I think you hit the nail on the head there. Having the JIT honor an "aggressive optimization" attribute would probably solve most of the issues around the JIT not having enough information to produce better code in isolation, without the first-tier jit having had time to provide that feedback.
@redknightlois An attribute won't work if we want more tiers (T3 JIT, T4 JIT, ...). I'm not sure whether two levels are enough, but we should at least consider this possibility.
It would be great to be able to use something similar to MPGO to start running with second-tier jitted code. Fast-forwarding the first-tier instead of bypassing it completely.
@AndyAyersMS, has the fact that Azul have implemented a managed JIT for the JVM using LLVM made it any easier to integrate LLVM in the CLR? Apparently changes were pushed upstream to LLVM in the process.
Just fyi, I created a number of work items for some particular work we need to do to get tiered jitting off the ground (#12609, dotnet/coreclr#12610, dotnet/coreclr#12611, dotnet/coreclr#12612, dotnet/coreclr#12617). If your interest directly relates to one of those feel free to add your comments to them. For any other topics I assume discussion will remain here, or anyone can create an issue for a specific sub-topic if there is enough interest to merit splitting it out on its own.
@MendelMonteiro Making MPGO-style feedback data available when jitting is certainly an option (currently we can only read this data back when prejitting). There are various limits to what can be instrumented, so not all methods can be handled this way; there are other limitations we need to look at (for instance, no feedback data is available for inlinees); the instrumentation and training runs needed to create the MPGO data are a barrier for many users; and the MPGO data may or may not match up with what we'd have when bootstrapping off the first tier. But the idea certainly has merit.
As far as an LLVM based upper tier goes -- obviously we have looked into this to some extent with LLILC, and at the time we were in frequent contact with the Azul folks, so we are familiar with many of the things they were doing in LLVM to make it more amenable to compilation of languages with precise GC.
There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLR's GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We'd either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM's optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren't sure ultimately what kind of code quality it might produce.
So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.
@AndyAyersMS Maybe you can introduce the needed changes in LLVM rather than work around its limitations.
Does Multicore JIT and its Profile Optimization work with coreclr?
@benaadams - Yeah, multicore JIT works. I don't recall in which scenarios (if any) it is enabled by default, but you can turn it on via configuration: https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h#L548
I wrote a half-toy compiler and I've noticed that most of the time the hard hitting optimizations can be done fairly ok on the same infrastructure and very few things can be done in the higher tier optimizer.
What I mean is this: if a function is hit many times, the parameters as:
It would also be very nice, though maybe I'm dreaming, if CompilerServices exposed the "advanced compiler" so that it could be accessed via code or metadata; places like games or trading platforms could then benefit by choosing ahead of time which classes and methods should be "more deeply compiled". This is not NGen, but if a non-tiered compiler is not possible (or desirable), it should at least be possible to use the more heavily optimized code for critical parts that need this extra performance. Of course, if a platform does not offer the heavy optimizations (let's say Mono), the API calls would basically be a no-op.
We have a solid foundation for tiering in place now thanks to the hard work of @noahfalk, @kouvel and others.
I suggest that we close this issue and open a "how can we make tiered jitting better" issue. I encourage anyone interested in the topic to give the current tiering a try to get an idea where things are at right now. We would love to get feedback on the actual behavior, whether good or bad.
Is the current behavior described somewhere? I only found this but it's more about the implementation details rather than the tiering specifically.
I believe we're going to have some kind of summary writeup available soon, with some of the data we've gathered.
Tiering can be enabled in 2.1 by setting COMPlus_TieredCompilation=1. If you try it, please report back what you find....
With recent PRs (https://github.com/dotnet/coreclr/pull/17840, https://github.com/dotnet/sdk/pull/2201) you also have the ability to specify tiered compilation as a runtimeconfig.json property or an msbuild project property. Using this functionality will require you to be on very recent builds, whereas the environment variable has been around for a while.
As we've discussed before with @jkotas, Tiered JIT can improve startup time. Does it work when we use native images? We've made measurements for several apps on a Tizen phone and here are the results:
| System DLLs | App DLLs | Tiered | Time (s) |
|---|---|---|---|
| R2R | R2R | no | 2.68 |
| R2R | R2R | yes | 2.61 (-3%) |
| R2R | no | no | 4.40 |
| R2R | no | yes | 3.63 (-17%) |
We'll check FNV mode as well, but it looks like it works well when there are no native images.
cc @gbalykov @nkaretnikov2
FYI, tiered compilation is now the default for .NET Core: https://github.com/dotnet/coreclr/pull/19525
@alpencolt, startup time improvements may be less when using AOT compilation such as R2R. The startup time improvement currently comes from jitting more quickly with fewer optimizations, and when using AOT compilation there would be less to JIT. Some methods are not pregenerated, such as some generics, IL stubs, and other dynamic methods. Some generics may benefit from tiering during startup even when using AOT compilation.
I'm going to go ahead and close this issue, since with @kouvel's commit I think we have achieved the ask in the title : D People are welcome to continue discussion and/or open new issues on more specific topics such as requested improvements, questions, or particular investigations. If anyone thinks it is closed prematurely of course let us know.
Why is the .NET JIT not tiered?
The JIT has two primary design goals: Fast startup time and high steady-state throughput.
At first, these goals appear at odds. But with a two-tier JIT design they are both attainable:
Main method is almost always cold and jitting it is a waste of time.
Reaching this architecture does not seem too costly:
Is this idea being pursued by the JIT team?
.NET runs on hundreds of millions of servers. I feel like a lot of performance is left on the table and millions of servers are wasted for customers because of suboptimal code gen.
category:throughput theme:big-bets skill-level:expert cost:extra-large