dotnet / runtime

General performance question (comparing Nightly to 1.1) #20700

Closed marksmeltzer closed 4 years ago

marksmeltzer commented 7 years ago

I've got a console app and library I'm working on that targets CoreFx 1.1 and 2.vNext (the nightly). I also target the full framework.

This app (which will be released as OSS soon) parses text documents and spits out new text documents. It hits the System.IO APIs (mostly Path.GetFullPath), System.Text APIs (mostly StringBuilder), the collection classes (mostly Dictionary<,>), and the System.Text.RegularExpressions API (mostly matching).

The CoreFx 1.1 build (compiled in release, mind you) runs about 25% slower than the full framework build (4.6.2) of the app. The CoreFx 2.0 nightly build (also compiled in release) runs about 40% slower than full framework and 15% slower than the CoreFx 1.1 version of the app (for the code compiled against my local copy of the nightly, I did a release publish build just to be sure that I was running against the latest DLLs).

Are there any ingrained debug flags, etc., that need to be disabled (perhaps by command line option?) when compiling the latest CoreFx sources to get maximum performance? Or is there some other general reason that might explain the performance drop I'm seeing from 1.1?

This is the same exact code running on all three platforms. I will be publishing this project soon. So it might be worth looking into at that point. I don't know if the profiling tools are working for the latest CoreFx builds yet. If so, I'd be happy to post profiler results if anyone can point me to some documentation on how to profile when dogfooding a local build of CoreFx.

I'm sure the focus right now is more on completeness and correctness. But I am a bit of a performance hawk. So I thought I'd ask.

danmoseley commented 7 years ago

@marksmeltzer I'm not aware of any flags. I am definitely interested in this though.

Can you use PerfView to gather traces on CoreFX 2.0, and on Desktop? With a difference that size, it should be fairly easy to see where the regression(s) lie.

/cc @brianrob @vancem

brianrob commented 7 years ago

@marksmeltzer, PerfView is the right way to go. You'll be able to see exactly what's happening in each case.

It's definitely possible that you've run into a case where the underlying library code is different.

To my knowledge, as long as you are building release for your binaries and for the .NET core binaries, you should be good to go.

If you need any help with PerfView or interpreting results, let me know.

marksmeltzer commented 7 years ago

@danmosemsft @brianrob, I'm happy to provide dumps; however, I've never used PerfView before (call me new fashioned). Could you provide a link or a basic guideline on which options to set to get the data you want to see?

[screenshot]

I guess I'm spoiled by the click-to-run profiling tools from RedGate and JetBrains.

brianrob commented 7 years ago

cc: @DrewScoggins

brianrob commented 7 years ago

@marksmeltzer, you should be fine with the defaults as they are selected here. If the app turns out not to be CPU bound, you may wish to collect Thread Time, but a default trace is a great place to start.
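
(For reference, PerfView can also be driven from the command line; these two invocations are from memory, so verify them against PerfView's built-in help. The first takes a default CPU sample trace; the /threadTime qualifier additionally records the context-switch data needed for blocked-time analysis:)

    PerfView collect
    PerfView /threadTime collect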

Also, it may be worthwhile to look at the PerfView tutorials. There is a videos link at the top right of the main PerfView window. The PerfView help also has a built-in tutorial that should help you get acquainted. The help is really quite good.

marksmeltzer commented 7 years ago

I'm updating my local tooling to the latest CoreFx sources... Will try PerfView once that is completed. I'm assuming y'all want to see the results from .NET 4.6.2, CoreFx 1.1 and CoreFx 2.latest.

What is the general expectation regarding performance for .NET 4.6.2 vs. CoreFx 1.1? Should we be expecting them to be on par or are there cases where we expect CoreFx to be (noticeably) slower? I was kind of hoping that CoreFx builds would always be >= the performance of .NET builds :)

brianrob commented 7 years ago

Yeah, I'm interested to see the traces. In general, we were aiming for parity or even better in some cases. However, there are certainly cases where things are worse. The "why" can vary from case to case.

marksmeltzer commented 7 years ago

Okay, the typical pattern for the console app I'm working on is a single pass. It's like a compiler. It transforms text inputs into text outputs. On a single run, I reliably see the disparity between .NET, CoreFx 1.1 and CoreFx vLatest that I originally mentioned in the first post.

I was curious how startup costs, caching, etc., might be affecting those results. So, I added a -samples <n> command to my app which just runs the logic in a loop (in memory only, so that it is CPU bound), and only does the disk outputs after measuring the timing of the final sample. I set it up so that when samples > 0, the timer starts after the 0th sample. That allows the app to fully warm up before measuring the timing on the additional samples. When samples = 0, it measures the time of the 0th sample only (not counting time to start the app itself), before writing the disk outputs. That way -samples 0 gives the warm-up time.
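
A minimal C# sketch of that measurement pattern (the names here are illustrative, not from the actual app):

    using System;
    using System.Diagnostics;

    static class SampleRunner
    {
        // samples == 0: time the cold 0th pass itself (the warm-up cost).
        // samples > 0: run the 0th pass untimed as a warm-up, then time the rest.
        public static TimeSpan Measure(int samples, Action transform)
        {
            var stopwatch = new Stopwatch();

            if (samples == 0)
            {
                stopwatch.Start();
                transform();          // cold run, pays the full JIT price
                stopwatch.Stop();
                return stopwatch.Elapsed;
            }

            transform();              // 0th sample: warm-up, not timed

            stopwatch.Start();
            for (int i = 0; i < samples; i++)
                transform();          // warmed-up, CPU-bound runs only
            stopwatch.Stop();

            return stopwatch.Elapsed;
        }
    }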

Here are my timing results (best of 20 runs):

[screenshot: timing results table]

Even after adjusting for the warm-up time, the rule seems to be that .NET > CoreFx 1.1 > CoreFx vLatest in terms of performance. Ignoring the warm-up though, the difference between 1.1 and vLatest is pretty small. The biggest difference is seen in the 0th run. For my app, the 0th run timing will play a role though, since the app will often be invoked from the CLI. So those startup costs do count.

I will collect PerfView data for the 0 samples and for 10 (since that's where the data seems to converge) and provide that for further analysis.

marksmeltzer commented 7 years ago

@brianrob I've got the PerfView data... Could you email me to initiate the exchange? I don't want to post it publicly. I'll want to send you encrypted ZIPs. Seems GitHub nuked their PM feature. I can be reached by my name "mark" at the domain "livingagile [dot] com".

At first glance, it looks like a lot of the time is in the JIT phase.

On a side note, PerfView is a very cool tool. I will definitely be using this more in the future. I like the system-wide collection aspect, and the fact that it works with CoreFx without a hitch. It's a deep tool. Lots to learn. I'm guessing the profilers I'm accustomed to are just views across essentially this same data.

marksmeltzer commented 7 years ago

@danmosemsft I sent the traces to @brianrob earlier this evening.

@brianrob, feel free to share the data internally with others at MSFT as needed.

I'll have to read up more on PerfView to see if there's a straightforward way to strip out all the data for extraneous processes, (a) to make the files smaller and (b) so I don't have to worry about the security of the data. I don't care too much about the security of the trace data for the app itself since it is headed for an OSS release.

Or if y'all have a more straightforward way for customers to share secure data back to MSFT, that would probably be better than the method I used. I posted a secure ZIP to my company OneDrive, shared a link with Brian, had him email me a PGP public key, and then sent him the encrypted ZIP password. I can do that again in the future if needed, but it is a few hoops.

danmoseley commented 7 years ago

@marksmeltzer sounds good. I don't think we have an "official" way -- usually OneDrive or suchlike. I'll be interested to see what Brian finds. If we can fix it, great.

vancem commented 7 years ago

@danmosemsft Just FYI, PerfView knows how to save the data it uses in any 'Stack Viewer' to a file (called a *.PerfView.xml.zip file). This file is much smaller than the ETL.ZIP file and only has the data in the view.

If you are concerned about anyone seeing the other processes, you can do the following: open the CPU stacks view scoped to just your process and save that view from the stack viewer, so that the resulting file contains only your process's data.

karelz commented 7 years ago

Re: "official" way of sharing data. We have a way to get data from customers via official (paid) support. 'Secret' data from OSS community is harder and undesirable, because there are legal implications on MS - without proper handshake and legal arrangement, it is border line what we can do with the data and what we can't. Not that MS engineers are trying to be malicious, but it is dangerous waters for the company, which could backfire on us (engineers) who are trying to help as much as we can. From that perspective I think it is easiest just to ask external people to choose whatever transfer method they prefer. The rule should be -- if the data is secret, don't send it to us.

BTW: It is the same as with any other OSS project -- if you have problems with a JSON parser NuGet package, you have to choose whether you are willing to send a repro/data to the owner or not. The fact that the receiving person is a Microsoft employee should not change much in your decision about what you want to share "semi-publicly" and what you don't. In both cases you run a certain risk.

marksmeltzer commented 7 years ago

@karelz, fair point about the legal implications of receiving files from the OSS community. I'm fine with using the method that I did use, but if there was a way that's easier, I wanted to ask. (e.g. @vancem's suggestion is a good one.)

For the record though, I am a paid Visual Studio Ultimate + MSDN subscriber (or whatever it's called nowadays). 😃 So if that opens up an easier way, then maybe we'll pursue that in the future. But I'm not looking to burn any of my official support asks on CoreFx work yet either. So that might be moot.

@vancem, I will give that a try. That was my first dive into PerfView. There is probably nothing in there I would mind having in the open, but I don't yet know what all it collected. So, I erred on the side of caution. Chopping it down to just my app's stack trace seems like a solution to the security concern.

@brianrob, is the stack trace view data that @vancem mentioned the only thing you need out of the dataset? In other words, would that eliminate anything else you wanted to see? If so, are there additional export steps to slice out that subset of the data?

vancem commented 7 years ago

There is probably nothing in there I would mind having in the open, but I don't yet know what all it collected

Generally speaking you have to be reasonably paranoid to worry about what an ETW CPU trace shows you. It shows you the names of lots of files, and the names of lots of functions/methods/types that get executed. It also collects WHERE your networking packets go and how big they are (including potentially URLs and remote SQL commands). But generally speaking user data is NOT in there. So if you don't mind leaking file names (not data), URLs and remote SQL statements, you are probably fine. That said, if you worry about privacy, you should not publish ETL files.

marksmeltzer commented 7 years ago

@brianrob Let me know if you've got what you need. Otherwise, I can follow @vancem's advice and post just the relevant stacktraces in unencrypted zips here.

karelz commented 7 years ago

@marksmeltzer if you can, please send unencrypted data - it is much easier to deal with, and Brian can save the time of figuring out how to use the tools you pointed him to (he mentioned it on Thu). Unencrypted data, like high-quality bug reports, is easier to reason about and deal with and typically leads to a faster response ;-).

marksmeltzer commented 7 years ago

Understood. Here are the stack trace files:

I resolved all the symbols before saving, but I don't know if that was necessary. I like how small those files are. Very nice. The raw ETL files were quite large.

marksmeltzer commented 7 years ago

Here are some additional views in case they are helpful:

brianrob commented 7 years ago

@marksmeltzer, thanks for posting these. I had not yet been able to get access to the files. I'll take a look at these.

brianrob commented 7 years ago

@marksmeltzer, I took a look at these traces this morning. The biggest differences that I see in the 10-samples traces are:

  1. JIT compilation time: Lowest: Desktop; Highest: Nightly
  2. Regular expressions: Lowest: Desktop and Nightly (equal); Highest: Core 1.1

In the 0-samples traces the JIT compilation difference pops to the top - it looks like the regular expression engine isn't used as much (probably because there are 0 samples being handled).

The JIT compilation difference between desktop and core is likely a known one, as the pre-compilation technologies are different. Full framework uses NGEN while Core uses Ready to Run (R2R). There are a handful of generic instantiation types that R2R is not able to pre-compile that NGEN can, and thus we get a difference in JIT time during program execution. If you can share the JIT Stats views for the traces (they should just export to HTML files), I can confirm.

What is interesting to me is that the JIT time jumps up between v1.1 and the nightly build. I would have expected that to be much closer. The JIT Stats views should help to investigate this as well.

For regular expressions, I'm a little surprised that v1.1 was higher while desktop and nightly were essentially equivalent. It makes me wonder if something else on the box interfered with the app during collection. I haven't looked, but I doubt that RegularExpressions changed significantly across these releases. Does this difference repro across multiple runs of the app?

Hope this helps.

vancem commented 7 years ago

An important aspect of just-in-time (JIT) compilation is that for any particular method it only happens once per program execution. Thus it can be an issue in 'startup' scenarios and scenarios where processes start up and shut down rapidly (e.g. many times a second). However, it should not be an interesting factor if execution is longer (e.g. programs that run for seconds or more).

It is not clear whether your important scenario falls into the regime where JIT compilation matters. If your 'real' scenario is longer lived, then it may be as simple as making your benchmark run longer to 'fix' the problem (so that your benchmarks represent what is important to you).

There are techniques that can be used to minimize JIT compilation time (precompilation and background JIT compilation being two that spring to mind). However, first we should determine if it really is a concern (JIT time is only problematic if you really need to care about startup or have very short-lived processes that you can't merge).
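
As a concrete illustration of the precompilation end of that spectrum, user code can also force methods to be JITted at a predictable point rather than on first use. A minimal sketch (this is a general warm-up trick, not a technique prescribed anywhere in this thread):

    using System;
    using System.Reflection;
    using System.Runtime.CompilerServices;

    static class JitWarmup
    {
        // Eagerly JIT-compile every concrete, non-generic method on a type,
        // shifting the one-time JIT cost to startup instead of first use.
        public static void PrepareType(Type type)
        {
            const BindingFlags all = BindingFlags.Public | BindingFlags.NonPublic |
                                     BindingFlags.Instance | BindingFlags.Static |
                                     BindingFlags.DeclaredOnly;

            foreach (MethodInfo method in type.GetMethods(all))
            {
                // Abstract methods have no body; open generic methods need an
                // instantiation before they can be compiled.
                if (method.IsAbstract || method.ContainsGenericParameters)
                    continue;

                RuntimeHelpers.PrepareMethod(method.MethodHandle);
            }
        }
    }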

marksmeltzer commented 7 years ago

@vancem, I think the data above show startup JIT is the root cause of the large difference on the 0th run. That's where the app starts, executes a single task, and then shuts down. So I pay the full JIT price for that run. In the rest of the data where I'm collecting multiple samples for profiling purposes, the JIT impact is mitigated and thus the sample times between the three platforms are pretty similar.

Is there some documentation you could point me to for CoreCLR on how to engage alternate JIT behavior? Googling (oops I mean Binging...) for CoreCLR and CoreFX related docs is pretty painful right now. I'm starting to only use "CoreCLR" and "CoreFX" (and never ".NET Core" <-- that really confuses the searches) and that seems to help the search engines find what I want.

I'm happy to try the alternatives and post the results. If the background JIT for CoreCLR is the same as .NET, I'm already familiar with that API and I can give it a try and see what happens. However, as @brianrob suggested, I'm surprised at the large jump in JIT times from 1.1 to the nightly. There may be cause for concern there -- some optimization may be warranted on CoreCLR for the 2.0 release to bring it back to parity with 1.1 (although better is always nice).

Does CoreCLR even support precompilation anymore? AFAIK the netnative support was pulled and only applies to UWP. Perhaps CoreCLR has its own ngen or something similar?

vancem commented 7 years ago

PerfView (the tool you used) has some documentation in its 'JitStats' report, which, while written for .NET Desktop, is still generally applicable to .NET Core. In particular, the docs on background JIT compilation are applicable.

You can precompile for .NET Core using a tool called crossgen (it is part of the .NET runtime NuGet package). I don't know of documentation on it over and above what /? produces (but I have not looked carefully). We are not encouraging it yet, because there is some subtlety in its use. In particular, you can pass a /ReadyToRun option that makes a version-resilient native image (thus other DLLs or the runtime can change and that image is still correct), or NOT (which precompiles to better code, but is version-fragile: any changes to the runtime or other DLLs it depends on require that the image be re-crossgened). This latter form only really makes sense if you publish .NET Core as part of your app.
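
For illustration only - the exact switches vary between drops, crossgen /? is authoritative, and these option names are from memory - an R2R invocation looks roughly like:

    crossgen /ReadyToRun /Platform_Assemblies_Paths <path-to-runtime-assemblies> MyApp.dll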

marksmeltzer commented 7 years ago

@brianrob, it looks like I collected the trace data using my custom build of CoreFx. I have made improvements to the runtime performance of the regex engine. In the sample timing data I posted (the image with the table), I was careful to run those on the stock nightly of CoreFx. The timings for stock vNext are slightly slower than 1.1 across all sample sizes > 0.

When I got around to using PerfView, I forgot to switch my local CoreFX branch back to master, rebuild, etc. So my improvements got into the PerfView data. That's why vNext appears faster at regex in the traces. I'll collect another dataset using stock vNext later today and upload it for comparison.**

Here's the JIT Stats you requested:


** This is outside the bounds of the present conversation, but I will need to look more into the code submission process. I'm planning to submit a pull request for my changes to the regex engine. One of them has a lot of bang and it's only like 4 lines of code. The other is a bit more extensive and I'm not 100% sold on it. I made the two changes in separate branches so I'll submit them separately and see if they get accepted. I'm pretty sure the first change will be accepted. (Hopefully if the changes are accepted they will bubble up to .NET full as well.) Together, in my app, I'm seeing about a 4% net improvement over many samples. If I were to add isolated performance sampling to analyze just the regex impact, I'm guessing it's more like a 20%-40% isolated improvement (on the hot path where the optimizations apply).

I'm actually leaning toward a copy-and-paste of the regex engine to Regex2 and then optimizing the hell out of it. My current approach had to keep backward compatibility and there are several changes I wanted to make that aren't allowed by contract (viz., netstandard requirements). I would happily contribute Regex2 to the platform as well (it would come with other goodies like a faster StringBuilder2, and possibly a faster Dictionary2) in the form of a nuget package. That work probably wouldn't belong in the CoreFX release.

marksmeltzer commented 7 years ago

@vancem, Thanks for the tip! I will have a look at the crossgen tool.

Background JIT would probably work really well for console apps in general because the JIT profile key can be based on the impactful set of runtime arguments (e.g. CLI options set by the user). I'm guessing that strategy would provide a pretty decent win on the .NET framework version of the app as well.
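
Assuming the background JIT API carries over from the full framework unchanged, the shape would be something like this (a sketch; the profile-key scheme is my own illustration):

    using System;
    using System.Runtime;

    class Program
    {
        static void Main(string[] args)
        {
            // Keep the background-JIT profiles next to the app binaries.
            ProfileOptimization.SetProfileRoot(AppContext.BaseDirectory);

            // Hypothetical: key the profile on the impactful CLI option so each
            // mode of the tool warms up the methods it actually uses.
            string profileKey = args.Length > 0 ? args[0] : "default";
            ProfileOptimization.StartProfile("myapp." + profileKey + ".profile");

            // ... rest of the app. On later runs with the same key, methods
            // recorded in the profile are compiled on background threads.
        }
    }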

Thank you for mentioning background JIT. It's a feature I often forget about because I don't often work on apps where it could be used very advantageously.

marksmeltzer commented 7 years ago

@brianrob, I just looked at the JIT stats for the vNext samples I posted above, and it's heavy on JITing some of the framework DLLs from my custom build. That is skewing the results a LOT. I'm running a stock build now and will post the revised JIT stats and traces for stock vNext once that completes.

I will probably get the data posted within the next few hours.

I'll also update my projects to just use the nightly build directly. And post those stats too.

I'm curious if that would show a difference compared to a locally compiled version of the nightly. If the huge jump I'm seeing is because the nightly release has some precompilation versus my local build having none, that could be all the difference.

danmoseley commented 7 years ago

@marksmeltzer re your code contribution comment - we certainly welcome any PR that improves performance so long as (1) it's backed up by profile data on a range of realistic inputs, (2) the code complexity is acceptable -- for hot functions we can tolerate some cleverness; for less interesting areas we would not, (3) it doesn't introduce breaking functionality changes, and (4) any memory or footprint change is acceptable -- the latter is mostly about generics.

As you say, we have a very high bar for duplicating functionality wholesale - it has little value in the framework as it causes confusion and it is hard to get people to move over, amongst other problems. We have done it when we are forced to make a complete break with the past in order to move forward. Such types would certainly be worth putting in your own package.

marksmeltzer commented 7 years ago

@danmosemsft, that's what I expect from a framework: a high bar for backward compatibility (and code sharing across target platforms, which I think is your point about wholesale duplication). When CoreCLR and CoreFX were first announced, I had the initial thought of how great it would be to start fresh and do everything "right". However, that would also have meant no backward compatibility and rewriting more libraries, apps, etc. Not a very good selling point for attracting an audience.

On the other hand, I think in the CoreCLR world, there will be a sizable subset of the market shooting for maximum performance and that segment would happily sacrifice backward compatibility to gain an edge. The increased performance stats would be the marketing vehicle on which to build a robust community and get the popular libraries ported. (In the mobile/embedded/IoT world, performance is often measured by the watt and any gains that yield better power usage are highly desirable.)

I think the CoreCLR/CoreFX platform can probably have a foot in both markets eventually as it matures. The current strategy of unifying around netstandard 2.0 seems like a great milestone to shoot for. The broad backward compatibility will win a lot of developers over to use the platform.

Some performance enhancements, like what I'm going to shoot for with Regex2, may not be able to survive as pure external NuGet packages. The reason is that many of the core functions that make the core library work are internals. When making an external library, those APIs disappear. Moreover, the desired goal will often require updating the contract on an existing API to add another overload, etc. (which isn't a breaking change, but such things do affect the CoreFX library). I have not looked through the regex source code in depth yet. I don't know how many internal dependencies it has and whether it would be feasible to extract it as a standalone. I'm not really looking to do a full rewrite.

The new string constructor that I want to add, for instance, is definitely something that has to live in CoreFX (or maybe string is defined in CoreCLR? I haven't checked yet.). So it is definitely a mixed bag.

The nice thing about CoreFx is that I can package it up and ship it. So if I want to change something, I can. That's great flexibility.

jkotas commented 7 years ago

I don't know how many internal dependencies it has

Zero. It should be pretty easy to make a standalone version out of it. It depends on other public APIs only: https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj#L46

many of the core functions that make the core library work are internals

We want to hear about any internal secret sauce used by the core library that is needed to build great upstack libraries. We would want to look into exposing it via public API.

marksmeltzer commented 7 years ago

@jkotas,

First, when I said "internals" I meant things declared with the "internal" access keyword. I agree with your assessment that the regex project only references public APIs, but it may still have dependencies on some APIs declared as "internal". If the regex library isn't a "friend" of the core libraries, then we can safely assume that it doesn't reference any internals. If we're on the same page about the meaning of "internal", then that's fantastic news: it would mean getting a Regex2 clone going would be as simple as a copy-and-paste plus a rename.
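
(For readers following along: "friend" here refers to the InternalsVisibleTo mechanism. A minimal sketch, with a made-up assembly name:)

    using System.Runtime.CompilerServices;

    // Members declared `internal` are invisible outside their assembly unless
    // the assembly explicitly names a "friend" that may see them:
    [assembly: InternalsVisibleTo("My.Friend.Assembly")]

    internal static class SecretSauce
    {
        internal static int FastPath(int x) => x * 2;
    }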

Second, if you really want to see a list of internal APIs that I would like exposed, that is wonderful news. I am liking this new community-first model. Even if it is only a community-too model, it is still a great development. Anyway, please prepare yourself for a deluge 😎 There are many internal APIs that I crave access to. I will post a separate issue about that topic in the near future and give you a mention on it. Maybe others will contribute their ideas too.

jkotas commented 7 years ago

the regex library isn't a "friend" of the core libraries

regex is not a friend of the core libraries. FWIW, we are trying not to use friends at all in the core libraries - not completely there yet, but close.

I will post a separate issue about that topic in the near future

Perfect. Here is the process that we follow for adding new APIs: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md . It may be useful to open an issue per set of similar APIs to feed into this process naturally.

marksmeltzer commented 7 years ago

@jkotas, wow! I did not know that was one of the design goals for CoreFX. I think that's great. I'm still getting my feet wet here... I'm used to the assumptions I've built up from working with .NET since the early alpha days. Lots of biases and assumptions need revising.

I guess I'm left wondering if this will have any impact on the full framework down the road. If we expose a really nice API in CoreFX but not the full framework, it won't be as much of a win for me. However, if work on CoreFX will be driving the full framework, at least in part, that would be really nice (e.g. adding the newly exposed APIs to netstandard). Hopefully that's part of the vision.

In the long run, CoreFX might grow up to completely replace the .NET Framework. That would be nice in many ways. But in the meantime, synchronization between the two frameworks will be a key factor for developers working on both platforms. Especially for developers like me, working on products that share code across all platforms.

brianrob commented 7 years ago

@marksmeltzer, thanks for the JitStats views. They show a diff in the count of methods, in the JIT time, and even in the set of assemblies that have JIT time. The diff in JIT time is about 240ms.

The three assemblies that have JIT time on 2.0 but not 1.1 are:

This doesn't account for the whole diff, but is a good starting point. There are also other assemblies that have more methods being compiled on 2.0 vs. 1.1.

A couple of questions:

  1. Can you tell me how you put together the 2.0 repro? Did you use a CLI / shared framework drop to build your app or did you build the framework binaries yourself (and if so, how did you produce a CLI drop)?
  2. Is your app running against 32-bit or 64-bit CoreCLR?

I suspect that there may be a couple of things going on here - answers to these questions will hopefully help to point us in the right direction.

brianrob commented 7 years ago

@marksmeltzer, I just saw your earlier message. Yes, if you're building the CoreFX libraries on your own, they are likely not pre-compiled. As part of production of the shared framework, they get pre-compiled. Thus, the binary builds posted on http://github.com/dotnet/cli will have better start-up characteristics than if you build the binaries directly in corefx and patch them into an existing drop of the SDK.

That may well be at least one of the factors contributing to the numbers you saw. Having measurements on a binary drop from http://github.com/dotnet/cli will help to confirm this.

marksmeltzer commented 7 years ago

@brianrob precompilation FTW. That didn't dawn on me until I saw the JIT stats view for 1.1 versus vNext using my local build. That made it pretty obvious it was spending a lot of time JITing my locally built CoreFX DLLs.

Apologies for any head scratching that may have caused on your end. I'm not used to this new power of being able to compile my own framework DLLs, and had taken for granted the fact that they get precompiled in the official release.

I will post new stats once I get everything synced up and do the vNext tests using the official nightly release assets to ensure parity with the other platforms' runs. That will close the gap significantly. There might not even be an issue to address. That would be a good finding.

brianrob commented 7 years ago

@marksmeltzer, no worries at all. I'll be interested to see the updated results.

karelz commented 7 years ago

From skimming the thread, it looks like there is no known problem with 2.0. Moving to Future. If there is nothing left to track on this (long) issue, please close it. Thanks!

marksmeltzer commented 7 years ago

@karelz it has been a winding journey, mostly due to my errors. However, there is still a noticeable performance drop on vNext. I will be posting data for @brianrob soon for review.

karelz commented 7 years ago

Thanks! We can upgrade the bug back to 2.0 when we have some evidence (@brianrob can take care of that)

marksmeltzer commented 7 years ago

Here's the latest results:

[screenshot: timing results table]

Here are the stack traces:

And the JIT stats (contains the HTML files):

Here's the JIT highlights for the Core 1.1 (0 samples) run:

[screenshot: JIT stats for Core 1.1]

Here's the JIT highlights for the Core vNext (0 samples) run:

[screenshot: JIT stats for Core vNext]

The bulk of the JIT increase (only ~20 ms) was in the user DLL (LivingAgile.Aspen.Core).

If I am reading that data right, then the extra JIT time accounts for ~0.02 sec of the increased time for vNext over 1.1, which leaves vNext taking an additional ~0.003 seconds beyond that. So, the extra JIT time does seem to account for most of the increase. (There also seems to be a discrepancy on the set of the JITted DLLs between 1.1 and vNext, but it doesn't seem to affect the timing overmuch.)

Is that an acceptable increase per planned design changes to the JITter in vNext? Or does it suggest something that needs to be fixed?

The increased timing numbers are small, but it is an overall 20% increase in total execution time on the 0th run compared to 1.1.

marksmeltzer commented 7 years ago

@brianrob if these numbers do not bother you and the team, I will close the issue.

These numbers are essentially the same as I reported in the initial issue/question.

karelz commented 7 years ago

Do you see the slowdown in JIT for all JITting scenarios, or is it specific to this particular one? We might want to loop in the JIT folks to take a look ... I'll let @brianrob first confirm the regression and then route it appropriately.

marksmeltzer commented 7 years ago

@karelz this is my one and only CoreFx app thus far. So I don't have any more scenarios to throw at it. Yet. There is more in the pipeline though. The good news is the runtime performance numbers look very solid.

For my app, I can work around the startup time issue. I'm planning to try background JIT next week (or later this week) and see what that does. But I will likely end up with a TSR server that the client CLI app connects to in order to minimize the cycle time for individual client runs. So, for me this isn't a killer issue.

I think my app/library are on the fairly simple side though. So the jump in JIT times between 1.1 and vNext was unexpected. If there's a good reason for it, I will accept that and move on 😄

marksmeltzer commented 7 years ago

@brianrob is this still being looked into? Or should I close it out?

My solution will be going up on github next week (knock on wood). So those assets could be leveraged for direct review if needed.

AndyAyersMS commented 7 years ago

Speculating here, but I wonder if the difference in jit time is that the 1.1 jit was built with PGO and the 2.0 jit is not. 10% is in the right ballpark for this.

karelz commented 7 years ago

@AndyAyersMS @brianrob do you have a suggestion how to confirm that? I assume generating PGO for 2.0 JIT is a lot of work. What about creating custom 1.1 without PGO for comparison? Could we give @marksmeltzer instructions how to do that?

Another option is to rely on @AndyAyersMS's guess and trust that any significant regression in JIT would be caught by our internal perf runs ... (do we have such results compared to 1.1 ... with PGO?)

AndyAyersMS commented 7 years ago

I believe in CoreCLR you need to do something to opt in to PGO builds, so any release bits you build yourself are non-PGO. So likely one could enlist in the 1.1.0 branch, build release there, and drop the resulting clrjit.dll into the installed CLI instance to get a "non-PGO" JIT time for 1.1.0.

@dpodder can speak more authoritatively on when and how PGO kicks in during builds.

For Windows binaries, you can see if the bits are PGO-optimized by looking at the debug directory section in the PE file (e.g. via link -dump -headers clrjit.dll). For instance, the 1.1.0 clrjit.dll shows:

  Debug Directories

        Time Type        Size      RVA  Pointer
    -------- ------- -------- -------- --------
    58139069 cv            6D 000E7910    E6910    Format: RSDS, {4AC8024F-8A37-4B35-BC57-6E683C330D4E}, 1, D:\A\_work\33\s\bin\obj\Windows_NT.x64.Release\src\jit\standalone\Release\clrjit.pdb
    58139069 feat          14 000E7980    E6980    Counts: Pre-VC++ 11.00=0, C/C++=136, /GS=136, /sdl=0, guardN=135
    58139069 coffgrp      450 000E7994    E6994    50475500 (PGU)

and the last line indicates the build used PGO (here in the "update" form, aka PGU).

Doing the same for a 2.0.0-preview1-001915-00 clrjit.dll shows that it was only built with LTCG and not PGO:

  Debug Directories

        Time Type        Size      RVA  Pointer
    -------- ------- -------- -------- --------
    58E34963 cv            6D 000CAA8C    C928C    Format: RSDS, {45197198-669D-4980-B3CC-C47AB0BA17EB}, 1, E:\A\_work\56\s\bin\obj\Windows_NT.x64.Release\src\jit\standalone\Release\clrjit.pdb
    58E34963 feat          14 000CAAFC    C92FC    Counts: Pre-VC++ 11.00=0, C/C++=138, /GS=138, /sdl=0, guardN=137
    58E34963 coffgrp      324 000CAB10    C9310    4C544347 (LTCG)

So all other things being equal this 2.0 preview jit would be expected to run 10-15% slower than the 1.1.0 release version.

I would expect sometime fairly soon the official 2.0 preview drops will start being PGO'd. Daniel probably knows more about this too.

marksmeltzer commented 7 years ago

Is that turned off for now to reduce build times? To support debugging? Something else?

That could well be the majority of the difference.

The last comparison chart I posted of 1.1 and 2.0 showed a 20% increase (scroll up this thread), which is a bit higher than the 10-15% expectation. It could be the specifics of this particular code path though.

EDIT (by @karelz): Deleted the email-inserted previous reply

karelz commented 7 years ago

Historically we turned it off (in daily builds), because the PGO data became stale quickly. And we did not have great automation around the collection for .NET Core yet. Therefore keeping it up and running all the time was high cost, and we had things with a higher return on investment to fund. That's why we usually do a PGO training data refresh only shortly before shipping a release. Disclaimer: This was the state ~1.5 years ago (when I was on the Perf team), but IMO it is likely still accurate info.

For more context: The PGO automated runs were troublesome on Desktop/.NET Native as well, and it was very costly to keep them up and running on a daily basis (we had several pushes of cleanup, and we had to set up a health monitoring system for new failures in scenario training). Key sources of trouble were the fragility of the automation scripts around PGO collection (at least on Desktop; I think we built brand new ones for .NET Native/.NET Core) and also the fragility of the end-to-end scenarios themselves (e.g. Visual Studio on Desktop, real apps on .NET Native, ASP.NET apps on .NET Core). The scenarios were typically not regular tests and required non-trivial maintenance -- hence the high cost. It was usually cheaper to resurrect the PGO training lab just before shipping.