dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

State / Direction of C# as a High-Performance Language #10378

Closed ilexp closed 6 years ago

ilexp commented 8 years ago

I've been following recent development of C# as a language and it seems that there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?

For context, I'm using C# mostly for game development (as in "lowlevel / from scratch") which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.

Issues related to this:

Other sentiments regarding this:

This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?

DemiMarie commented 8 years ago

Here is my thought:

damageboy commented 8 years ago

@playsomethingsaxman:

On that subject, I absolutely hate that async/iterator methods can't be unsafe. A lot of potential is lost with this blindly religious constraint.

I totally agree with you :) I recently tried to take it up on the roslyn gitter and got into a semi-serious discussion with @agocke about this on April 14th. If you follow it through (https://gitter.im/dotnet/roslyn?at=570fcf5b548df1be102ceded), it's pretty much me, @jmarolf and @agocke for a good couple of pages.

It seems like we gravitate towards the end to a conclusion that it might be ok, although @agocke is really not sure about the voodoo effects of allowing such code... @agocke mentioned that the main hurdle right now is the fact that the async/yield state-machine generator (for lack of a better name) is incapable of doing stack-spilling of unsafe value types into the state-machine...

damageboy commented 8 years ago

@drbo Can you elaborate, perhaps with concrete examples what it is you exactly mean by "CIL-specific" optimizations?

benaadams commented 8 years ago

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

I cannot believe I'm even hearing this. You know the saying: garbage in, garbage out. You can optimize a lot of garbage into really high-quality garbage, or optimize non-garbage. I prefer the latter.

@playsomethingsaxman if the optimizations can be done at the cil/il level, then they improve all languages, not just C#. That's what I think he meant?

SunnyWar commented 8 years ago

Concerning cil/il level optimizations... here is an example: return x % 2 != 0 ? 4 : 2;

    IL_0000: ldarg.0
    IL_0001: ldc.i4.2
    IL_0002: rem
    IL_0003: brfalse.s IL_0007
    IL_0005: ldc.i4.4
    IL_0006: ret
    IL_0007: ldc.i4.2
    IL_0008: ret

Compared to a function that produces identical results: return (x & 1) != 0 ? 4 : 2;

    IL_0000: ldarg.0
    IL_0001: ldc.i4.1
    IL_0002: and
    IL_0003: brtrue.s IL_0007
    IL_0005: ldc.i4.2
    IL_0006: ret
    IL_0007: ldc.i4.4
    IL_0008: ret

The IL for the second will produce faster machine code than the first on all known environments. So this level of optimization (pattern matching, as exists in all optimizing C++ compilers) should be done not by the back-end compiler but by the front-end compiler. Essentially I'm claiming that the IL in both cases should have been the same, and it should have been the second one.

temporaryfile commented 8 years ago

So this level of optimization (pattern matching, as exists in all optimized C++ compilers) should be done not by the backend compiler but by the front end compiler.

I will confess that the C# compiler is bewilderingly fast and I am in total admiration of the team like you have no idea. For rapid debugging, that's crucial. For release builds, it's almost the exact opposite: let the build machine bear the cost so the end-users don't have to.

I totally agree with you :) I recently tried to take it up on the roslyn gitter and got into a semi-serious discussion with @agocke about this on April 14th, if you follow it through

You are my hero. All I want to do is dereference a pointer, I don't need "fixed" or managed reference management or anything clever. They will argue, "you should be using C++", to which I would respond, "I can outscale C++ on C#6 Win32 interop because async/await inspires more powerful custom algorithms that would make a C++ dev shriek".

SunnyWar commented 8 years ago

I honestly don't understand the "you should be using C++" mentality. I know it exists out in the real world, but it appears to have infected the thinking of the .Net teams. I think they should be striving to replace C++ with C# and other CIL languages in nearly all areas of performance. It appears to be doable. We just need to do it.

Some might say, "no, the garbage collection is always a tax that will slow things down". Yes and no. There are lots of places where the GC does a lot of work today that is unnecessary. Eliminate those and the excuses start to drop off. For example: dotnet/coreclr#4365 and dotnet/roslyn#161

Then there are the algorithms (which are arguably the bulk of all code) that don't generate any garbage. In those cases I see very few barriers to very high performance. I'm often surprised at how much faster C++ is at basic math. Why? There's no excuse other than that we just haven't done it. Simple as that.

As a matter of fact, C# does a better job than C++ of capturing the "intent" of the developer. This means the compiler should, in theory, be able to produce even more optimized machine code than C++. Add to that the JIT's knowledge of the specific hardware, and C++ should be the slower option.

BTW: All of that was promised to us back when C# was first introduced. It's time the promise becomes a reality.

agocke commented 8 years ago

@damageboy @playsomethingsaxman You now have two people asking for it, so I take this request more seriously now :)

If you want to create an issue, that works for me.

damageboy commented 8 years ago

@agocke we are few and have a few loose nuts, but pleasant otherwise :)

As long as we're on the subject of CIL / C# Optimization, has there been any discussion of converting Linq constructs into fast, allocation-less imperative code, such as what is being done here: https://github.com/nessos/LinqOptimizer

This looks like a grand optimization/feature for roslyn: Linq!, now with actual performance :)
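To make the contrast concrete, here's a hedged sketch of the kind of rewrite such a tool performs (method names are illustrative, not LinqOptimizer's actual API):

```csharp
using System;
using System.Linq;

// Illustrative only -- the method names are made up, not from LinqOptimizer.
// The LINQ form allocates enumerator objects on every call (and delegate
// objects too, unless the compiler caches the lambdas):
static int SumOfSquaresLinq(int[] xs) => xs.Where(x => x > 0).Select(x => x * x).Sum();

// The imperative form an optimizer would lower it to allocates nothing:
static int SumOfSquaresLoop(int[] xs)
{
    int sum = 0;
    for (int i = 0; i < xs.Length; i++)
        if (xs[i] > 0)
            sum += xs[i] * xs[i];
    return sum;
}
```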

agocke commented 8 years ago

@damageboy We want to fix this at a lower level, by optimizing delegates and possibly providing a new, lower allocation API instead of IEnumerable (IFastEnumerable?).

@jaredpar Has a bunch of crazy ideas about this.

temporaryfile commented 8 years ago

We want to fix this at a lower level, by optimizing delegates and possibly providing a new, lower allocation API instead of IEnumerable (IFastEnumerable?).

@agocke Don't forget, we still BADLY need IAsyncEnumerable! I can't tell you how bad!

SunnyWar commented 8 years ago

@damageboy I've been hoping for a LinqOptimizer for years! Here you have a functional and very expressive representation of intent, and it's slower than the procedural alternative. What the __??

I don't recall the last dev team I worked with that didn't ban Linq for performance reasons... and it should be the opposite. So sad.

mattwarren commented 8 years ago

I don't recall the last dev team I worked with that didn't ban Linq for performance reasons... and it should be the opposite. So sad.

Including Roslyn itself!

But to be fair, that's only in hot paths; see the coding guidelines

DemiMarie commented 8 years ago

@damageboy By "CIL-specific optimizations" I mean optimizations that are relevant for languages that compile to CIL, but not to C or C++.

LLVM is a fantastic compiler optimizer and backend. However, it only knows about constructs found in C-like languages and lacks optimizations that aren't relevant to C. These include:

Furthermore, LLVM is slow. Not by standards of C compilers, but it is slow for a JIT. The LLVM developers are working on improving this speed, but for now there are no optimizers that are fast enough AND generate sufficiently fast code, and this may not change. That is why tiered JIT compilation and/or AOT compilation is crucial: applying heavy optimizations to all code at runtime simply takes too long. Only an AOT compiler or a higher-tier JIT can afford to take the amount of time needed.

DemiMarie commented 8 years ago

@SunnyWar One problem is that unlike Haskell, C# is not purely functional. That means that most of the rewrites you would like to make are not necessarily valid. GHC's solution is to apply rewrite rules that automatically rewrite the code. These rules are valid for Haskell, but they are not necessarily valid for C# as they may change evaluation order.

xen2 commented 8 years ago

Concerning IEnumerable<T>, one problem is the lack of devirtualization by language design. Basically you have to pay the price of an interface call + GC allocation in every generic method that takes an IEnumerable<T> and does a foreach loop on it.

Ideally, it would be nice if, in some specific cases, we could take a more "template" approach to the problem, where I could write a method like LINQ's Sum() based on traits/contracts (there is work in progress on that, but I'm not sure it would allow such cases):

Sum<T, U>(T enumerable) where T has Trait { U GetEnumerator(); }
{
    foreach (var i in enumerable)
    {
    }
}

Idea is to force this kind of code to be JITted and optimized for each separate type, using its overload of GetEnumerator() returning a struct (most .NET collections provide one).

This kind of technique is used widely in the C++ standard library (and is quite powerful when combined with allocation-less functors for std::sort(), etc.).

Right now, this is impossible in C# short of rewriting one method such as Enumerable.Sum() per type you want to optimize (basically doing the template instantiation job manually). As a result, you pay the "virtualization" price of IEnumerable (interface calls + GC allocation + boxing) almost all the time, even when the compiler had more semantic information when compiling that specific generic instantiation. This prevents a lot of straightforward inlining/optimizations as well.

I agree this shouldn't be done in all cases (as it dramatically increases code size), but in various scenarios (esp. small helper methods such as LINQ, Sort(), etc.), it could go a long way toward avoiding unnecessary GC pressure and also improving performance in tight loops.

benaadams commented 8 years ago

@xen2 bit horrible, but something like?

interface IEnumerable<out T, TEnumerator> where TEnumerator : struct, IEnumerator<T>
{
    TEnumerator GetEnumerator();
}

public T Sum<T, TEnumerable, TEnumerator>(TEnumerable e)
    where T : ...
    where TEnumerable : IEnumerable<T, TEnumerator>
    where TEnumerator : struct, IEnumerator<T>
{
    T sum = default(T);
    foreach(var value in e)
    {
        sum += value;
    }

    return sum;
}

Assuming there was something for the + operator
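For what it's worth, a compilable variant of that shape works today if you specialize the element type (since there's no constraint for operator +) and let TEnumerator be a struct such as List<int>.Enumerator:

```csharp
using System;
using System.Collections.Generic;

// Compilable variant of the sketch above, specialized to int because C# has no
// "+ operator" constraint. Because TEnumerator is constrained to be a struct,
// the MoveNext()/Current calls are non-virtual and nothing is boxed or allocated.
static int Sum<TEnumerator>(TEnumerator e)
    where TEnumerator : struct, IEnumerator<int>
{
    int sum = 0;
    while (e.MoveNext())
        sum += e.Current;
    return sum;
}
```

Usage: Sum(myList.GetEnumerator()) -- List<int>.GetEnumerator() returns the struct List<int>.Enumerator, so no IEnumerable<int> interface dispatch or heap allocation occurs.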

temporaryfile commented 8 years ago

Ideas for speed (not sure which is language and which is CLR):

benaadams commented 8 years ago

@playsomethingsaxman if you are dealing with already fixed memory then you might want to check out the System.Runtime.CompilerServices.Unsafe library

https://github.com/dotnet/corefx/blob/master/src/System.Runtime.CompilerServices.Unsafe/src/System.Runtime.CompilerServices.Unsafe.xml

temporaryfile commented 8 years ago

If you are dealing with already fixed memory then you might want to check out the System.Runtime.CompilerServices.Unsafe library

That code gets me thinking... remember those __asm scopes in C++? Why can't we have the same for IL in C#? Or heck, why not just make it a method specifier and be able to plop IL methods right between C# methods? I used to hate having to run MASM separately, used to hate disturbing that locality.

benaadams commented 8 years ago

but __asm scopes would be processor specific so would require more ifdefs and validation? /cc @migueldeicaza

temporaryfile commented 8 years ago

but __asm scopes would be processor specific so would require more ifdefs and validation?

I meant IL, not asm. This crazy business:

    .maxstack 2
    ldarg.0
    ldarg.1
    ldobj !!T
    stobj !!T
    ret

adamchester commented 8 years ago

@playsomethingsaxman F# does this in some special places

Edit: fixed link

aL3891 commented 8 years ago

@playsomethingsaxman This would enable some tasty optimizations, if only in mscorlib, which can't depend on things like Unsafe (I'm sure there are ways around that, but this would make it easier)

svick commented 8 years ago

@playsomethingsaxman

temporaryfile commented 8 years ago

You seem to be focusing on the speed of Guid.Parse() quite a lot

Guid.Parse() is just an example everyone understands. But if an actual problem is the price of admittance, I have long lists of them in static classes that run linearly in the static constructors.

Regarding inline IL, what could you achieve using that that you can't already do using C# or S.R.CS.Unsafe?

That question is way beyond my pay grade.

Here's another strong idea: give IntPtr/UIntPtr full operator support like int/uint, maybe even intptr/uintptr keywords (YES PLEASE). I can't tell you all the places we have to use proxy variables, casting and copying, casting and copying. It got so freaking out of control that I had to create my own IntPtrEx/UIntPtrEx structs (sadly they don't perform in tight loops yet). Right now most of us cheat and use long for managed-ish pointer arithmetic because long fits both 64-bit and 32-bit. But will we remember to update that code when the 128-bit omg word size comes out? This is one of those situations where excessive dogma will actually force people into producing less provable code.
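To illustrate the proxy-variable dance (AlignUp is a hypothetical helper, not a framework API):

```csharp
using System;

// The casting-and-copying dance the comment describes: IntPtr has no +, &, or
// << operators, so arithmetic round-trips through long and back.
// AlignUp is a hypothetical helper name, not part of the framework.
static IntPtr AlignUp(IntPtr p, int alignment)
{
    long v = p.ToInt64();                              // copy out to a proxy
    long aligned = (v + alignment - 1) & ~(long)(alignment - 1);
    return new IntPtr(aligned);                        // copy back
}
```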

andre-ss6 commented 8 years ago

I tend to agree with others that what we really need right now is a [better] AOT compiler and/or a better JIT.

I have the impression that C# is already fast enough when dealing with "simple" code (structs, no virtual calls, only old-school imperative code (no delegates, no linq)). What is the point of having a lot of new performance features but having to write C++-style code? (OK, maybe an exaggeration here, but hey) Then why not just go back to C++? I know, I know, we want to use C#. I'm not one of those "go to C++" folks; my point is: high performance code in C# should still feel like C#.

I think we first need better compilers. Then new language features.

temporaryfile commented 8 years ago

The request to have at least some hinting to the GC, even if not direct control, is at the top of our list. As programmers, we are in a position to declaratively "help" the GC but have no opportunity to do so.

Not only do we not have the opportunity, we are actively told not to, in the hope that a few bright people in our camp know when to ignore the advice.

What is the point of having a lot of new performance features but having to write C++-style code?

You are probably already writing C++-style code if you're any bit worried about performance beyond Big O. You won't even have a choice if any of the .NET Framework's classes under-perform, like ObservableCollection<> which doesn't offer range operations and causes many a XAML list to populate slowly.

DemiMarie commented 8 years ago

That is why I am so interested in LLILC AND in a tiered JIT.

LLILC can perform as many optimizations as Clang can, and Rust has shown that array bounds checks do not create a huge problem. But LLVM is slow – too slow for a JIT that needs to compile all code while the user is waiting.

An AOT compiler or a tiered JIT compiler with deoptimization is the solution. That way LLVM only sees a much smaller amount of code while the user is waiting – or can see all of it, offline.

ilexp commented 8 years ago

I think we first need better compilers. Then new language features.

Relying on the compiler to guess what we're doing and how to optimize it has limits, because it can only assume so much. Having an explicit way of performing certain optimizations as a programmer is more reliable across runtimes and configurations and can have a greater effect, because we as programmers know a lot more context.

I agree that better compilers would be nice. I disagree that "we need them first". Having the right tools for performing certain optimizations explicitly is just as important.

OtherCrashOverride commented 8 years ago

It doesn't really matter how optimized or awesome a JIT compiler is when the garbage collector has stopped all threads of execution to do its work. This is the issue game developers face: everything freezes when a GC happens, regardless of how optimized the code is. It's possible to reduce garbage, but not to eliminate it entirely in real world code.

GSPP commented 8 years ago

@OtherCrashOverride shouldn't the rather new background GC modes fix that issue? Assuming pause times for G0 and G1 are acceptable at any time.

OtherCrashOverride commented 8 years ago

shouldn't the rather new background GC modes fix that issue?

I doubt it will have any real world effect. Graphics APIs typically issue all drawing on a single thread. When that thread is frozen, so is rendering, resulting in a defect perceptible to the end user. Someone will need to test and measure.

GSPP commented 8 years ago

@OtherCrashOverride maybe you should research what that mode does. It promises to only ever pause threads briefly except under rare circumstances.

You seem to be a game developer and I have wondered for a long time whether that mode solves the problem. I'd like your opinion on it.

OtherCrashOverride commented 8 years ago

It promises to only ever pause threads briefly except under rare circumstances.

This goes back to the determinism discussion. If you can't predict when and for how long, you can't avoid rendering stutter. You can make everything massively parallel, but you still have to "join" when it's time to render the current state. This means you will only ever be as fast as the GC, despite any compiler and/or code optimizations.

At a minimum, the GC would need to be schedulable and interruptible. The program would need to be able to say "the runtime can GC at this point, but never before or after" and "the GC has this many nanoseconds of time to accomplish whatever it can". This then leads to a possible slow build-up of garbage if there is not enough allocated time to clean it all up. This is the reason a reference-counting strategy was mentioned earlier. It's completely deterministic and avoids these issues.

Right now there appears to be a "civil war" of sorts in the .Net world. CoreCLR's "customer" is ASP.Net. This puts the feature focus on things that make sense in that world. An example is the JSON/XML/XAML project file battle. After CoreCLR ships and the "dust settles", then would be the time to revisit issues such as these. Currently, not even the tooling is available to make CoreCLR useful in any scenario outside ASP.Net.

OtherCrashOverride commented 8 years ago

Btw...

As alluded to earlier, there is still an issue with the GC performing its duties on a different thread of execution. GUI APIs and graphics APIs such as OpenGL are not thread-safe: your operations must be performed on specific threads. A concrete example is creating a GPU resource in OpenGL and destroying it as part of a dtor. Currently, there is no way to do this in .Net. You have to delegate the job from the dtor (finalizer) back to the appropriate thread. This means GUI and graphics APIs typically end up using IDisposable, which requires special code support to check whether the object is disposed at every possible execution point. It also requires that code manage the lifecycle and call Dispose() only when appropriate, effectively making the GC system pointless.

temporaryfile commented 8 years ago

It also requires that code manage the lifecycle and call Dispose() only when appropriate effectively making the GC system pointless.

I think serious code is ALWAYS going to be deterministic, especially when freeing resources that can only be freed on the creating thread. The GC's divinity is with handling all the tedious collection types in between, and making sure that SafeHandles are only leaked to the finalizer. When writing frameworks I allocate from the unmanaged heap as much as possible while letting the higher glue code "play .NET".

DemiMarie commented 8 years ago

The only solution for games is an almost-fully concurrent collector with strictly bounded pause times. Such a collector may require a read barrier and have reduced throughput. Azul's Zing JVM does this, but the algorithm it uses is patented. The Shenandoah collector proposed for OpenJDK may not be able to support interior pointers. The only other option that I know of is a non-moving collector.

benaadams commented 8 years ago

Are you using the Server GC which mostly runs in parallel? Also are you setting GCLatencyMode.SustainedLowLatency mode during the times when you can't allow blocking collections?

https://blogs.msdn.microsoft.com/dotnet/2012/07/20/the-net-framework-4-5-includes-new-garbage-collector-enhancements-for-client-and-server-apps/
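For reference, a sketch of what opting into that mode looks like around a latency-critical section (RenderFrames is a hypothetical frame loop, not a real API):

```csharp
using System;
using System.Runtime;

// Sketch: requesting low-latency GC around a latency-critical section.
// SustainedLowLatency asks the runtime to avoid blocking gen-2 collections
// while the mode is set. RenderFrames() stands in for a hypothetical frame loop.
void RenderFrames() { /* hypothetical: draw until the scene ends */ }

GCLatencyMode previous = GCSettings.LatencyMode;
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
try
{
    RenderFrames();
}
finally
{
    GCSettings.LatencyMode = previous;   // restore the prior policy
}
```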

SunnyWar commented 8 years ago

I know I can sound very critical sometimes, so I decided to take a look at what performance improvements have been made, or are being made, in the CLR. When I choose the "label:optimization is:closed label:"4 - Done"" query, it comes up with NO results. So then I looked at "label:performance is:closed label:"4 - Done""... no results. So then I looked at "is:open label:performance label:"2 - In Progress"" and I see only one (thank you @stephentoub). And "is:open label:optimization label:"2 - In Progress"" gives no results.

So I hear a lot of talk and a lot of great ideas, and I see many things put in the "future" bin for the CLR, but at the moment very, very little appears to be actually happening.

stephentoub commented 8 years ago

very, very little appears to be actually happening

Lots of PRs to both coreclr and corefx have been merged to improve performance. Many just haven't been labeled as such.

OtherCrashOverride commented 8 years ago

Are you using the Server GC which mostly runs in parallel? Also are you setting GCLatencyMode.SustainedLowLatency mode during the times when you can't allow blocking collections?

That question should probably be directed at the Unity3D guys. Since their targets include non-x86 devices like ARM-based mobile phones, I doubt it's a feature they could take advantage of. In such constrained environments they do not have the luxury of throwing more cores or more RAM at the problem.

The GC discussion should probably be spun off to a different topic. I think everyone is in agreement that GC does affect performance; however, there is little C# as a language can do about it. In fact, "inline IL" is about the only language feature being asked for. The rest are runtime or framework improvements.

benaadams commented 8 years ago

In fact, "inline IL" is about the only language feature being asked for.

Is this covered by? https://github.com/dotnet/roslyn/issues/11475

agocke commented 8 years ago

@OtherCrashOverride Actually, the purpose of the ref-locals/ref-returns feature is to partially address these GC considerations. It makes it much easier to work with structs without allocation in critical paths.

OtherCrashOverride commented 8 years ago

Is this covered by? #11475

I was hoping to see inline IL. I don't think adding a pseudo-language is the answer (it should be enough to learn IL, not additionally an IL for IL). I'm really tired of flame wars, so I will leave that topic alone.

Actually, the purpose of the ref-locals/ref-returns feature is to partially address these GC considerations.

Something like that can already be done today by using a single-element array of a struct as a parameter (and keeping it around). It's syntactically much cleaner than that proposal (IMHO). It's often a solution when you need to use the struct for an interop (GCHandle.Alloc) return result. Again, it's not worth the flame war, so I will leave that topic alone too.

Rather than debates and drama, it's probably better to just fork everything after Release and try out alternatives. Then everyone can make more informed feature choices.

jcdickinson commented 8 years ago

Something like that can already be done today by using a single-element array of a struct as a parameter (and keeping it around). It's syntactically much cleaner than that proposal (IMHO).

This can also be achieved with ref parameters, which it is in many game frameworks. In terms of code clarity consider:

// Without ref-returns and ref parameters for operators, this copies (fermi estimate, probably less):
// 4 rows * 4 columns = 16 cells 
// 16 cells * 4 bytes for float = 64 bytes per stack frame
// 64 bytes per stack frame * (1 copy for call + 1 copy for return) = 128 bytes per call
// 128 bytes per call * 2 calls = 256 bytes
// However, the code is much clearer.
var wvp = world * view * projection;

vs.

// This is zero-allocation, workaround for the lack of ref-returns and ref parameters for operators.
// However, the code makes developers cry blood.
var wvp = new Matrix();
Matrix.Multiply(ref world, ref view, out wvp);
Matrix.Multiply(ref wvp, ref projection, out wvp);

Arguably DOP would alleviate this (as it does use arrays); however, something so simple shouldn't have such poor readability and should not deceptively copy so much memory.

martinvahi commented 8 years ago

If you are planning future-proof high-performance features for C#, then please do not leave out the research done by Tucker Taft on his ParaSail programming language.

AlgorithmsAreCool commented 8 years ago

@martinvahi Are there any specific parts of interest in ParaSail that would be applicable to C#?

martinvahi commented 8 years ago

@AlgorithmsAreCool I do not know the answer to your question yet, but I do know that I'm not capable of keeping track of tens of threads manually, which means that "data parallelism" (call it map-reduce, if you will) is currently the only way for a person like me to efficiently utilize multi-core CPUs. The way I understand it in 201607, the general idea behind algorithm parallelization is that loop iterations that do not depend on each other can be executed in any order, including on separate CPUs or even on different machines. From that point onward it's a matter of creative thinking about how to re-arrange a program into as-parallel-as-possible components, while keeping in mind that no program can run in less time than it takes its un-parallelizable sequential parts to execute (Amdahl's law).

Maybe the Abstract Syntax Tree can be somehow analyzed and re-arranged to utilize N threads as efficiently as possible. I do not know the answer yet, but ParaSail is certainly one thing to study when planning new performance features.

svick commented 8 years ago

@martinvahi

From the page about ParaSail you linked to:

given a ParaSail expression like F(X) + G(Y), the language rules ensure that it is safe to evaluate F(X) and G(Y) in parallel

I don't see how something like that could work on the CLR. .Net does have global state, parameter aliasing, pointers (both actual pointers and references, which I think are effectively the same thing in this case), etc.

This might not make automatic parallelization impossible, but it does make it much harder.
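For illustration, a small example of the global-state hazard described above:

```csharp
using System;

// Why F(X) + G(Y) is not automatically parallelizable in C#: F and G may both
// touch shared mutable state. Evaluated sequentially left-to-right (as C#
// guarantees) this is well-defined; evaluated concurrently, the two calls
// would race on `counter` and the result would be nondeterministic.
static int counter = 0;
static int F(int x) { counter++; return x + counter; }
static int G(int y) { counter++; return y * counter; }

// Under left-to-right evaluation: F(1) returns 1 + 1, then G(2) returns 2 * 2.
```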