dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

State / Direction of C# as a High-Performance Language #10378

Closed ilexp closed 6 years ago

ilexp commented 8 years ago

I've been following recent development of C# as a language and it seems that there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?

For context, I'm using C# mostly for game development (as in "lowlevel / from scratch") which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.

Issues related to this:

Other sentiments regarding this:

This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?

JoshVarty commented 8 years ago

A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I believe this was also requested in https://github.com/dotnet/roslyn/issues/161

HaloFour commented 8 years ago
  • The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
  • A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I don't believe that these issues can be solved without direct CLR support. The CLR limits reference types to the heap. Even C++/CLI is forced to abide by that restriction and the stack semantics syntax still allocates on the heap. The GC also provides no facility to directly target specific instances.

I wonder how much C# could make a struct feel like a class before it crosses into unsafe/unverifiable territory. C++/CLI "native" classes are CLR structs so you don't have to deal with allocation/GC but of course the IL it emits is quite nasty.

ilexp commented 8 years ago

I've added some more related issues to the above list, which hadn't been mentioned yet.

paulcscharf commented 8 years ago

I am in a very similar position to @ilexp, and generally interested in the performance of my code, and knowing how to write efficient code. So I'd second the importance of this discussion.

I also think the summary and points in the original post are quite good, and have nothing to add at the moment.

Small note on using structs sort of like classes (but keeping everything on the stack): I believe we can 'just' pass our structures down as ref for this purpose? Make sure you don't do anything that creates a copy, and it should look like a class... Not sure if that workflow needs any additional support from the language.

About memory locality: I was under the impression that if I new two class objects after each other, they will also be directly after each other in memory, and stay that way? May be an implementation detail, but it's better than nothing... That being said, I've had to move from lists of objects to arrays of structs for performance reasons as well (a good example would be particle systems, or similar simulations that have many small and short-lived objects). Just the overhead from resolving references and having to GC the objects eventually made my original solution unfeasible. I am not sure this can be 'fixed' in a managed language at all, though...
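A minimal sketch of the array-of-structs pattern described above, combined with ref passing to avoid copies. `Particle` and its fields are hypothetical illustration types, not from any of the linked issues:

```csharp
// Hypothetical particle type: a small struct stored in a flat array,
// so all particles sit contiguously in memory (good cache locality,
// no per-object headers, nothing for the GC to trace per element).
struct Particle
{
    public float X, Y;
    public float VelX, VelY;
}

static class ParticleSystem
{
    // Passing by ref mutates the array element in place -- no copy is made.
    static void Integrate(ref Particle p, float dt)
    {
        p.X += p.VelX * dt;
        p.Y += p.VelY * dt;
    }

    public static void Step(Particle[] particles, float dt)
    {
        for (int i = 0; i < particles.Length; i++)
            Integrate(ref particles[i], dt);
    }
}
```

Compared to a `List<T>` of class instances, this trades reference semantics for a single contiguous allocation and zero per-particle collection cost.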

Looking forward to seeing what others have to say on this topic!

mattwarren commented 8 years ago

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

There was a really nice prototype done by @xoofx showing the perf improvements of allowing stackalloc on reference types.

SunnyWar commented 8 years ago

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

Microsoft Research many years ago experimented with using some unused bits on each object as access counters. The researcher hacked the heap to re-organize the most-used objects so that they ended up on the same page. He showed with a sample XML parser that C# code was faster than optimized C++. The talk he gave on it was called "Making C# faster than C++". The researcher who developed the technique left MS and the research apparently died with him. He had a long list of other, similar improvements that he was planning to try. None of which, I believe, saw daylight.

Perhaps this work should be resuscitated so that the promise (made in the beginning: remember how the JITer was going to ultra-optimize for your hardware??) can be realized.

Claytonious commented 8 years ago

We are in the crowded boat of using c# with Unity3d, which may finally be moving toward a newer CLR sometime soon, so this discussion is of great interest to us. Thanks for starting it.

The request to have at least some hinting to the GC, even if not direct control, is at the top of our list. As the programmer, we are in a position to declaratively "help" the GC but have no opportunity to do so.

ygc369 commented 8 years ago

I have some ideas: https://github.com/dotnet/coreclr/issues/555 https://github.com/dotnet/coreclr/issues/1784 https://github.com/dotnet/coreclr/issues/757 https://github.com/dotnet/roslyn/issues/2171 https://github.com/dotnet/coreclr/issues/1856

IanKemp commented 8 years ago

"game development... has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization."

C# gets in the way because that's what it was designed to do.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so. Especially since the scenarios where performance is preferable to correctness are an extremely tiny minority of C# use cases.

orthoxerox commented 8 years ago

@IanKemp that's a very narrow view of C#. There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other. While C# is a garbage-collected language by design, with all the benefits and penalties that it brings, there's no reason why we cannot ask for performance-oriented improvements, like cache-friendly allocations of collections of reference types or deterministic deallocation, for example. Even LOB applications have performance bottlenecks, not just computer games or science-related scripts.

svick commented 8 years ago

@IanKemp Are you saying that unsafe does not exist? C# has had that from the start, and it's exactly for that small amount of code where you're willing to sacrifice safety for performance.
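For reference, a small sketch of that opt-in: pointer code is only legal inside an `unsafe` context and the assembly must be compiled with `/unsafe`:

```csharp
// Sums an array via raw pointers: no bounds checks inside the loop.
// Requires the /unsafe compiler switch.
static unsafe int Sum(int[] values)
{
    int sum = 0;
    fixed (int* p = values)   // pin the array so the GC can't move it
    {
        for (int i = 0; i < values.Length; i++)
            sum += p[i];
    }
    return sum;
}
```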

SunnyWar commented 8 years ago

Hey, people...try this: write a function that results in no garbage collections....something with a bunch of math in it, for example. Write the exact same code in C++. See which is faster. The C++ compiler will always generate as fast or faster code (usually faster). The Intel compiler is most often even faster...it has nothing to do with the language.

For example, I wrote a PCM audio mixer in C# and C++ and compiled it with the .NET, MS, and Intel compilers. The code in question had no GC, no boundary checks, no excuses.

  • C#: slowest
  • C++ (Microsoft): fast
  • C++ (Intel): super fast

In this example the Intel compiler recognized that the computation could be replaced by SSE2 instructions. The Microsoft compiler wasn't so smart, but it was smarter than the .NET compiler/JITer.

So I keep hearing talk about adding extensions to the language to help the GC do things more efficiently, but it seems to me the language isn't the problem. Even if those suggestions are taken, we're still hamstrung by an intentionally slow code-generating compiler/jitter. It's the compiler and the GC that should be doing a better job.

See: #4331 I'm really tired of the C++ guys saying, "we don't use it because it's too slow" when there is _very little reason_ for it to be slow.

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

msedi commented 8 years ago

I completely agree with all of the mentioned improvements. These are in my opinion absolutely mandatory. Using C# in high-performance applications is the right way. Code would be much easier to read if at least some of the suggested improvements existed. Currently we have to "leave" the language for C++ or C to create things that are not possible in C#, and I don't mean assembler instructions but very simple pointer operations on blittable data types or generics.

So, in order not to leave the language, I created unreadable code fragments just to avoid unmanaged code, because that would make me dependent on x86 or x64.

ilexp commented 8 years ago

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

From a gamedev perspective, it would be neat if there was a way to tell the runtime to perform extended JIT optimization using framework API.

Let's say by default, there is only the regular, fast optimization, the application starts up quickly and all behaves as usual. Then I enter the loading screen, because I'll have to load levels and assets anyway - now would be an excellent time to tell the runtime to JIT optimize the heck out of everything, because the user is waiting anyway and expecting to do so. This could happen on a per-method, per-class or per-Assembly level. Maybe you don't need 90% of the code to be optimized that well, but that one method, class or Assembly should be.

As far as server applications go, they could very well do the same in the initialization phase. Same for audio, image and video processing software. Extended JIT optimization could be a very powerful opt-in and on runtimes that do not support this, the API commands can still just fall back to not having any effect.

Maybe it would even be possible to somehow cache the super-optimized machine code somewhere, so it doesn't need to be re-done at the next startup unless modified or copied to a different machine. Maybe partial caches would be possible, so even if not all code is super-JITed yet, at least the parts that are will be available. Which would be a lot more convenient and portable than pre-compiling an Assembly to native machine code, simply because Assemblies can run anywhere and native machine code can not.
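A sketch of what such an opt-in might look like as framework API. Everything here is hypothetical; no such type exists in the framework:

```csharp
// Hypothetical API: request aggressive re-JITting while the user is
// already waiting (e.g. during a loading screen). Not a real .NET type.
public static class JitOptimization
{
    // Per-method, per-class, or per-assembly granularity, as described above.
    public static void OptimizeMethod(System.Reflection.MethodInfo method) { }
    public static void OptimizeType(System.Type type) { }
    public static void OptimizeAssembly(System.Reflection.Assembly assembly) { }
}

// During the loading screen (Renderer is a placeholder for one of your
// hot types); on runtimes without support, these calls would be no-ops:
// JitOptimization.OptimizeAssembly(typeof(Renderer).Assembly);
```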

All that said, I think both allowing the JIT to do a better job and allowing developers to write more efficient code in the first place would be equally welcome. I don't think this should be an either / or decision.

xoofx commented 8 years ago

Having advocated performance for C# for many years, I completely concur that it would be great to see more investments in this area.

Most notably on the following 3 axes:

  1. Allow switching on a better (but slower) code-gen JIT. There is high hope that this will be fulfilled by the ongoing work on LLILC, for both JIT and AOT scenarios. Note that many platforms (e.g. iOS, UWP/XboxOne, PS4) don't support JIT scenarios. But it will take time to achieve even performance parity with the current JIT, and there are some language/runtime constraints that could make full optimization difficult (GC statepoints, array/null/arithmetic safety checks, etc.)
  2. Improve the language (sometimes with proper JIT/GC support) in ways that could help in this area. That includes things listed above like ref locals, array slices, string slices... and even builtin utf8 strings... Some hacks can be done by post-processing IL and have been abused in many projects, but it would be great to have these little things available without making any IL voodoo.
  3. Put a lot more emphasis on memory management, data locality and GC pressure
    • Standard improvements like stack alloc for classes, embedded class instances, borrowed pointers
    • Rethink our usage of the GC; this is a bit more problematic, as I haven't seen many proven models in production (things like explicit vs. implicit management of GC regions, to allocate known objects to a proper region of objects related in terms of locality/longevity)

Unfortunately, there are also some breaking-change scenarios that would require forking the language/runtime to correctly address some of the intrinsic weaknesses of the current language/runtime model (e.g. things that were done for Midori, for their Error Model or safe native code, etc.)

svick commented 8 years ago

@SunnyWar I think there's enough room to optimize both code generation for math and the GC.

As to which one should have higher priority, keep in mind that it's relatively easy to work around bad performance in math by P/Invoking native code or using Vector&lt;T&gt;. Working around bad performance due to GC overhead tends to be much harder, I think.

And since you mention servers, a big part of their performance is things like "how long does it take to allocate a buffer", not "how long does it take to execute math-heavy code".
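For context, the Vector&lt;T&gt; workaround mentioned above looks roughly like this (using System.Numerics, shipped in the System.Numerics.Vectors package):

```csharp
using System.Numerics;

static class VectorMath
{
    // SIMD-accelerated element-wise addition; the JIT maps Vector<float>
    // operations to SSE/AVX instructions where the hardware supports them.
    public static void Add(float[] a, float[] b, float[] result)
    {
        int i = 0;
        int width = Vector<float>.Count;        // e.g. 8 floats with AVX
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }
        for (; i < a.Length; i++)               // scalar remainder
            result[i] = a[i] + b[i];
    }
}
```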

GSPP commented 8 years ago

I'm adding JIT tiering to the list of features I see as required to make C# a truly high performance language. It is one of the highest impact changes that can be done at the CLR level.

JIT tiering has impact on the C# language design (counter-intuitively). A strong second tier JIT can optimize away abstractions. This can cause C# features to become truly cost free.

For example, if escape analysis and stack allocation of ref types was consistently working the C# language could take a more liberal stance on allocations.

If devirtualization worked better (right now: not at all in RyuJIT), abstractions such as Enumerable.* queries could become cost free (identical performance to manually written loops).

I imagine records and pattern matching are features that tend to cause more allocations and more runtime type tests. These are very amenable to advanced optimizations.

OtherCrashOverride commented 8 years ago

Born out of a recent discussion with others, I think it's time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .NET is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

Support for SSE4 Intrinsics: CoreFx Issue 2209

This is completely useless on the billions of ARM devices out there in the world.

With regard to the GC discussion, I do not think that further GC abuse/workarounds are the solution. Instead there needs to be a deterministic alloc/ctor/dtor/free pattern. Typically this is done with reference counting. Today's systems are multi-core, and today's programs are multi-threaded. "Stop the world" is a very expensive operation.

In conclusion, what is actually desired is the C# language and libraries but on top of a next-generation runtime better suited for the needs of "real-time" (deterministic) development such as games. That is currently beyond the scope of CoreCLR. However, with everything finally open source, its now possible to gather a like minded group to pursue research into it as a different project.

TimPosey2 commented 8 years ago

I'm doing a lot of high-perf / low-latency work in C#. One thing that would be "the killer feature" for perf work is for them to get .NET Native fully working. I know it's close, but the recent community standups have said that it won't be part of v1.0 RTM and they're rethinking the usage for it. The VS C++ compiler is amazing at auto-vectorizing, dead-code elimination, constant folding, etc. It just does this better than I can hand-optimize C# in its limited ways. I believe traditional JIT compiling (not just RyuJIT) just doesn't have enough time to do all of those optimizations at run-time. I would be in favor of giving up additional compile time, portability, and reflection in exchange for better runtime performance; and I suspect those contributing to this thread probably feel the same way. For those that aren't, you still have RyuJIT.

Second, it would help if there were some tuning knobs available for the CLR itself.

GSPP commented 8 years ago

Adding a proposal for Heap objects with custom allocator and explicit delete. That way latency-sensitive code can take control of allocation and deallocation while integrating nicely with an otherwise safe managed application.

It's basically a nicer and more practical new/delete.

benaadams commented 8 years ago

@OtherCrashOverride @GSPP Destructible Types? https://github.com/dotnet/roslyn/issues/161

OtherCrashOverride commented 8 years ago

Ideally, we want to get rid of IDisposable entirely and directly call the dtor (finalizer) when the object is no longer in use (garbage). Without this, the GC still has to stop all threads of execution to trace object use and the dtor is always called on a different thread of execution.

This implies we need to add reference counting and modify the compiler to increment and decrement the count as appropriate such as when a variable is copied or goes out of scope. You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

Of course, all this is speculation at this point. But the theoretical benefits warrant research and exploration in a separate project. I suspect there is much more to gain from redesigning the runtime than there is from adding more rules and complication to the language.

SunnyWar commented 8 years ago

@OtherCrashOverride I've also come to the conclusion that a reference counting solution is critical for solving a number of problems.

For example, some years ago I wrote a message-passing service using an Actor model. The problem I ran into right away is that I was allocating millions of small objects (for messages coming in), and the GC pressure to clean up after they went out was horrid. I ended up wrapping them in a reference-counting object to essentially cache them. It solved the problem, BUT I was back to the old, ugly COM days of having to ensure every Actor behaved and did an AddRef/Release for every message it processed. It worked...but it was ugly, and I still dream of a day I can have a CLR-managed reference-countable object with an overloadable OnRelease, so that I can put it back in the queue when the count == 0 rather than let it be GC'd.
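A rough sketch of that wrapper pattern, with hypothetical names; the Release path here returns the message to a pool instead of letting it become garbage:

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical pooled, reference-counted message. AddRef/Release must be
// balanced by every Actor that touches the message -- the COM-style burden
// described above.
class PooledMessage
{
    static readonly ConcurrentBag<PooledMessage> Pool = new ConcurrentBag<PooledMessage>();

    int refCount;
    public byte[] Payload = new byte[256];

    public static PooledMessage Acquire()
    {
        PooledMessage msg;
        if (!Pool.TryTake(out msg))
            msg = new PooledMessage();   // pool empty: allocate a fresh one
        msg.refCount = 1;
        return msg;
    }

    public void AddRef()
    {
        Interlocked.Increment(ref refCount);
    }

    public void Release()
    {
        if (Interlocked.Decrement(ref refCount) == 0)
            Pool.Add(this);              // recycle instead of letting the GC collect it
    }
}
```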

ilexp commented 8 years ago

Don't want to detail the rest of it in this general overview thread, just regarding this specific point of @OtherCrashOverride's posting:

[...] than there is from adding more rules and complication to the language.

As a general direction of design with regard to future "efficient code" additions, I think it would be a good thing to keep most or even all of them - both language features and specialized API - hidden away just enough so nobody can stumble upon them accidentally, following the overall "pit of success" rule if you will.

I would very much like to avoid a situation where improving 0.1% of performance critical code would lead to an overall increase in complexity and confusion for the 99.9% of regular code. Removing the safety belt in C# needs to be a conscious and (ideally) local decision, so as long as you don't screw up in that specific code area, it should be transparent to all the other code in your project, or other projects using your library.

svick commented 8 years ago

@OtherCrashOverride

You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

That would require you to find and update all existing references to that object. While the GC already does that when compacting, I doubt doing it potentially at every method return would be efficient.

SunnyWar commented 8 years ago

Today, the system imposes limitations on us that are purely historical and in no way limit how things can be done in the future.

OtherCrashOverride commented 8 years ago

That would require you to find and update all existing references to that object.

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

xoofx commented 8 years ago

@OtherCrashOverride It is misleading to think that switching from a GC to a ref-counting scheme will simply solve the problem of memory management for good. The latest research on the matter is even blurring the lines between the two techniques...

AFAIK, the most recent research on the subject, conservative RC Immix (2015), which is a continuation of RC Immix (2013), which is a derivation of GC Immix (2008), shows that the best RC collector is able to only slightly outperform the best GC (conservative RC Immix vs GenImmix). Note also that you need a kind of reference-cycle collector for RC Immix to be fully working (detect cycles in object graphs and collect them). You will see in these documents that RC Immix is built upon these 3 key points:

  1. The original Immix memory organization (which is a lot related to locality and friendly with CPU cache lines)
  2. Reference counting does not occur on each object but on an Immix heap line
  3. A compacting memory scheme

Though I would personally be in favor of switching to an RC Immix scheme, mostly because it delivers better predictability in "collecting" objects (note that even with RC Immix, collection of objects is not immediate, in order to achieve better performance).

That being said, again, there is a strong need for other memory management models to fill the gap in terms of performance, because the GC/RC model by itself is not enough (alloc of classes on the stack, alloc of embedded instances in a class, borrowed/owned pointers (single-owner references which are destructible once there is no more owner)).

OtherCrashOverride commented 8 years ago

It is misleading to think that switching from a GC to a ref counting scheme will simply solve the problem of memory management for good.

The goal is not to solve the problem of memory management; rather, to make it deterministic. The focus is on when object cleanup takes place rather than how long it takes. Currently, games and other real-time systems suffer due to the GC halting everything for an indeterminate amount of time while collection takes place. Additionally, running finalizers on different threads causes issues for non-thread-safe GUI APIs and 3D APIs like DirectX and OpenGL. Controlling when an object is collected allows the developer to amortize this cost as needed for the program.

[edit] Here is an example of a real-world problem deterministic memory management would solve: http://geekswithblogs.net/akraus1/archive/2016/04/14/174476.aspx

xoofx commented 8 years ago

The goal is not to solve the problem of memory management; rather, to make it deterministic.

Sure, I know what a RC is good for and that's what I implied by :wink:

mostly because it delivers a better predictability in "collecting" objects

But the determinism can be reinforced by alternative forms. The following options are perfectly deterministic, and they deliver even better performance in their use cases compared to GC/RC scenarios, because they are lighter and exactly localized to where they are used:

Note that in order for an RC scheme to be efficient enough, you still need to batch the collection of objects (see what RC Immix is doing). Also, you usually don't expect having to run a finalizer on every object.

If you have followed the development of a language like Rust, they started with RC objects alongside borrowed/owned references, but at some point they almost completely ditched RC and are able to live largely without it. One idea lies in the fact that most objects used in a program are not expected to be concurrently shared between threads (or even stored in other objects, for stack-alloc cases, etc.), and this simple observation can lead to some interesting language designs/patterns/optimizations that RC/GC scenarios are not able to leverage.

Bottom line is: allocation on the GC/RC heap should be the exception, not the rule. That's where deterministic allocation/deallocation really starts to shine in modern/efficient language. But that's a paradigm that is not easy to introduce afterwards without strong breaking changes.

OtherCrashOverride commented 8 years ago

Before any more academic papers are cited, we should probably get back to the topic of discussion: "State / Direction of C# as a High-Performance Language"

My suggestion was effectively "Make unsafe code easier to use and more of a first-class citizen." I will add that this includes "inline assembly" of .NET bytecode.

Other than that, I do not feel that making C# more burdened with special rules and special cases is the optimal solution (re: destructible types). Instead, modifying the runtime itself could transparently extend performance gains to all .NET languages. As an example of this, a reference-counting strategy was cited. It is outside the scope to define that strategy here.

Also mentioned was that it is now possible to actually implement and test these ideas since everything is now open source. That is the point in time for discussion about implementation strategies and citing of academic papers. We can transform theory to reality and have a concrete representation of what works, what doesn't, what can be done better, and what it will break. This would be far more useful than just debating theory in GitHub comments.

Furthermore, I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

[edit] To clarify: one of the points I am attempting to illustrate is that as we continue to expand across more cores and more threads using those cores, the cost of stopping all of them to do live-object tracing becomes increasingly more expensive. This is where the reference-counting discussion comes into play. It's a suggestion, not a proposal.

xoofx commented 8 years ago

This would be far more useful than just debating theory in GitHub comments.

Precisely. I have contributed to the prototype for class alloc on the stack linked above, and I strongly encourage .NET performance enthusiasts to build such prototypes.

svick commented 8 years ago

@OtherCrashOverride

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap? That doesn't sound like a great win, especially since it also makes every dereference about twice as slow as previously.

jonathanmarston commented 8 years ago

@OtherCrashOverride

I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

I agree that intrinsics shouldn't be created for all SSE4 instructions just for the sake of allowing access to SSE4 from C#, but there are some very common bit-level operations for which it would make sense to add highly optimized implementations to the framework anyway. For these, why not make them JIT to a single instruction, if available on the executing platform?

POPCNT especially comes to mind here. Having a BitCount method for numeric types built in to the framework makes sense even without intrinsics (just look how many times MS has reimplemented this algorithm across their own codebases alone). Once it's added at the framework level, why not put the little bit of extra effort in to optimize it away during the JIT when the instructions are available to do so?
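For illustration, the classic SWAR bit count that keeps getting re-written; a framework-level BitCount could compile down to a single POPCNT where the hardware supports it:

```csharp
// Counts set bits in parallel (Hacker's Delight style):
// first 2-bit sums, then 4-bit sums, then one multiply to total them.
static int BitCount(uint v)
{
    v = v - ((v >> 1) & 0x55555555);                   // pairwise 2-bit sums
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);    // 4-bit sums
    return (int)((((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
}
// BitCount(0xFF) == 8, BitCount(0) == 0
```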

OtherCrashOverride commented 8 years ago

@svick

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap?

This discussion is not intended to define a specification for memory management. It is hoped that an implementer of such a system would be competent and intelligent enough to avoid obvious and naive design issues.

@jonathanmarston The key point is that adding something like a BitCount method is a framework and/or runtime modification. It should not require any modification to the C# language to support. I am certainly in favor of adding first-class SIMD support to the framework/runtime; however, I do not believe it has a place as part of the C# language itself, as SSE4 (or other architecture) intrinsics would.

msedi commented 8 years ago

I also agree that there should be no SSE or other intrinsics in the IL. Instead, there should be a more generic abstraction that lets the JITter recognize that there might be, for example, a vector operation. Currently, writing a simple addition of two arrays produces too much IL code.

But I must admit I'm currently not aware of what the JITter really makes out of this code.

Another thing is that it really might be helpful to go back to some inline IL, something like the asm keyword in C/C++. Even Serge Lidin and some other guys wrote pre- and postprocessors for C# to feed in IL code. But in fact I don't like this back-and-forth assembling and disassembling just to get native .NET things into my code.

OtherCrashOverride commented 8 years ago

something like the asm keyword in C/C++

I also mentioned that is something I would like. It solves problems such as this: dotnet/corefx#5474

It would also be extremely useful for 'wrappers' that interop with native libraries. Currently, something like SharpDX needs to patch IL bytecode.

temporaryfile commented 8 years ago

Born out of a recent discussion with others, I think it's time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .NET is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

On that subject, I absolutely hate that async/iterator methods can't be unsafe. A lot of potential is lost with this blindly religious constraint.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so.

C# is an anti-boilerplate language that lets you see a bigger picture, sooner. That's why we're here. C# is the only serious language that solves the confusion of continuations and that alone makes it more valuable in interop scenarios than using C++ itself. C# is fast becoming the One True Language being used in scripting AND game engines.

Especially since scenarios where performance is preferable to correctness are an extremely tiny minority of C# use cases.

If you don't pay attention to performance, you don't scale.

jcdickinson commented 8 years ago

If you want to write code that disregards correctness in favour of performance

  1. If correctness must suffer, mark it as such with unsafe.
  2. See: Rust. Correctness does not need to suffer for performance.
  3. From #118. The lack of some of these constructs actually limits the correctness of safe programs:

Interestingly, that support in the CLR is actually a more general mechanism for passing around safe references to heap memory and stack locations; that could be used to implement support for ref return values and ref locals, but C# historically has not provided any mechanism for doing this in safe code. Instead, developers that want to pass around structured blocks of memory are often forced to do so with pointers to pinned memory, which is both unsafe and often inefficient.
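The mechanism that quote describes later shipped in C# 7 as ref returns and ref locals; a minimal sketch of what it enables without resorting to pointers or pinning:

```csharp
using System;

public static class RefReturns
{
    // Returns a safe reference into the array's heap memory, rather
    // than a copy of the element or an unsafe pointer to pinned memory.
    public static ref int Largest(int[] numbers)
    {
        int index = 0;
        for (int i = 1; i < numbers.Length; i++)
            if (numbers[i] > numbers[index]) index = i;
        return ref numbers[index];
    }

    public static void Main()
    {
        int[] data = { 3, 9, 4 };
        ref int biggest = ref Largest(data);
        biggest = 0; // writes through the reference into the array
        Console.WriteLine(string.Join(",", data)); // 3,0,4
    }
}
```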

DemiMarie commented 8 years ago

My thoughts:

Unfortunately, I don't know if all of them could handle the requirements of .NET. Failing that, I think that a fully concurrent, non-compacting mark/sweep collector might help. I believe that the new garbage collector proposed for LuaJIT represents the state of the art: although it does not consider parallelism or concurrency, it seems to me to be a design that could be adapted to this situation.

With regard to multicore scaling: I do not believe that a single shared GC heap can scale unless the GC is aware of it AND memory allocation rates can be controlled (to avoid becoming memory bandwidth bound).

Sadly, I don't think C# can ever become as fast as Rust, except in areas such as memory allocation. Rust was designed to be both fast and correct, and had to trade off some usability to do it, such as with explicit lifetime tracking.

SunnyWar commented 8 years ago

Read this: Introducing a new, advanced Visual C++ code optimizer

I did a simple test to see if .Net performs even one of the most basic strength-reduction optimizations, n % 2 → n & 1. Nope.
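For reference, the blind substitution is only valid for unsigned (or known non-negative) operands, which is part of why the JIT has to be careful here; for signed ints the two expressions disagree on negative values. A small illustration:

```csharp
using System;

public static class ModVsAnd
{
    // For unsigned values, n % 2 can be strength-reduced to n & 1.
    public static uint ParityUnsigned(uint n) => n & 1;

    public static void Main()
    {
        uint n = 7;
        Console.WriteLine(n % 2 == (n & 1)); // True: same result for unsigned

        // For signed ints the substitution is not valid as-is:
        int m = -3;
        Console.WriteLine(m % 2); // -1 (C# remainder takes the dividend's sign)
        Console.WriteLine(m & 1); // 1
    }
}
```

A compiler can still optimize the signed case, but it takes a sign fix-up, not a bare AND.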

So we can talk about all kinds of things to make the algorithms, data structures, and language faster, but it seems to me that a more fundamental problem needs to be addressed first.

The C++ teams and the .Net teams need to share knowledge. Roslyn and the CLI should look at every optimization that the C++ compiler and linker make and figure out how, if possible, to get them into .Net. Until that happens, .Net will always be the slow stepchild to C++.

paul1956 commented 8 years ago

This is just my guess, but the C++ compiler has always been pushed/helped by the Intel optimizing C++ compiler (and I know the reverse is true); no such competition exists for .Net. Intel has (had?) whole teams working to get every last cycle out of their CPUs in order to beat competing architectures, and it was cheaper/faster to make the compiler better than the CPU.

GSPP commented 8 years ago

@SunnyWar that is true. Another one is that a.x + a.x is translated to two loads instead of one. Some of the most basic optimizations are not implemented.
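Until the JIT performs that kind of redundant-load elimination reliably, the workaround is to hoist the field into a local, making the single load explicit in the source; a sketch (class names are illustrative):

```csharp
using System;

public static class LoadHoisting
{
    public class Node { public int x; }

    // The JIT may reload a.x for each use here (two loads).
    public static int TwiceField(Node a) => a.x + a.x;

    // Hoisting into a local guarantees one load; the second use
    // reads the local, which the JIT keeps in a register.
    public static int TwiceLocal(Node a)
    {
        int x = a.x;
        return x + x;
    }

    public static void Main()
    {
        var n = new Node { x = 21 };
        Console.WriteLine(TwiceField(n)); // 42
        Console.WriteLine(TwiceLocal(n)); // 42
    }
}
```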

RyuJIT brought loop cloning which is a big win. This needs to be acknowledged as well. It's not all bad.

Given that C compilers differ by only a few percent (±20% on real code?) in performance, I doubt the Intel C compiler was a suitable tool for beating the competition.

benaadams commented 8 years ago

C++ compilation also takes a very, very long time

temporaryfile commented 8 years ago

NGEN is supposed to do all the work on the dev's computer so it doesn't have to use up CPU on the user's computer. Yet NGEN doesn't optimize much of anything. It's a tragedy.

.NET's original development seems to have been lax in general. They won't say this, but I suspect the whole move to "universalize" the core and replace the Win10 version with WinRT underneath is based on a grossly non-scaling Framework implementation, like I/O that uses event signalling instead of IOCP packets.

HaloFour commented 8 years ago

@playsomethingsaxman

The "ngen" tool has nothing to do with the developer's computer. It's supposed to generate cached optimized images on the user's computer during an installation process.

If "ngen" or the JIT is not performing basic optimizations then that should be considered either a bug or a feature request of the CLR, not of Roslyn. It's not Roslyn's place to perform aggressive optimizations. Doing so would be counterproductive as it would make it more difficult for the JIT to do its job. IL expresses intent, not implementation.

SunnyWar commented 8 years ago

@HaloFour I disagree...kinda. If Roslyn can produce a result that is easily highly optimizable, it should do so. The .NET team has said over and over and over that they care more about how fast the JIT loads the assembly than they do about how fast the executable runs. Until that changes, the JIT is NOT going to do the kind of optimization we all want.

So...that leaves us with a big hole. The JIT isn't going to do it. Roslyn thinks it's not its problem...so nothing happens.

And the consumer gets sub-optimal programs, and the server runs sub-optimal code. The cost of server farms running .Net apps goes up. The number of ASP.Net servers you need goes up. The number of Exchange servers is higher than necessary. It's a lose-lose situation.

HaloFour commented 8 years ago

@SunnyWar

As mentioned in https://github.com/dotnet/roslyn/issues/11268#issuecomment-218826517, high-ranking members of the Roslyn team were on the Java team when bytecode optimizations were removed from javac because they were demonstrably counterproductive.

GSPP commented 8 years ago

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

But the C# compiler already breaks that rule by optimizing nullables in a variety of ways. This is painting over the JIT's inabilities. I don't question that choice; I think it's sensible.

I don't think this discussion thread can avoid JIT quality entirely because JIT quality influences C# design decisions. (Example: Better escape analysis in the JIT could make C# more lenient towards introducing allocations.)

temporaryfile commented 8 years ago

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

I cannot believe I'm even hearing this. You know the saying: garbage in, garbage out. You can optimize a lot of garbage into really high-quality garbage, or optimize non-garbage. I prefer the latter.