daokoder / dao

Dao Programming Language
http://daoscript.org

Enhancement: VM optimizations (compilation, GC, etc.) #332

Open dumblob opened 9 years ago

dumblob commented 9 years ago

Recently I stumbled upon a feature of GCC/Clang that allows marking functions so that their first two arguments are passed in registers (fastcall), or optimizing memory copying/alignment, etc. This is useful for functions called very often, in tight loops, for memory allocation functions, etc. See https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html .

Even if it seems it's not yet time for optimizations (i.e. ifdefs), I wanted to make a public note about it.
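A minimal sketch of how such markings could be wrapped (the macro and function names here are hypothetical, not from the Dao sources). Note that regparm() is only meaningful on 32-bit x86, so it is guarded accordingly:

```c
/* Hedged sketch: GCC/Clang function attributes for hot paths.
   DAO_HOT and DAO_REGCALL are invented names for illustration. */
#if defined(__GNUC__)
#define DAO_HOT __attribute__((hot))   /* optimize this function more aggressively */
#else
#define DAO_HOT
#endif

#if defined(__GNUC__) && defined(__i386__)
#define DAO_REGCALL __attribute__((regparm(2)))  /* pass first 2 args in registers */
#else
#define DAO_REGCALL
#endif

DAO_HOT DAO_REGCALL static int dao_add(int a, int b)
{
    return a + b;
}
```

On other compilers the macros expand to nothing, so the code stays portable.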

Night-walker commented 9 years ago

I think it's probably redundant -- the compiler should do that anyway with -O3 etc. Also, it might be unsuitable for exported functions because it modifies the calling convention.

dumblob commented 9 years ago

-O3 is never used in production environments (not even ATLAS uses it), so manually marking functions makes sense.

Also, it might be unsuitable for exported functions because it modifies the calling convention.

Sure, fastcall was just an example; as I pointed out, it's about manually marking certain functions with suitable attributes (not making it the default for all).

Night-walker commented 9 years ago

-O3 is never used in production environments

Why? What is used instead?

dumblob commented 9 years ago

Why?

Because it's dangerous (e.g. programmer-optimized multithreaded code with volatile and static variables is a nightmare) and sometimes considerably slower than -O2 (e.g. the aforementioned ATLAS).

What is used instead?

-O2

Night-walker commented 9 years ago

And why do you think it's not automatically optimized at -O2? If there is a firm benefit, GCC will most likely try to apply the optimization anyway. fastcall doesn't seem to be very important, however -- if the function is small, it can just be inlined; otherwise the registers will quite likely be occupied, so the arguments will still have to go to stack.

dumblob commented 9 years ago

Actually not. E.g. inlining is in -O3, and function attributes (e.g. fastcall) exist because the compiler has no way to determine which function does what (and therefore it's not part of -O2). The same applies to Clang.

otherwise the registers will quite likely be occupied, so the arguments will still have to go to stack.

This is btw not so true for modern processors, as they have plenty of "general purpose" registers (and I'm not talking only about ARM). It's true that most/all arguments will be in some of the CPU caches, so the speedup won't be so extreme from this point of view. But on the other hand, using registers means leaving space in the caches for other values, and that matters. All in all, attributes (including fastcall) definitely make sense. We can easily measure the impact on our synthetic benchmarks.

Night-walker commented 9 years ago

Actually not. E.g. inlining is in -O3, and function attributes (e.g. fastcall) exist because the compiler has no way to determine which function does what (and therefore it's not part of -O2). The same applies to Clang.

-O2 implies -finline-small-functions.

This is btw not so true for modern processors, as they have plenty of "general purpose" registers (and I'm not talking only about ARM). It's true that most/all arguments will be in some of the CPU caches, so the speedup won't be so extreme from this point of view. But on the other hand, using registers means leaving space in the caches for other values, and that matters. All in all, attributes (including fastcall) definitely make sense. We can easily measure the impact on our synthetic benchmarks.

If it's that simple, I guess the compilers should again handle that by themselves, no? I don't remember seeing those optimization attributes used anywhere, including in high-performance low-level libraries.

dumblob commented 9 years ago

-O2 implies -finline-small-functions.

Yes, small functions, but not frequently called functions => hence function attributes added manually.

If it's that simple, I guess the compilers should again handle that by themselves, no?

I don't think it's that simple, as the compiler doesn't know anything about the frequencies of calls, copying, etc. It only does a dumb static analysis.

I don't remember seeing those optimization attributes used anywhere, including in high-performance low-level libraries.

Apparently those low-level libs were not high-performance enough, or they don't bother with the 5-10% speedup, or have it on their todo list :) But serious virtual machines (JVM...) and system kernels (Linux, BSD, Haiku...) use it quite a lot. It's also very useful for JIT compilation (even though we currently have JIT only for numerical expressions, in the future we might want to extend it, and we would benefit from such markings).

Night-walker commented 9 years ago

Apparently those low-level libs were not high-performance enough

Well, the Intel DPDK library is designed to be very high-performance, and it's actually much faster than the built-in Linux kernel network drivers it replaces. And it's damn serious :) Its developers don't seem to bother using anything besides inline for function optimization.

dumblob commented 9 years ago

Yep, I like DPDK :) But as you might have noticed, it's mostly about shuffling fixed-size data and doing arithmetic calculations. Nothing less, nothing more. I doubt function attributes or manual inlining would produce noticeably faster code.

Night-walker commented 9 years ago

Actually, for now it may be sufficient to simply mark some functions (like the one for utf-8 decoding) as inline -- it's a standard C99 feature, and it may make a difference even at -O3.
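As a sketch of that C99 inline suggestion: a small, frequently called helper like the one below (a hypothetical UTF-8 lead-byte classifier, not Dao's actual decoder) can be defined static inline in a header so every caller can inline it even at -O2:

```c
/* Length of a UTF-8 sequence given its lead byte, 0 for an invalid lead.
   Small and hot, hence a good candidate for static inline in a header. */
static inline int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;              /* ASCII fast path */
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;                               /* continuation or invalid byte */
}
```

Being static inline, the definition can live in a header without violating the one-definition rule, and each translation unit gets an inlinable copy.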

daokoder commented 9 years ago

I don't think we can gain much performance improvement by tuning compiler flags; -O2 is already doing a very good job.

I think the best way to improve VM performance is to understand where the overheads are and figure out how to reduce them. In DaoVM, the main overheads are in the following places:

Given what I understood, little can be done at the moment. Anyway, I am not concerned with performance now, since it is already reasonably fast (compared with other scripting languages, of course).

dumblob commented 9 years ago

I don't think we can gain much performance improvement by tuning compiler flags

I think it could be about 5-10%.

the main overheads are in the following places: ...

According to my understanding of the source code, each of these, except for the runtime type checking, concerns the garbage collector and/or (de)allocation. I'm curious about caching techniques, as I haven't studied garbage collectors in any depth.

Given what I understood, little can be done at the moment.

Yes, as I said in the first post - I just wanted to make a public note to keep track of it.

daokoder commented 9 years ago

I think it could be about 5-10%.

Yes, this is a reasonable estimate. But at the current stage of development, I am not very enthusiastic about such a speedup; I prefer more :)

I'm curious about caching techniques

I have experimented a couple of times with adding object caching, but the gain was too little to justify the additional complexity, so I discarded those experiments. However, it is possible that I wasn't doing it the right way, so caching techniques are still on the table for future exploration.

Yes, as I said in the first post - I just wanted to make a public note to keep track of it.

Right, this could be helpful.

dumblob commented 9 years ago

Btw the recently released version of Ruby uses a yet better GC algorithm. I think it's worth reading to get some inspiration: https://bugs.ruby-lang.org/issues/10137 .

daokoder commented 9 years ago

Btw the recently released version of Ruby uses a yet better GC algorithm. I think it's worth reading to get some inspiration: https://bugs.ruby-lang.org/issues/10137 .

Actually, Dao has been using a concurrent and incremental GC for years; it was also made generational a few years ago. I believe that, in terms of GC implementation, Dao is way ahead of Ruby (and maybe many other scripting languages as well) :)

dumblob commented 9 years ago

Oops, I somehow missed that it's incremental. Great news :)

daokoder commented 9 years ago

More precisely, it is incremental when the program is running in single threaded mode, and switches to concurrent GC once the program starts to run in multithreaded mode. It is generational in both scenarios.

Night-walker commented 9 years ago

More precisely, it is incremental when the program is running in single threaded mode, and switches to concurrent GC once the program starts to run in multithreaded mode. It is generational in both scenarios.

I can only suggest inspecting the GC structure and maybe other global stuff for signs of false sharing. It may become a substantial holdback in achieving near-to-linear growth of performance with multiple concurrent threads.

daokoder commented 9 years ago

I can only suggest inspecting the GC structure and maybe other global stuff for signs of false sharing.

For the GC structure, not much can be done here, because the GC structure is mostly accessed by the GC thread alone. Besides the locks, the only field that may be accessed frequently by mutator threads is the idleList field. I think moving it somewhere else won't help much with the current implementation: even if we remove false sharing of this field between the GC thread and the mutator threads, there is still true sharing of this field between different mutator threads.

dumblob commented 9 years ago

Besides the locks, the only field that may be accessed frequently by mutator threads is the idleList field.

We can try to check the impact of false sharing by aligning every void pointer in the idleList to the cache line size.
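A minimal sketch of that idea (the cache-line size and struct name are assumptions, not taken from the Dao sources): pad each slot to its own cache line with C11 alignment so neighbouring slots never share a line.

```c
#include <stdalign.h>

#define CACHE_LINE 64  /* assumed cache-line size; 64 bytes on most x86 CPUs */

/* Hypothetical layout: each idle-list slot occupies a full cache line,
   so two threads touching neighbouring slots no longer false-share. */
typedef struct {
    alignas(CACHE_LINE) void *value;
} PaddedSlot;
```

The trade-off is memory: each pointer now costs a full cache line instead of 8 bytes, so this only pays off if the contention is real, which should be confirmed with a profiler first.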

Night-walker commented 9 years ago

Also, some low-level optimization can be achieved by using likely() and unlikely() for branch prediction when compiling with GCC:

#ifdef __GNUC__

#ifndef likely
#define likely(x)  __builtin_expect(!!(x), 1)
#endif

#ifndef unlikely
#define unlikely(x)  __builtin_expect(!!(x), 0)
#endif

#else
#define likely(x) (x)
#define unlikely(x) (x)
#endif

It can be used particularly well for frequently performed error checks: type conformance, NULL checks, etc. Might make a difference for tight loops, depending on the hardware.
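For instance, a hedged sketch of such a check (the function is invented for illustration; the macros are repeated so the snippet is self-contained):

```c
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hot-path NULL check: the error branch is marked unlikely so the
   compiler lays out the common path as the fall-through. */
static int checked_length(const char *s)
{
    if (unlikely(s == 0)) return -1;  /* rare error path */
    int n = 0;
    while (*s++) n++;
    return n;
}
```

The annotation only biases code layout and static prediction; on hardware with good dynamic branch predictors the gain tends to be small, which matches the "depending on the hardware" caveat above.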

Night-walker commented 9 years ago

Clang also supports __builtin_expect, by the way.

dumblob commented 9 years ago

I realized that the guys from Go finally decided to speed up their GC and make it "predictable" in terms of latency. Their plan (already half-realized in the current Go 1.4) is worth reading (https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/preview?sle=true&pli=1), and the referenced book The Garbage Collection Handbook: The Art of Automatic Memory Management is basically missing only the following, and I quote:

Finally, what is missing from the book? We have only considered automatic techniques for memory management embedded in the run-time system. Thus, even when a language specification mandates garbage collection, we have not discussed in much depth other mechanisms for memory management that it may also support. The most obvious example is the use of 'regions' [Tofte and Talpin, 1994], most prominently used in the Real-Time Specification for Java. We pay attention only briefly to questions of region inferencing or stack allocation and very little at all to other compile-time analyses intended to replace, or at least assist, garbage collection. Neither do we address how best to use techniques such as reference counting in the client program, although this is popular in languages like C++. Finally, the last decade has seen little new research on distributed garbage collection. In many ways, this is a shame since we expect lessons learnt in that field also to be useful to those developing collectors for the next generation of machines with heterogeneous collections of highly non-uniform memory architectures. Nevertheless, we do not discuss distributed garbage collection here.

daokoder commented 9 years ago

I realized that the guys from Go finally decided to speed up their GC and make it "predictable" in terms of latency.

There are three scenarios for Dao GC:

In the first two scenarios, the latency caused by GC is predictable and minor, or almost so if user-defined C types do not do wild things. The last scenario is rare, or at least can be avoided by non-naive programming. So I think the GC is good enough for now (though something could be improved for the blocking scenario).

dumblob commented 9 years ago

To make matters worse, I realized that we don't use the C99 keyword restrict anywhere in the Dao source code. This might make a huge difference (an order of magnitude or more) for tight loops, or basically anywhere we work with pointers (especially with array-like or non-primitive types). restrict prevents extensive (often useless) memory loads and stores, because it allows the compiler to optimize them out completely and use CPU registers much more (not to mention the CPU caches).

I also suspect that the poor performance in the numeric benchmark pidigits (and maybe also in the binary-trees) compared to Python might be very much improved by the use of the restrict keyword.

There are nice articles about the restrict keyword:

GC might benefit from restrict quite a lot (also in the context of false sharing mentioned above).
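A minimal sketch of the effect (an illustrative function, not from the Dao sources): with restrict-qualified pointers the compiler may assume x and y never alias, keep loaded values in registers, and vectorize the loop.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume a store through y[i] could
   modify x[i+1] and therefore reload x every iteration; with restrict
   it is free to keep values in registers and use SIMD. */
static void saxpy(size_t n, float a,
                  const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The caller then promises the two arrays do not overlap; passing overlapping pointers to a restrict-qualified function is undefined behavior, so the annotation should only go on functions where non-aliasing is guaranteed by design.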

dumblob commented 8 years ago

Today I read some details about Ulterior reference counting and it sounds really attractive with regard to mutations being the bottleneck. URC is also described in The Garbage Collection Handbook I mentioned above in https://github.com/daokoder/dao/issues/332#issuecomment-73374111 .

dumblob commented 8 years ago

On newer x86 and x86_64 CPUs (Core 2 Duo, i3, i5), we might gain at least a 5x speed-up for repeated division (e.g. in tight loops) by using http://libdivide.com/ instead of the hardware division instructions, and on older or smaller (embedded) CPUs an even higher speed-up (well over 10x). According to my rough measurements on a Core 2 Duo and an i5, the speed-up was about 5x in the slowest scenarios and up to about 11x at best. I would like to test libdivide on an i7, because it should have a much faster HW divider.

As of today (2017-03-15), it seems this is not implemented in low-level compilers (C/C++ ...), because it requires advanced static analysis (see https://github.com/ridiculousfish/libdivide/issues/32#issuecomment-286642827 ). The presence of invar in conjunction with tight loops makes it easier to deploy libdivide at a higher level - namely directly in the Dao compiler. What's left is just the decision of which loops should use libdivide, as there is a small precomputation cost, itself involving one division (up to a 128-bit one :cry:).

The main added value is that a tight loop with a division inside cannot be vectorized (which effectively turns off all SIMD optimizations - for several years now clearly the main source of CPU performance increases: MMX, SSE, SSE2, AVX, AVX2, AVX-512, NEON, etc.). The method in libdivide eliminates HW division completely and thus allows aggressive use of vectorization.
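To illustrate the underlying trick (this is the general reciprocal-by-multiplication idea that libdivide builds on, not its actual API): for an invariant divisor d >= 2, a 64-bit magic number can be precomputed once, after which every 32-bit division becomes a multiply plus a shift. The GCC/Clang __int128 extension is assumed.

```c
#include <stdint.h>

/* Precomputed reciprocal for an invariant divisor d in [2, 2^32). */
typedef struct { uint64_t magic; } FastDiv;

static FastDiv fastdiv_gen(uint32_t d)
{
    /* magic = ceil(2^64 / d); one real division, paid only once */
    FastDiv fd = { UINT64_MAX / d + 1 };
    return fd;
}

/* Exact n / d for every 32-bit n: take the high 64 bits of n * magic. */
static uint32_t fastdiv_do(uint32_t n, FastDiv fd)
{
    return (uint32_t)(((unsigned __int128)fd.magic * n) >> 64);
}
```

In a tight loop, fastdiv_gen runs once before the loop and fastdiv_do replaces each div instruction in the body; the multiply-high also vectorizes, which a hardware division never does.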

dumblob commented 7 years ago

Following up on GCC optimization, there seems to be an unpleasant bug in GCC - namely, it doesn't produce good code for argument passing to functions at all (in comparison to Clang). It can be worked around by setting -fschedule-insns and -fschedule-insns2 explicitly (they are part of -O2, but that does not seem to take effect).

dumblob commented 7 years ago

Another good optimization is profile-guided optimization (I saw a 7% performance gain for database applications in 2009; in a VM I expect a definitely higher percentage - e.g. in Python it was 10% in the worst case with GCC in 2014).

Of course, there should be some Dao test code to execute to build the profile. But this code should be just plain Dao without any modules (as those would need special tests), so it shouldn't be that difficult. Actually, I expect it to be rather fun - like detective work on where the bottlenecks are (GC, code section calling/execution, routine calling, the bytecode main loop, tuple copying as in the binary-trees benchmark, tasklet execution, channel message gathering/passing/copying, etc.) and what should be run more frequently. I would say The Computer Language Benchmarks Game is a good starting point (with emphasis on the slowest-performing benchmarks).

Also, as Wikipedia notes, this is additionally a very important part of the JIT in the Java VM, so in the case of the DaoJIT module (be it based on a manually written JIT or an existing JIT engine like the one in LLVM), we might make dynamic recompilation based on runtime profiling one of the main goals of DaoJIT.

The downside of profiling in GCC was that there were some issues with corrupted profiles from multithreaded programs. But this should be solved by now.