dumblob opened 3 years ago
Thanks for the inputs! I haven't seen this before. There is a lot to explore...
Thanks for the mimalloc suggestion! I've migrated to it, which allowed me to use pthreads. I got a speedup linear in the number of threads; that is, 8 threads run 8x faster. This is mind-blowing.
That is awesome. I'm looking forward to what you'll find out next. Kind is really promising!
Btw. I didn't study the `clang -O3` assembly output, but I'd guess there are `call` instructions used for non-inlined function calls. This seems to be one of the (negative) side effects of using C as a compilation target.
Feel free to take a look at Basil, where they try not to emit `call` instructions at all. Instead, they use their own clever management of stack data and then work with the `rsp` register (or its equivalent on non-x86 architectures) directly instead of using `call`. This results in quite noticeable performance gains.
Such things might (or might not) bring something to Kind. IDK.
I just read the comments in `/fast_c_runtime/busy.c` and would like to respond to some of them :wink:. Very good questions, btw.!
Basic pthreads are quite slow, but they are good enough to begin with because they are portable (later I'd try using the work-stealing C code that the Nim compiler generates from Weave). Regarding thread-safe stacks, that's exactly what TLS (thread-local storage) exists for, and guess what, pthreads of course support it :wink:! It's not as fast as the stack, but "almost that fast".
Yes, please avoid a custom memory manager at all costs for any data stored on the heap or in thread-local storage. Take a look at e.g. mimalloc to get a glimpse of the issues you'd run into when rolling your own. So clearly avoid writing your own, and use the system default or bundle mimalloc (it's MIT licensed).
Actually I'd recommend using 64-bit integers (though I'm not sure how your tagging optimization would perform), judging from experience with the speed of hashing functions in Smhasher. In Smhasher one can clearly observe that 32-bit ints perform quite slowly on 64-bit machines (nowhere near the "double" speed one might naively expect), while 64-bit ints perform surprisingly fast on 32-bit machines (almost as fast as 32-bit ints; don't ask me why, but it's the "magic" I was talking about in another Kind thread :wink:).
Now the question is rather whether 64-bit integers should be used in addition to 32-bit integers or instead of them. This one is tough. Maybe measurements will tell us more?
I don't know of any rule of thumb or any good generic measurements for this; I think it absolutely has to be measured. But I'd guess that structs of exactly the size of a cache line (or maybe even half of one, but probably not smaller, because then thread contention would become too high) would perform faster than bitwise operators. But again, I'd certainly measure this with some existing computation-intensive Kind program.
I don't know, though I'd like to point out once more that if Kind wants to be a truly practical language, its speed has to be much closer to C/Go/Rust/V than to that of functional languages :wink:.
Btw. note that LLVM IR can express things that the C standard can't (especially handy for performance!), so maybe going one level lower could help with these specific demands of Kind? On the other hand, don't expect any multicore parallelism from LLVM IR; there is literally none (and I don't know of any plans to add it), so you'll need a work-stealing scheduler or pthreads anyway.