VictorTaelin / Treeduce


Ideas #1

Open dumblob opened 3 years ago

dumblob commented 3 years ago

I just read the comments in /fast_c_runtime/busy.c and would like to comment on some of them :wink:.

Very good questions, btw!

// The goal is to find ways to optimize this file in order to make it return
// faster and thus increase the number of rewrites per second. It would be
// amazing to use pthreads. For example, each call to (slow 20) is technically
// independent, so, in theory, using pthreads to reduce each one in a core
// should make this file almost 2x faster. But in order to do so, the allocator
// must use thread-safe stacks, or something equivalent.

Basic pthreads are quite slow, but for a start they are good enough because they are portable (later I'd try the work-stealing C source code generated by the Nim compiler from Weave). Regarding thread-safe stacks, that's exactly what thread-local storage (TLS) exists for - and guess what, pthreads of course support it :wink:! It's not as fast as the stack, but "almost that fast".
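Just to illustrate the idea, a minimal sketch of per-thread allocation stacks via C11 `_Thread_local` (the names `node_stack`, `worker`, etc. are hypothetical, not from busy.c):

```c
// Sketch: one allocation stack per thread, so threads never contend on the allocator.
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_CAP (1u << 20)

// Each thread gets its own free stack; alloc/free on it needs no locks.
static _Thread_local uint32_t *node_stack = NULL;
static _Thread_local size_t    node_top   = 0;

static void *worker(void *arg) {
  (void)arg;
  node_stack = malloc(STACK_CAP * sizeof(uint32_t));
  node_top   = 0;
  // ... reduce one independent term here, e.g. (slow 20) ...
  free(node_stack);
  return NULL;
}

int main(void) {
  pthread_t t[8];
  for (int i = 0; i < 8; i++) pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
  return 0;
}
```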

Should I avoid writing my own memory manager and use malloc/free instead?

Yes, please avoid a custom memory manager at all costs for any data stored on the heap or in thread-local storage. Take a look at mimalloc, for example, to get a glimpse of the issues you'd face rolling your own. So clearly avoid writing your own and use the system default, or bundle mimalloc (it's MIT-licensed).
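mimalloc can be dropped in transparently (linking against it or via LD_PRELOAD overrides malloc/free), or used explicitly through its API. A tiny sketch of the explicit route:

```c
// Sketch: explicit mimalloc API; mi_malloc/mi_free are thread-safe and use
// per-thread heaps internally, which is exactly what a pthreads runtime wants.
#include <mimalloc.h>
#include <stdint.h>

int main(void) {
  uint64_t *nodes = mi_malloc(1024 * sizeof(uint64_t));
  // ... build and rewrite terms here ...
  mi_free(nodes);
  return 0;
}
```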

Should I use integer types other than uint32_t?

Actually, I'd recommend using 64-bit integers (though I'm not sure how your tagging optimization would perform), judging by experience with the speed of hashing functions in Smhasher. There one can clearly observe that 32-bit ints perform quite slowly on 64-bit machines (nowhere near the "double" speed one might naively expect), while 64-bit ints perform surprisingly well on 32-bit machines (almost as fast as 32-bit ints - don't ask me why; it's the "magic" I was talking about in another Kind thread :wink:).

Now the question is rather whether 64-bit integers should be used in addition to the 32-bit ones or instead of them. This one is tough - maybe measurements will tell us more?
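For concreteness, one hypothetical 64-bit tagged-word layout to benchmark against the current 32-bit scheme (the 8/56 split and the names are just an assumption for illustration):

```c
// Sketch: 64-bit word with an 8-bit tag and a 56-bit payload.
// Whether this beats the 32-bit encoding is exactly what needs measuring.
#include <stdint.h>

typedef uint64_t Ptr;

#define TAG_SHIFT 56
#define VAL_MASK  ((1ULL << TAG_SHIFT) - 1)

static inline Ptr      mk_ptr (uint64_t tag, uint64_t val) { return (tag << TAG_SHIFT) | (val & VAL_MASK); }
static inline uint64_t get_tag(Ptr p)                      { return p >> TAG_SHIFT; }
static inline uint64_t get_val(Ptr p)                      { return p & VAL_MASK; }
```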

Should I use structs (with bitfields?) for pointers, instead of doing bitwise operators?

I don't know of any rule of thumb or any good generic measurements on this. I think this one absolutely has to be measured. But I'd guess that structs of exactly the size of a cache line (or maybe even half of it - probably not smaller, because then thread contention would become too high) would perform faster than using bitwise operators. Again, this is something I'd measure with an existing computation-intensive Kind program.
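The two candidate encodings one would race against each other look roughly like this (field widths here are hypothetical):

```c
// Sketch: bitfield struct vs. plain integer with explicit shifts/masks.
#include <stdint.h>

// Option A: bitfield struct; the compiler emits the shift/mask code for you.
typedef struct {
  uint32_t tag : 8;
  uint32_t val : 24;
} PtrStruct;

// Option B: plain integer with explicit bitwise operators.
typedef uint32_t PtrBits;
#define PTR_TAG(p) ((p) >> 24)
#define PTR_VAL(p) ((p) & 0x00FFFFFFu)
```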

Is there anything else I'm doing "wrong" that could be improved?

I don't know, though I'd like to point out once more that if Kind wants to be a really practical language, its speed has to be much closer to C/Go/Rust/V than to functional languages :wink:.

Btw., note that LLVM IR can express things the C standard can't (especially handy for performance!), so maybe going one level lower could help with these specific demands of Kind. On the other hand, don't expect any multicore parallelism from LLVM IR - there is literally none (and I don't know of any plans to add it), so you'll need a work-stealing scheduler or pthreads either way.

VictorTaelin commented 3 years ago

Thanks for the inputs! I haven't seen this before. There is a lot to explore...

VictorTaelin commented 3 years ago

Thanks for the mimalloc suggestion - I've migrated to it, which allowed me to use pthreads. I got a linear speedup in the number of threads; that is, 8 threads = 8x faster. This is mind-blowing.

dumblob commented 3 years ago

That is awesome. I'm looking forward to what you'll find out next. Kind is really promising!

dumblob commented 3 years ago

Btw., I didn't study the clang -O3 assembly output, but I'd guess there are call instructions for non-inlined function calls. This seems to be one of the (negative) side effects of using C as the compilation target.

Feel free to take a look at Basil, where they try not to emit call instructions, instead using their own clever management of stack data through the rsp register (or its equivalent on non-x86 architectures). This yields quite significant performance gains.

Such things might (or might not) bring something to Kind. IDK.
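Not Basil's technique, but a much simpler C-level knob in a similar direction is forcing hot helpers to be inlined so no call instruction is emitted for them at all (a sketch; whether it matters here would need checking against the actual -O3 output):

```c
// Sketch: force inlining of small hot helpers so clang -O3 emits no call for them.
#include <stdint.h>

static inline __attribute__((always_inline))
uint64_t get_tag(uint64_t ptr) {
  return ptr >> 24;
}

// To inspect what remains:
//   clang -O3 -S busy.c -o busy.s   (then grep for "call" in busy.s)
```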