andrewrk / zig-async-demo

Comparing concurrent code example programs between other languages and Zig

Add comparison of factorial in Go and Zig #1

Closed bhansconnect closed 4 years ago

bhansconnect commented 4 years ago

I added a version of factorial and parallel factorial in Zig and Go. The Zig version runs a lot faster, but that is mostly because gmp tends to be a lot faster than Go's BigInt implementation. I added a linear and parallel version of both tests just to show the gains from running in parallel.

I left out the README because I thought it would be better for you to generate the speed comparisons on your own machine. You may need to increase n depending on how fast your computer is. These are the times I get with the current scripts:

- fact-linear-go: 12.90s
- fact-go: 8.21s
- fact-linear-zig: 1.38s
- fact-zig: 0.61s

Also, linear Zig runs as fast as linear C.
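For readers curious what the linear/parallel split looks like, here is a minimal Go sketch using math/big (illustrative only, not the code from this repo; the function names are made up): the range 1..n is cut into contiguous chunks, each chunk's product is computed in its own goroutine, and the partial products are combined at the end.

```go
package main

import (
	"fmt"
	"math/big"
	"sync"
)

// rangeProduct computes lo * (lo+1) * ... * hi serially.
func rangeProduct(lo, hi int64) *big.Int {
	p := big.NewInt(1)
	for i := lo; i <= hi; i++ {
		p.Mul(p, big.NewInt(i))
	}
	return p
}

// parallelFactorial splits [1, n] into `parts` contiguous ranges,
// multiplies each range in its own goroutine, then combines the
// partial products.
func parallelFactorial(n, parts int64) *big.Int {
	results := make([]*big.Int, parts)
	var wg sync.WaitGroup
	chunk := n / parts
	for i := int64(0); i < parts; i++ {
		lo := i*chunk + 1
		hi := (i + 1) * chunk
		if i == parts-1 {
			hi = n // last range absorbs the remainder
		}
		wg.Add(1)
		go func(idx, lo, hi int64) {
			defer wg.Done()
			results[idx] = rangeProduct(lo, hi)
		}(i, lo, hi)
	}
	wg.Wait()
	// Combining the partial products is serial; this final
	// multiplication is the bottleneck discussed later in the thread.
	total := big.NewInt(1)
	for _, r := range results {
		total.Mul(total, r)
	}
	return total
}

func main() {
	fmt.Println(parallelFactorial(20, 4)) // 20! = 2432902008176640000
}
```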

Oh, also, the Zig parallel version does not currently compile without minor modifications to the compiler. In os.zig, it sometimes tries to return a c_int as a usize, which throws errors. It is a super simple fix. If I have time, I will open an issue and submit a pull request to fix it.

Finally, I left out printing because it is super slow and really skews the results. Go's BigInt prints horridly slowly for some reason (even when just printing to /dev/null).

andrewrk commented 4 years ago

Nice! I'll play with this and type up a README and merge it.

We could also try using std.math.big.Int (cc @tiehuis)

I welcome a PR to zig with the os.zig fixes, or you can also just comment here with a diff if you don't care about commit authorship and want to just let me handle it.

bhansconnect commented 4 years ago

Can't believe I missed that Zig already has big ints. Now I feel kinda silly. I'll update it sometime tomorrow and give you the information about the os.zig fixes.

bhansconnect commented 4 years ago

Zig's BigInt really needs better multiplication (not surprising), but at least it works. Currently it is totally impractical for multiplying very large integers. Here are the times:

- fact-linear-zig: 667.84s
- fact-zig: 728.75s

The parallel one was slower because of the multiplication ordering and how much longer multiplication takes as the number of digits grows.

I decided to keep the gmp versions and also upload the raw Zig versions.

Would it be worth using gmp in the Zig standard library given how optimized it is? I feel like that may be a bad idea because there is no way to force gmp to use one of Zig's allocators. Otherwise, there is just a lot of eventual optimization to do for big integers to work well (not that that is a major concern right now given how young the language is). I might take a crack at copying how Rust implements it. They don't use gmp, and the implementation isn't walls of assembly and defines.

Also, here are the changes to the compiler.

andrewrk commented 4 years ago

> Also, here are the changes to the compiler.

Merged in https://github.com/ziglang/zig/commit/f749bf094249bc6b48e3359dd99715ae7fec176e, thank you very much.

I've got plans today but hoping to merge this & do a writeup on Sunday :-)

tiehuis commented 4 years ago

@bhansconnect The current multiplication implementation only uses the standard basecase algorithm. Implementing the Karatsuba algorithm would be a good start to improving this performance. It isn't too complicated either.

You can read more about these methods here: https://gmplib.org/manual/Multiplication-Algorithms.html. I also documented some of the Rust multiplication code, which may be another good source.

https://github.com/rust-num/num-bigint/blob/e0eb3b44f7eed81d8dac22dc786e97566efb1786/src/algorithms.rs#L230

bhansconnect commented 4 years ago

I created a pull request with Karatsuba.

Also, realized that everything runs a lot faster with the c_allocator instead of direct_allocator.

With both of the above, here are the updated times:

- fact-linear-zig: 135s
- fact-zig: 125s

For the parallel version, about 80-90% of the time is spent on one core doing the final multiplication.

andrewrk commented 4 years ago

> Also, realized that everything runs a lot faster with the c_allocator instead of direct_allocator.

Oh. Yeah, I should probably rename this. I've noticed quite a few people not understanding it based on the name. direct_allocator is defined to do an mmap for every allocation.

bhansconnect commented 4 years ago

Since this was my first project with Zig, I just copied your use of direct_allocator from the other files in this repo. Also, allocators are weird to get used to, but they are really cool when it comes to error handling.

bhansconnect commented 4 years ago

Realized something very important that increases speed: it is much better to split the problem into powers of 2. I have 12 cores, so the work was split into 12 parts, but it runs about 4 times faster when split into 8 parts. This is because the final multiplication ends up being 400000 digits by 400000 digits instead of 300000 digits by 600000 digits.
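This balancing idea can be pushed all the way down: recursively halving the range keeps both operands of every multiplication at roughly equal digit counts, which is the favorable case for Karatsuba-style algorithms. A minimal Go sketch of binary-splitting range product (illustrative only; the function name is made up):

```go
package main

import (
	"fmt"
	"math/big"
)

// splitProduct computes lo * (lo+1) * ... * hi by recursively
// halving the range. Each multiplication then combines two partial
// products of roughly equal size, avoiding the lopsided final
// multiply (e.g. 300000 x 600000 digits) that an uneven split causes.
func splitProduct(lo, hi int64) *big.Int {
	if lo > hi {
		return big.NewInt(1) // empty range
	}
	if lo == hi {
		return big.NewInt(lo)
	}
	mid := (lo + hi) / 2
	left := splitProduct(lo, mid)
	right := splitProduct(mid+1, hi)
	return left.Mul(left, right)
}

func main() {
	fmt.Println(splitProduct(1, 10)) // 10! = 3628800
}
```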

Here are the times now:

- fact-gmp-zig: 0.54s
- fact-linear-gmp-zig: 1.37s
- fact-go: 6.97s
- fact-linear-go: 12.93s
- fact-zig: 33.27s
- fact-linear-zig: 132.35s

andrewrk commented 4 years ago

@bhansconnect I'm happy to do this writeup, but I don't want to steal your thunder. Are you interested in write access to this repository and "owning" this particular subdirectory?

bhansconnect commented 4 years ago

I am open to that. I will be unable to touch it for the next week though. Just was a fun weekend side project to mess around with.

Zig definitely seems like a cool project, and compilers are fun to mess with, so I hopefully will be able to help with more things in the future.

andrewrk commented 4 years ago

Alright, cool. I'll do an initial writeup, and then give you write access so you can then do whatever edits or rewriting you feel inspired to do.

andrewrk commented 4 years ago

Wow, haha, fact.zig is really fun to watch on my CPU graph display :grin:

andrewrk commented 4 years ago

OK see what you think @bhansconnect: https://github.com/andrewrk/zig-async-demo/tree/master/factorial

I'm excited to see what you think especially about the "await" variant that I added.

You have write access to this repo now.

bhansconnect commented 4 years ago

I really like how the async/await version looks. It is so much cleaner. Sadly, it just can't take advantage of the same level of parallelism because of how it splits up the problem.

andrewrk commented 4 years ago

Yeah, that was an interesting find. I think the straightforward async/await approach will normally be best, though. Imagine that you were doing more work in your application besides this one factorial problem. You'd probably call beginCpuBoundOperation() only once, before the call to factorial(300000), and then you'd probably have enough other CPU-bound work to give your other cores something to do. In that case, you avoid the overhead of breaking your work into too-small pieces.

andrewrk commented 4 years ago

Oh, I want to point out one more thing too: the async/await version has prevented stack overflow. With the linear version, you can provide a large enough input to overflow the stack. But the async/await version heap-allocates its recursive stack frames.

Even in the blocking version, it's planned to eventually require breaking call graph cycles in this way. This will work by forcing one function inside a call graph cycle to be async, which means you can then heap-allocate its frame to break the cycle.

bhansconnect commented 4 years ago

I am really interested to see how a lot of your allocator and memory stuff pans out. It is really unique to Zig. Hopefully it can prevent large classes of bugs without much overhead. It is definitely one of the weird pieces when first messing with Zig.