BLAKE3-team / BLAKE3

The official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

Implement RVV backend #372

Closed: silvanshade closed this 4 months ago

oconnor663 commented 5 months ago

I have less free time for code reviews than I used to, so apologies in advance for taking a while to get to this. You might be interested in an RVV assembly implementation that I've been working on here: https://github.com/BLAKE3-team/BLAKE3/blob/guts_api/rust/guts/src/riscv_rva23u64.S. Unfortunately that branch is tied to a large refactoring, which makes it hard for me to land it in master.

silvanshade commented 5 months ago

@oconnor663 Oh cool, I didn't realize there was already some implementation work for RVV.

I'll probably give it a closer look soon, but out of curiosity: what state is it in? Any idea about its performance characteristics, or anything else interesting to note?

Also, have you done any work on an SVE backend?

oconnor663 commented 5 months ago

(I just pushed a commit to clean up some function names, so you might need to refresh the page if you still have that .S file open.)

My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under QEMU, so I've never done any real benchmarks, but it is passing tests. The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2. There was also a minor perf regression in AVX-512 that I'll need to track down. Then there are loose ends to tie up around, e.g., MSVC-flavored assembly.

Most of the heavy lifting in the parallel implementation (which is what really matters for performance) is in blake3_guts_riscv_rva23u64_kernel, but that code is pretty straightforward without any significant open questions. There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.
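
To make the layout question concrete, here's a scalar C sketch (illustrative only; `transpose_blocks`, `NUM_BLOCKS`, and the helper names are made up for this example, not from the repo) of the rearrangement that a segmented load like `vlsseg8e32.v` performs: gathering word `w` of every input block into one contiguous "vector" so the kernel can mix all blocks in parallel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the layout change a segmented load performs.
 * A 64-byte BLAKE3 block is 16 little-endian 32-bit words; the parallel
 * kernel wants word w of every block gathered into one vector register
 * (lane b = block b). NUM_BLOCKS is an arbitrary choice here. */
#define WORDS_PER_BLOCK 16
#define NUM_BLOCKS 4

static void transpose_blocks(const uint32_t blocks[NUM_BLOCKS][WORDS_PER_BLOCK],
                             uint32_t lanes[WORDS_PER_BLOCK][NUM_BLOCKS]) {
    for (int w = 0; w < WORDS_PER_BLOCK; w++)
        for (int b = 0; b < NUM_BLOCKS; b++)
            lanes[w][b] = blocks[b][w]; /* word w of block b -> lane b */
}
```

The open hardware question is exactly the one above: whether letting the load unit do this gather in one instruction beats plain unit-stride loads followed by an in-register transpose, since segmented loads may crack into many micro-ops on some implementations.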

I haven't tried ARM SVE yet, no. (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

silvanshade commented 5 months ago

> My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under QEMU, so I've never done any real benchmarks, but it is passing tests.

Interesting. Thanks for the information.

I've also been doing most of my experimentation under QEMU. I did recently get hold of a Pioneer (SG2042), but it only supports RVV 0.7.1, and I haven't even tried to get tooling working with that yet (in fact I've barely gotten it to boot, heh). But it might be interesting to try to adapt what you have (sans Zbb/Zvbb and whatever else is missing).

> The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2.

I'd be interested in helping with that effort if you'd like. If you could give me some pointers on where to start, I'd certainly take a look.

> There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.

Yeah, I noticed that. Seemed interesting. I'm also wondering how that will work out.

> I haven't tried ARM SVE yet, no.

I was kind of looking for an interesting project to try something VLA-related, but since it seems like you've mostly solved the RVV side, maybe I'll give SVE a try instead.

> (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

I actually made an attempt to finish the missing parts for the NEON implementation at https://github.com/BLAKE3-team/BLAKE3/pull/369. I'm certainly not an expert though and this was my first real attempt using NEON for anything.

As you suggested though, implementing compress didn't make any practical difference. I tried a few different approaches there, but nothing seemed to help. I'm guessing it will be hard to get better performance without some sort of more fundamental redesign, but I don't know what that would look like. I suspect all the shuffling in particular is hard to make efficient on NEON.
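
For context on where that shuffling comes from: the rotation amounts (16, 12, 8, 7) in BLAKE3's G mixing function are fixed by the spec, and in a NEON port each rotate typically becomes a shift/shift/OR pair, except the 16- and 8-bit rotations, which can often be done more cheaply as byte permutes. A scalar sketch of G (my paraphrase of the spec, not the repo's code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch of BLAKE3's G mixing function. v is the 16-word state,
 * a/b/c/d index the four lanes this call mixes, and mx/my are the two
 * message words fed into this G call. Rotation amounts are per the spec. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

static void g(uint32_t v[16], int a, int b, int c, int d,
              uint32_t mx, uint32_t my) {
    v[a] = v[a] + v[b] + mx;  v[d] = rotr32(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];       v[b] = rotr32(v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + my;  v[d] = rotr32(v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];       v[b] = rotr32(v[b] ^ v[c], 7);
}
```

The 16-bit rotate maps to a halfword-reverse permute and the 8-bit rotate to a byte-table lookup on NEON, while 12 and 7 need the shift/OR sequence; how those interleave with the adds and XORs is where the scheduling gets tricky.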

One thing I was thinking about, for better performance on Apple Silicon at least, is trying an implementation in Metal that uses the unified memory model to avoid the latency issues that made the Vulkan implementation (and a SYCL version I saw elsewhere) not very usable.

Another thing I've been wondering about is whether it might be possible to use the AMX coprocessor for some parts of the algorithm, perhaps genlut in particular.

Anyway, interesting stuff. Let me know if there's some way I can help with that branch or maybe if you have some suggestions for other ideas worth exploring.