FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0
2.67k stars 652 forks source link

Add RISC-V vector spec v1.0 support #279

Open sh-zheng opened 2 years ago

sh-zheng commented 2 years ago

Enable rvv 1.0 with --enable-rvv. Correctness tests for FP32 and FP64 are done with bench --verify on qemu.

rdolbeau commented 2 years ago

Maybe you could first submit to https://github.com/rdolbeau/fftw3/tree/riscv-v where the code was originally developed with the EPI intrinsics.

The RVV implementation (and the SVE implementation upon which it is based in https://github.com/rdolbeau/fftw3/tree/arm-sve-clean) adds a lot of codelets by using a 'fixed-width' model, which may not be ideal for code-size and planning time. But a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.

sh-zheng commented 2 years ago

@rdolbeau The code is indeed a prototype and need further development. I'd love to submit it to your repo first, and we can make some optimizations on it later. I found that the branch https://github.com/rdolbeau/fftw3/tree/riscv-v has not updated since 2020 and maybe out of date with the official mainline. Could you creat a new branch from mainline, where the code is latest to submit?

rdolbeau commented 2 years ago

@sh-zheng Indeed I need to bring the branch up-to-date (or create a new, clean one), I'll try to do that ASAP.

sh-zheng commented 2 years ago

@rdolbeau Thank you so much. A new, clean branch may be better, since the official rvv intrinsics are stable enough for development.

rdolbeau commented 2 years ago

@sh-zheng Sorry for the delay, been quite busy; riscv-v-clean should be up-to-date with master and just a couple of consolidated commits (rdcycle & the V support).

sh-zheng commented 2 years ago

@rdolbeau Thank you very much for the new branch. I will submit my codes into it as soon as possible.

nick-knight commented 1 year ago

What is the plan? For @sh-zheng to PR their changes to @rdolbeau's fork, and then PR that here?

a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.

My suggestion is to contribute the fixed-width RVV support first, and then propose optimizations in follow-up patches. A fixed SIMD width seems to be a fundamental assumption of FFTW's code-gen, and breaking this assumption may require invasive changes. Additionally, both RVV and SVE could benefit from supporting scalable vectors, and it might be preferable to consider both use-cases at the same time, to ensure the changes don't preclude one ISA or the other.

sh-zheng commented 1 year ago

@nick-knight I agree. Maybe we can submit the fixed-width version first. The submit of scalable version is still in progress and may not be perfect soon.

JustToEnjoy commented 1 year ago

@rdolbeau @sh-zheng 1, I fetch branch https://github.com/rdolbeau/fftw3/tree/riscv-v-clean, and build it using C906 toolchain, but it will happen error when I --enable-r5v, Have you encountered this problem?

r5v.c:28:12: warning: implicit declaration of function 'builtin_epi_vsetvl'; did you mean 'builtin_iceil'? [-Wimplicit-function-declaration] 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~~~~~~~ | __builtin_iceilr5v.c:28:12: warning: nested extern declaration of 'builtin_epi_vsetvl' [-Wnested-externs]r5v.c:28:40: error: 'epi_e64' undeclared (first use in this function) 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~~r5v.c:28:40: note: each undeclared identifier is reported only once for each function it appears inr5v.c:28:51: error: 'epi_m1' undeclared (first use in this function) 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~r5v.c:29:3: warning: control reaches end of non-void function [-Wreturn-type] 29 | }

2,I remove --enable-r5v, it can build pass, but it can't work in C906, if I call any fftw API(for example, fftw_plan_dft_1d), the program will get stuck. No any error been print.

These two problems make me so consufed. Could you help me?

rdolbeau commented 1 year ago

@JustToEnjoy Unfortunately on my branch the code still uses EPI syntax (the European Processor Initiative project, which developed some early support for a set V intrinsics internally). It needs updating to 'standard' intrinsics. @sh-zheng had a prototype but I'm not sure targeting which set of intrinsics, or which version(s) of which compilers (llvm, gcc) would have support.

nick-knight commented 1 year ago

SiFive has successfully built FFTW from this PR branch: it passes tests, and we are pointing our customers at it. We used an in-house LLVM toolchain, but I don't think it differs significantly from upstream LLVM w.r.t. the necessary intrinsics.

rdolbeau commented 1 year ago

@nick-knight My suggestion to @sh-zheng was to merge on my branch (riscv-v-clean) so we keep the history, then merge to upstream (here). Unfortunately it seems our calendar didn't align it seems :-(

nick-knight commented 1 year ago

@rdolbeau I have done my best to acknowledge your contribution in customer engagements. If @sh-zheng follows through on your suggestion, then I assume they will close this PR and you will open a new one. When that happens, I will start pointing customers at the new one. Or, even better, the maintainer will accept the PR and we can just use master :)

sh-zheng commented 1 year ago

So sorry for the delay since I'm in some troubles, and may not continue to move forward with the project :-( . I'd appreciate it if someone could take the project forward.

The project is now in such a state, that:

  1. If we don't very care about the vector-length adaptability of the code, the PR could be submited right now, to support the vector length no greater than 1024.
  2. If we really care the vector-length adaptability of the code, the PR should be merged into rdolbeau's branch, and need more development as the SVE version. But this may need more time.

My suggestion is that, since the main-line of the fftw has not supported the flexible vector-length, we could submit this version, as a preliminary implementation of fixed vector length, and at least make it usable. The version of flexible vector-length could be developed later, and be submitted in another PR. @nick-knight @rdolbeau

rdolbeau commented 1 year ago

@sh-zheng @nick-knight For unrelated reason I recently opened a PR for SVE (#315). I'll try to take a look at @sh-zeng modifications and see if I can merge them on my RVV branch to follow-up, as they are somewhat similar in terms of behavior and issues.

sh-zheng commented 1 year ago

Great! I think the main different between the two PRs is the simd-common.h. Maybe they can be merged and submitted synchronously. @rdolbeau

OMaghiarIMG commented 7 months ago

Hello @sh-zheng, think you might want to add dft/simd/rvv/.c and rdft/simd/rvv/.c to .gitignore.

sh-zheng commented 7 months ago

@OMaghiarIMG Thank you for the review. A new version has been updated.

rdolbeau commented 1 week ago

At last I've been able to move my own branch from the old EPI intrinsics to the current set of RISC-V V1.0 intrinsics. I've made a release package to ease testing (no need for ocaml and maintainer mode!).

It is able to compile and pass checks using either clang-18 (from Debian sid) or gcc-14 (recompiled from FSF source), running on a Banana Pi F3 SBC which features RISC-V V1.0 using 256-bits registers. It's not very fast and only has 4 GiB of RAM, but it's a lot faster than Qemu :-)

@sh-zheng branch also pass the same set of tests on the same hardware. Interestingly, the performance results I get are highly dependent on the compiler: with clang-18 my branch appear to be a bit faster, while with gcc-14 @sh-zheng branch is a bit faster. There's probably quite a bit of analytics needed to figure out what's the best way (or a reasonable compromise...) of doing single-vector/interleaved format using RISC-V V. The many different micro-architectures that are appearing will make trade-offs in the source a lot more complex than for x86-64 or aarch64...

Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?

rofirrim commented 1 week ago

Hi @rdolbeau,

Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?

I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits

I was wondering also if you could use a segmented tuple load/store and then extract the different vectors? Check https://www.godbolt.org/z/5h1Wj7dEf as an example of a naive complex multiplication in case it is useful.

Hope this helps.

rdolbeau commented 1 week ago

I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits

Unfortunately, that would require compiling each file with a different set of options. Not impossible, but certainly a burden to add to the build system...

OTOH, my memory is failing me; while I did use an array or struct, for RISC-V V with the EPI compilers, it was a tuple type: https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L73 . It should be possible to do using the current vfloat64m2_t tuple type as V, and not require any special support of feature... I need to investigate this. Only half as many registers are available, but for smaller transform the smaller overhead my be a benefit.

... or, as I just saw in the godbolt link you added, the tuple type vfloat32m1x2_t, which I didn't know existed. Plenty of options actually :-)

I was wondering also if you could use a segmented tuple load/store and then extract the different vectors?

Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?

rofirrim commented 1 week ago

Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?

Yes, they are now part of base V.

rdolbeau commented 5 days ago

New release package from my branch here.

Edit: updated from 002 to 003 (should fix F32 issue).