Open sh-zheng opened 2 years ago
Maybe you could first submit to https://github.com/rdolbeau/fftw3/tree/riscv-v where the code was originally developed with the EPI intrinsics.
The RVV implementation (and the SVE implementation upon which it is based in https://github.com/rdolbeau/fftw3/tree/arm-sve-clean) adds a lot of codelets by using a 'fixed-width' model, which may not be ideal for code-size and planning time. But a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.
@rdolbeau The code is indeed a prototype and need further development. I'd love to submit it to your repo first, and we can make some optimizations on it later. I found that the branch https://github.com/rdolbeau/fftw3/tree/riscv-v has not updated since 2020 and maybe out of date with the official mainline. Could you creat a new branch from mainline, where the code is latest to submit?
@sh-zheng Indeed I need to bring the branch up-to-date (or create a new, clean one), I'll try to do that ASAP.
@rdolbeau Thank you so much. A new, clean branch may be better, since the official rvv intrinsics are stable enough for development.
@sh-zheng Sorry for the delay, been quite busy; riscv-v-clean should be up-to-date with master and just a couple of consolidated commits (rdcycle & the V support).
@rdolbeau Thank you very much for the new branch. I will submit my codes into it as soon as possible.
What is the plan? For @sh-zheng to PR their changes to @rdolbeau's fork, and then PR that here?
a "scalable" implementation (full-width and/or dynamically masked) would require an adapted infrastructure, as currently the SIMD solvers require a known width at compile-time.
My suggestion is to contribute the fixed-width RVV support first, and then propose optimizations in follow-up patches. A fixed SIMD width seems to be a fundamental assumption of FFTW's code-gen, and breaking this assumption may require invasive changes. Additionally, both RVV and SVE could benefit from supporting scalable vectors, and it might be preferable to consider both use-cases at the same time, to ensure the changes don't preclude one ISA or the other.
@nick-knight I agree. Maybe we can submit the fixed-width version first. The submit of scalable version is still in progress and may not be perfect soon.
@rdolbeau @sh-zheng 1, I fetch branch https://github.com/rdolbeau/fftw3/tree/riscv-v-clean, and build it using C906 toolchain, but it will happen error when I --enable-r5v, Have you encountered this problem?
r5v.c:28:12: warning: implicit declaration of function 'builtin_epi_vsetvl'; did you mean 'builtin_iceil'? [-Wimplicit-function-declaration] 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~~~~~~~ | __builtin_iceilr5v.c:28:12: warning: nested extern declaration of 'builtin_epi_vsetvl' [-Wnested-externs]r5v.c:28:40: error: 'epi_e64' undeclared (first use in this function) 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~~r5v.c:28:40: note: each undeclared identifier is reported only once for each function it appears inr5v.c:28:51: error: 'epi_m1' undeclared (first use in this function) 28 | return builtin_epi_vsetvl(rs/64, epi_e64, epi_m1) >= (rs/64); | ^~r5v.c:29:3: warning: control reaches end of non-void function [-Wreturn-type] 29 | }
2,I remove --enable-r5v, it can build pass, but it can't work in C906, if I call any fftw API(for example, fftw_plan_dft_1d), the program will get stuck. No any error been print.
These two problems make me so consufed. Could you help me?
@JustToEnjoy Unfortunately on my branch the code still uses EPI syntax (the European Processor Initiative project, which developed some early support for a set V intrinsics internally). It needs updating to 'standard' intrinsics. @sh-zheng had a prototype but I'm not sure targeting which set of intrinsics, or which version(s) of which compilers (llvm, gcc) would have support.
SiFive has successfully built FFTW from this PR branch: it passes tests, and we are pointing our customers at it. We used an in-house LLVM toolchain, but I don't think it differs significantly from upstream LLVM w.r.t. the necessary intrinsics.
@nick-knight My suggestion to @sh-zheng was to merge on my branch (riscv-v-clean) so we keep the history, then merge to upstream (here). Unfortunately it seems our calendar didn't align it seems :-(
@rdolbeau I have done my best to acknowledge your contribution in customer engagements. If @sh-zheng follows through on your suggestion, then I assume they will close this PR and you will open a new one. When that happens, I will start pointing customers at the new one. Or, even better, the maintainer will accept the PR and we can just use master
:)
So sorry for the delay since I'm in some troubles, and may not continue to move forward with the project :-( . I'd appreciate it if someone could take the project forward.
The project is now in such a state, that:
My suggestion is that, since the main-line of the fftw has not supported the flexible vector-length, we could submit this version, as a preliminary implementation of fixed vector length, and at least make it usable. The version of flexible vector-length could be developed later, and be submitted in another PR. @nick-knight @rdolbeau
@sh-zheng @nick-knight For unrelated reason I recently opened a PR for SVE (#315). I'll try to take a look at @sh-zeng modifications and see if I can merge them on my RVV branch to follow-up, as they are somewhat similar in terms of behavior and issues.
Great! I think the main different between the two PRs is the simd-common.h. Maybe they can be merged and submitted synchronously. @rdolbeau
Hello @sh-zheng, think you might want to add dft/simd/rvv/.c and rdft/simd/rvv/.c to .gitignore.
@OMaghiarIMG Thank you for the review. A new version has been updated.
At last I've been able to move my own branch from the old EPI intrinsics to the current set of RISC-V V1.0 intrinsics. I've made a release package to ease testing (no need for ocaml and maintainer mode!).
It is able to compile and pass checks using either clang-18 (from Debian sid) or gcc-14 (recompiled from FSF source), running on a Banana Pi F3 SBC which features RISC-V V1.0 using 256-bits registers. It's not very fast and only has 4 GiB of RAM, but it's a lot faster than Qemu :-)
@sh-zheng branch also pass the same set of tests on the same hardware. Interestingly, the performance results I get are highly dependent on the compiler: with clang-18 my branch appear to be a bit faster, while with gcc-14 @sh-zheng branch is a bit faster. There's probably quite a bit of analytics needed to figure out what's the best way (or a reasonable compromise...) of doing single-vector/interleaved format using RISC-V V. The many different micro-architectures that are appearing will make trade-offs in the source a lot more complex than for x86-64 or aarch64...
Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?
Hi @rdolbeau,
Also, an alternate implementation would be to update and test the 'split' code, which uses two vectors for complex (one for real, one for imaginary). However this requires the V macro to hold both scalable vectors in a structure or array, which I'm not sure is supported in current compilers (the EPI compiler from the BSC does [based on clang], or a least did at some point). @rofirrim, do you know in which compiler(s) (upstream clang, upstream gcc, BSC's clang) there would be support for a struct or an array of scalable types?
I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits
I was wondering also if you could use a segmented tuple load/store and then extract the different vectors? Check https://www.godbolt.org/z/5h1Wj7dEf as an example of a naive complex multiplication in case it is useful.
Hope this helps.
I seem to recall you would be able to do that if it is acceptable to you to specify a minimum vector size (similar to what SVE does). Check: https://clang.llvm.org/docs/AttributeReference.html#riscv-rvv-vector-bits
Unfortunately, that would require compiling each file with a different set of options. Not impossible, but certainly a burden to add to the build system...
OTOH, my memory is failing me; while I did use an array or struct, for RISC-V V with the EPI compilers, it was a tuple type: https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L73 . It should be possible to do using the current vfloat64m2_t tuple type as V, and not require any special support of feature... I need to investigate this. Only half as many registers are available, but for smaller transform the smaller overhead my be a benefit.
... or, as I just saw in the godbolt link you added, the tuple type vfloat32m1x2_t, which I didn't know existed. Plenty of options actually :-)
I was wondering also if you could use a segmented tuple load/store and then extract the different vectors?
Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?
Segmented load was the goal, it did work in simulation back then (https://github.com/rdolbeau/fftw3/blob/46763868ee386f7a94cb5863afa70519a8123d24/simd-support/simd-r5v-split.h#L216). IIRC they were once an extension, do I read the specifications right and they are actually in the base V extension now?
Yes, they are now part of base V.
New release package from my branch here.
Edit: updated from 003 to 004
Enable rvv 1.0 with --enable-rvv. Correctness tests for FP32 and FP64 are done with bench --verify on qemu.