Closed: joy2myself closed this issue 1 month ago
Hi, thanks for your interest. May I ask what your interest is in GDAL and/or RISC-V? Perhaps you're affiliated with a RISC-V founder or some group that promotes its adoption? I ran an informal poll on my Mastodon account in https://mastodon.social/@EvenRouault/113344940167220826. 22 people responded: 0% use RISC-V currently, 9% might, and 91% will presumably never. I would be really reluctant to have RISC-V specific code paths in our code base.
I would be much more supportive of RISC-V optimizations going through the use of an abstraction software layer. I see that libjxl uses https://github.com/google/highway and that it has RISC-V support. That would also enable us to cover other platforms like NEON / ARM64.
Currently we have a few specific SSE/SSE2/AVX2 code paths using Intel intrinsics, either directly, or through a thin abstraction layer such as gcore/gdalsse_priv.h. I'm undecided whether adopting highway would totally deprecate those code paths, or whether we would keep them. It all depends on whether we can reach the same level of performance, and also on how we deal with the external dependency.
The main candidates for accelerated code paths are alg/gdalwarpkernel.cpp, gcore/overview.cpp and the CopyWord related code of gcore/rasterio.cpp.
Hi @rouault,
Thank you for the detailed response! Let me introduce myself first—I’m Yin Zhang (张尹), from the Programming Language and Compilation Technology Lab (PLCT Lab) at the Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences. We are members of the RISC-V Foundation and actively involved in promoting its development. Additionally, we have some non-public projects that would benefit from using GDAL on the RISC-V platform, where performance is a key concern.
I personally have experience in various SIMD and vector-related optimizations, including RISC-V vector optimizations for OpenCV (https://github.com/opencv/opencv/commits/4.x/?author=joy2myself). I’m also working on the implementation of the <experimental/simd>
header for the libc++ standard library (https://github.com/llvm/llvm-project/commits/main/?author=joy2myself). I fully understand your caution regarding platform-specific code. If adding RISC-V specific code paths is not desirable for the GDAL upstream, we may consider maintaining a downstream fork to suit our project needs.
Alternatively, we could discuss potential frameworks for upstream optimizations in GDAL. Based on my experience in the SIMD field, I see three primary approaches for SIMD optimizations in most foundational libraries:
2. Unified abstraction layers: libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions. These layers are portable across platforms and easy to use. However, this approach often requires sacrificing certain platform-specific features to ensure a unified and generic abstraction interface. As a result, it is generally not possible to achieve the highest performance in all use cases and across all target platforms. They may also introduce external dependencies.

Each approach has its pros and cons, and the choice often depends on the specific needs and practical circumstances of the project. Of course, you are far more familiar with the specific requirements and real-world conditions of GDAL than I am.
Looking forward to hearing your thoughts!
Best, Yin Zhang
2. Unified abstraction layers: Libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions.
My own inclination would go to that. Which approach is the preferred one remains to be determined. Is experimental/simd a sort of staging area for evolutions of the C++ standard library? What is its status? The GDAL project is rather conservative, and I don't think we would want to adopt a C++ feature that hasn't been officially adopted and that lacks at least one solid implementation. Perhaps the topic is not yet mature enough to be considered for GDAL.
Platform-specific code would, for me, fall into the https://gdal.org/en/latest/development/rfc/rfc85_policy_code_additions.html category. The GDAL project has unfortunately seen a lot of contributors over time "dump" their code upstream and run away afterwards, leading to even more work for maintainers.
Any choice should probably go through the RFC route: https://gdal.org/en/latest/development/rfc/index.html
Custom hardware acceleration layer
I had initiated a very primitive sort of that with gcore/gdalsse_priv.h, but it is more a convenient way of using SSE intrinsics with C++ than an intended cross-architecture abstraction layer. Other libs such as Highway, xsimd, etc. have likely done a much better job at this.
Hi @rouault,
Regarding the status of <experimental/simd>: yes, I think it can be understood as a staging area for evolutions of the C++ standard. Indeed, as the name suggests, <experimental/simd> is currently under the experimental namespace, reflecting its development stage. Once it matures, it will likely be moved to the std::simd namespace for standardized usage. At present, there is a usable implementation of <experimental/simd> in the libstdc++ library of the GCC compiler (starting from GCC 11.2 and above, you can directly include the header and use it), and I am currently working on another implementation within the LLVM/clang libc++ library.
I fully understand the upstream position regarding platform-specific code. After internal discussions with my team, we will carefully evaluate and determine our plan. There seem to be two possible directions at this point:
1. Maintain a downstream fork with RISC-V specific code paths to suit our project needs.
2. Use an abstraction layer such as highway for optimizations. In this case, we could submit an RFC to the upstream community and push forward with the optimization implementation, while also using the optimized version to meet our project needs.

Thank you again for your detailed and thoughtful response. It has been very helpful in shaping our direction.
FYI, in https://github.com/OSGeo/gdal/pull/11202 , I've used the sse2neon.h header that works very well. Not sure if there's a similar sse2rvv.h ;-)
There is: sse2rvv and neon2rvv
But I wouldn't recommend using them for more than a quick initial port, because they don't allow you to take advantage of the full vector length. You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic, generic SIMD.
From what I've seen of the codebase, I would recommend successively adding custom RVV codepaths, because the SIMD usage seems to be mostly in isolated places.
due to the absence of access to that hardware, either locally or with continuous integration platforms as provided by GitHub workflows
Some RVV 1.0 hardware is already available, see "Processors with RVV 1.0": https://camel-cdr.github.io/rvv-bench-results/index.html
You can just use qemu in the GitHub CI. That's even better than real hardware, because you can configure it to use different vector lengths and adjust some other implementation details.
or falls behind bugfixes
Yeah, that could happen if you don't have the capacity to maintain it. Hopefully problems would get caught if tests are run in CI.
See for example the RVV support that now is in gnuradio/volk for an example CI setup.
You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic, generic SIMD.
I don't know RVV specifics, but for Intel, in the few times I've compared SSE2 vs AVX2 in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".
From what I've seen of the codebase, I would recommend successively adding custom RVV codepaths, because the SIMD usage seems to be mostly in isolated places.
Did you identify specific places where that would be beneficial? The measured runtime speedup versus implementation & maintenance cost would have to be assessed case by case.
I don't know RVV specifics, but for Intel, in the few times I've compared SSE2 vs AVX2 in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".
The difference for RVV should be larger: x86 CPUs try to keep SSE fast because of legacy code, whereas RVV implementations tend not to specifically optimize for operations below their full vector length.
Did you identify specific places where that would be beneficial? The measured runtime speedup versus implementation & maintenance cost would have to be assessed case by case.
No, I didn't; I didn't know about this project before I found this issue. I just wanted to suggest how I'd approach adding RVV optimizations.
Feature description
First off, thanks for all the amazing work on GDAL! I wanted to ask if there are any plans to optimize GDAL for the RISC-V platform, specifically using the RISC-V Vector Extension (RVV). With RISC-V gaining popularity, having RVV optimizations could potentially bring performance benefits to GDAL on that platform.
If there’s no plan yet, would this be something you’d consider? My team and I would be interested in contributing if there’s a need for testing or development in this area.
Thanks!