OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Any plans for RISC-V Vector Extension (RVV) optimization? #11063

Closed · joy2myself closed 1 month ago

joy2myself commented 1 month ago

Feature description

First off, thanks for all the amazing work on GDAL! I wanted to ask if there are any plans to optimize GDAL for the RISC-V platform, specifically using the RISC-V Vector Extension (RVV). With RISC-V gaining popularity, RVV optimizations could bring performance benefits to GDAL on that platform.

If there’s no plan yet, would this be something you’d consider? My team and I would be interested in contributing if there’s a need for testing or development in this area.

Thanks!

Additional context

No response

rouault commented 1 month ago

Hi, thanks for your interest. May I ask what your interest in GDAL and/or RISC-V is? Perhaps you're affiliated with the RISC-V Foundation or some group that promotes its adoption? I ran an informal poll on my Mastodon account in https://mastodon.social/@EvenRouault/113344940167220826. 22 people responded: 0% use RISC-V currently, 9% might, and 91% presumably never will. I would be really reluctant to have RISC-V specific code paths in our code base: due to the absence of access to that hardware, either locally or with continuous integration platforms as provided by GitHub workflows, there is a significant risk that such code gets broken or falls behind bugfixes.

I would be much more supportive of RISC-V optimizations going through a software abstraction layer. I see that libjxl uses https://github.com/google/highway and that it has RISC-V support. That would also enable us to cover other platforms like NEON / ARM64.

Currently we have a few specific SSE/SSE2/AVX2 code paths using Intel intrinsics, either directly or through a thin abstraction layer such as gcore/gdalsse_priv.h. I'm undecided whether adopting highway would totally deprecate those code paths, or whether we would keep them. It all depends on whether we can reach the same level of performance, and on how we would deal with the external dependency.

The main candidates for accelerated code paths are alg/gdalwarpkernel.cpp, gcore/overview.cpp, and the CopyWord-related code in gcore/rasterio.cpp.

joy2myself commented 1 month ago

Hi @rouault,

Thank you for the detailed response! Let me introduce myself first—I’m Yin Zhang (张尹), from the Programming Language and Compilation Technology Lab (PLCT Lab) at the Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences. We are members of the RISC-V Foundation and actively involved in promoting its development. Additionally, we have some non-public projects that would benefit from using GDAL on the RISC-V platform, where performance is a key concern.

I personally have experience in various SIMD and vector-related optimizations, including RISC-V vector optimizations for OpenCV (https://github.com/opencv/opencv/commits/4.x/?author=joy2myself). I’m also working on the implementation of the <experimental/simd> header for the libc++ standard library (https://github.com/llvm/llvm-project/commits/main/?author=joy2myself). I fully understand your caution regarding platform-specific code. If adding RISC-V specific code paths is not desirable for the GDAL upstream, we may consider maintaining a downstream fork to suit our project needs.

Alternatively, we could discuss potential frameworks for upstream optimizations in GDAL. Based on my experience in the SIMD field, I see three primary approaches for SIMD optimizations in most foundational libraries:

  1. Platform-specific code: This involves using native intrinsics or inline assembly for each SIMD instruction set. While this approach offers the best performance, it lacks portability, requiring multiple versions of the optimized code for different platforms.
  2. Unified abstraction layers: Libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions. These layers are portable across platforms and easy to use. However, this approach often sacrifices certain platform-specific features to keep the abstraction interface unified and generic, so it is generally not possible to reach the highest performance in every use case and on every target platform (see the sketch after this list contrasting approaches 1 and 2). Such layers may also introduce an external dependency.
  3. Custom hardware acceleration layer: Similar to OpenCV's universal intrinsics, this approach involves designing a custom abstraction layer for the specific algorithms in the library, then providing platform-specific implementations of that layer for each target. This offers both portability and high performance, but it requires significant resources to develop and maintain the custom abstraction layer. Additionally, such a layer may be tailored to the needs of a specific library and might not be as generic as other SIMD abstraction solutions.
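
To make the contrast concrete, below is a minimal, hypothetical sketch (not code from GDAL) of the same scaling kernel written per approach 1 with raw SSE2 intrinsics and per approach 2 with Google Highway; the Highway version assumes the library's simpler static-dispatch mode:

    // Approach 1: fixed-width SSE2 intrinsics; x86-only, 4 floats per step.
    #include <emmintrin.h>
    #include <cstddef>

    void ScaleSSE(float* p, std::size_t n, float s) {
        const __m128 vs = _mm_set1_ps(s);
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), vs));
        for (; i < n; ++i) p[i] *= s;  // scalar tail
    }

    // Approach 2: the same kernel with Google Highway, written once and
    // compiled for SSE2/AVX2/NEON/RVV; the lane count is not hardcoded.
    #include <hwy/highway.h>
    namespace hn = hwy::HWY_NAMESPACE;

    void ScaleHwy(float* p, std::size_t n, float s) {
        const hn::ScalableTag<float> d;  // lane count chosen per target
        const auto vs = hn::Set(d, s);
        std::size_t i = 0;
        for (; i + hn::Lanes(d) <= n; i += hn::Lanes(d))
            hn::StoreU(hn::Mul(hn::LoadU(d, p + i), vs), d, p + i);
        for (; i < n; ++i) p[i] *= s;  // scalar tail
    }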

Each approach has its pros and cons, and the choice often depends on the specific needs and practical circumstances of the project. Of course, you are far more familiar with the specific requirements and real-world conditions of GDAL than I am.

Looking forward to hearing your thoughts!

Best, Yin Zhang

rouault commented 1 month ago

2. Unified abstraction layers: Libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions.

My own inclination would go to that. Which approach is the preferred one remains to be determined. Is experimental/simd a sort of staging area for evolutions of the C++ standard library? What is its status? The GDAL project is rather conservative, and I don't think we would want to adopt a C++ feature that hasn't been officially adopted and doesn't have at least one solid implementation. Perhaps the topic is not yet mature enough to be considered for GDAL.

Platform-specific code would fall for me into the https://gdal.org/en/latest/development/rfc/rfc85_policy_code_additions.html category. The GDAL project has unfortunately seen a lot of contributors over time "dump" their code upstream and run away afterwards, leading to even more work for maintainers.

Any choice should probably go through the RFC route: https://gdal.org/en/latest/development/rfc/index.html

Custom hardware acceleration layer

I had initiated a very primitive sort of that with gcore/gdalsse_priv.h, but it is more a convenient way of using SSE intrinsics with C++ than an intended cross-architecture abstraction layer. Other libs such as Highway, xsimd, etc. have likely done a much better job at this.
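
For illustration, here is a hypothetical fragment in the spirit of such a thin wrapper (not the actual contents of gdalsse_priv.h): operator overloading makes the intrinsics readable in C++ without attempting cross-architecture abstraction.

    #include <emmintrin.h>

    // Hypothetical sketch of a thin SSE2 convenience wrapper of the kind
    // described above; names and scope are illustrative only.
    struct Reg2Double {
        __m128d v;  // two packed doubles in one XMM register
        static Reg2Double Load(const double* p) { return {_mm_loadu_pd(p)}; }
        void Store(double* p) const { _mm_storeu_pd(p, v); }
        Reg2Double operator+(const Reg2Double& o) const { return {_mm_add_pd(v, o.v)}; }
        Reg2Double operator*(const Reg2Double& o) const { return {_mm_mul_pd(v, o.v)}; }
    };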

joy2myself commented 1 month ago

Hi @rouault,

Regarding the status of <experimental/simd>: yes, it can be understood as a staging area for evolutions of the C++ standard. As the name suggests, <experimental/simd> currently lives in the experimental namespace, reflecting its development stage. Once it matures, it will likely move to the std::simd namespace for standardized usage. At present, there is a usable implementation of <experimental/simd> in GCC's libstdc++ (starting from GCC 11.2, you can directly include the header and use it), and I am currently working on another implementation in LLVM's libc++.
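
As a minimal sketch of what usage looks like (illustrative function and variable names, assuming GCC 11.2 or later):

    #include <experimental/simd>
    #include <cstddef>
    namespace stdx = std::experimental;

    // Scale an array in place; the SIMD width is whatever native_simd
    // maps to on the build target (SSE2, AVX2, NEON, ...).
    void Scale(float* p, std::size_t n, float s) {
        using V = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V v(p + i, stdx::element_aligned);        // vector load
            v *= s;
            v.copy_to(p + i, stdx::element_aligned);  // vector store
        }
        for (; i < n; ++i) p[i] *= s;                 // scalar tail
    }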

I fully understand the upstream position regarding platform-specific code. After internal discussions with my team, we will carefully evaluate and determine our plan. There seem to be two possible directions at this point: contributing portable optimizations upstream through an abstraction layer (going through the RFC route), or maintaining a downstream fork with RVV-specific code paths to suit our project needs.

Thank you again for your detailed and thoughtful response. It has been very helpful in shaping our direction.

rouault commented 3 weeks ago

FYI, in https://github.com/OSGeo/gdal/pull/11202 I've used the sse2neon.h header, which works very well. Not sure if there's a similar sse2rvv.h ;-)

camel-cdr commented 3 weeks ago

FYI, in #11202 I've used the sse2neon.h header, which works very well. Not sure if there's a similar sse2rvv.h ;-)

There is: sse2rvv and neon2rvv

But I wouldn't recommend using them for more than a quick initial port, because they don't allow you to take advantage of the full vector length. You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic generic SIMD.

From what I've seen of the codebase, I would recommend progressively adding custom RVV codepaths (sketched below), because the SIMD usage seems to be mostly in isolated places.
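
As a minimal sketch of what such a codepath could look like (an illustrative kernel, not taken from GDAL, using the ratified RVV C intrinsics):

    #include <riscv_vector.h>
    #include <cstddef>

    // Stripmined loop: __riscv_vsetvl picks how many elements fit in each
    // iteration, so one binary exploits the full vector length of any
    // RVV 1.0 implementation, whatever its VLEN.
    void Scale(float* p, std::size_t n, float s) {
        for (std::size_t vl; n > 0; n -= vl, p += vl) {
            vl = __riscv_vsetvl_e32m8(n);
            vfloat32m8_t v = __riscv_vle32_v_f32m8(p, vl);
            v = __riscv_vfmul_vf_f32m8(v, s, vl);
            __riscv_vse32_v_f32m8(p, v, vl);
        }
    }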

due to the absence of access to that hardware, either locally or with continuous integration platforms as provided by GitHub workflows

Some RVV 1.0 hardware is already available; see "Processors with RVV 1.0": https://camel-cdr.github.io/rvv-bench-results/index.html

You can just use qemu in the GitHub CI. That's even better than real hardware, because you can configure it to use different vector lengths and adjust some other implementation details.

or falls behind bugfixes

Yeah, that could happen if you don't have the capacity to maintain it. Hopefully problems would get caught if tests are run in the CI.

See, for example, the RVV support now in gnuradio/volk for an example of a CI setup.

rouault commented 3 weeks ago

You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic generic SIMD.

I don't know the RVV specifics, but on Intel, in the few times I've compared SSE2 vs AVX2 in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".

From what I've seen of the codebase, I would recommend successively adding custom RVV codepaths, because the SIMD usage seems to be mostly in isolated places.

Did you identify specific places where that would be beneficial? The measured runtime speed enhancement vs implementation & maintenance cost ratio would have to be assessed case by case.

camel-cdr commented 3 weeks ago

I don't know the RVV specifics, but on Intel, in the few times I've compared SSE2 vs AVX2 in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".

The difference for RVV should be larger: x86 CPUs still try to keep SSE fast because of legacy code, while RVV implementations tend not to specifically optimize for widths below their full vector length.

Did you identify specific places where that would be beneficial? The measured runtime speed enhancement vs implementation & maintenance cost ratio would have to be assessed case by case.

No, I didn't, because I didn't know about this project before finding this issue. I just wanted to suggest how I'd approach adding RVV optimizations.