[Issue]: standard math operations on vector types fail to build with GNU g++

Oblomov commented 7 months ago

Problem Description

The standard math operators don't work on vector types as defined by amd_hip_vector_types.h when using g++ as host compiler.

Operating System

Debian trixie/sid

CPU

AMD Ryzen 7 PRO 6850H with Radeon Graphics

GPU

AMD Instinct MI300X, AMD Instinct MI300A, AMD Instinct MI250X, AMD Instinct MI250, AMD Instinct MI210, AMD Instinct MI100, AMD Radeon Pro W7900, AMD Radeon Pro W6800, AMD Radeon Pro V620, AMD Radeon Pro VII, AMD Radeon RX 7900 XTX, AMD Radeon RX 7900 XT, AMD Radeon VII

ROCm Version

ROCm 6.0.0

ROCm Component

HIP

Steps to Reproduce

A simple test case reproducing the issue is:

#include <hip/hip_vector_types.h>
#include <iostream>

int main() {
    const float4 v1{1.0f, 2.0f, 3.0f, 4.0f};
    const float4 v2{4.0f, 3.0f, 2.0f, 1.0f};
    float4 mul = v1*v2;
    std::cout << mul.x << " " << mul.y << " " << mul.z << " " << mul.w << std::endl;
}

This builds correctly with hipcc or clang++, but not with g++, where the following error is shown instead:

g++ -Wall -Wextra  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.0.3/include -I/opt/rocm-6.0.3/lib/llvm/lib/clang/17.0.0     vecmul.cc  -lm -o vecmul
In file included from /opt/rocm-6.0.3/include/hip/hip_vector_types.h:33,
                 from vecmul.cc:1:
/opt/rocm-6.0.3/include/hip/amd_detail/amd_hip_vector_types.h: In instantiation of ‘HIP_vector_type<T, rank>& HIP_vector_type<T, rank>::operator*=(const HIP_vector_type<T, rank>&) [with T = float; unsigned int rank = 4]’:
/opt/rocm-6.0.3/include/hip/amd_detail/amd_hip_vector_types.h:551:39:   required from ‘constexpr HIP_vector_type<float, 4> operator*(HIP_vector_type<float, 4>, const HIP_vector_type<float, 4>&)’
vecmul.cc:7:18:   required from here
/opt/rocm-6.0.3/include/hip/amd_detail/amd_hip_vector_types.h:544:18: error: invalid operands of types ‘HIP_vector_base<float, 4>::Native_vec_’ {aka ‘float [4]’} and ‘const float*’ to binary ‘operator*’
  544 |             data *= x.data;
      |             ~~~~~^~~~~~~~~
/opt/rocm-6.0.3/include/hip/amd_detail/amd_hip_vector_types.h:544:18: note:   in evaluation of ‘operator*=(using HIP_vector_base<float, 4>::Native_vec_ = float [4] {aka float [4]}, const float*)’
make: *** [Makefile:11: vecmul] Error 1

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

This is most likely due to the way __NATIVE_VECTOR__ gets defined, which in this case becomes

        #define __NATIVE_VECTOR__(n, T) T[n]

The operators need to be redefined to something like

        __HOST_DEVICE__
        HIP_vector_type& operator*=(const HIP_vector_type& x) noexcept
        {
            #if __has_attribute(ext_vector_type)
                data *= x.data;
            #else
                for (auto i = 0u; i < rank; ++i)
                    data[i] *= x.data[i];
            #endif
            return *this;
        }

and similarly for all other operators.

Some notes:

this issue has been present since HIP version 4.2 at least and is still present in 6.0.3
I had previously submitted a patch to fix this and a number of other issues to the old ROCm/hipamd repository.
this is independent from the specific GPU, it's a header-level issue.

cjatin commented 7 months ago

can you share you g++ --version and g++ -dumpmachine

Oblomov commented 7 months ago

Version:

g++ (Debian 13.2.0-23) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Machine:

x86_64-linux-gnu

This supports __has_attribute (IIRC this was first introduced in GCC 5), but does not support ext_vector_type_attribute, as easily verified by

printf '#if defined(__has_attribute)\n#pragma message("has __has_attribute")\n#if __has_attribute(ext_vector_type)\n#pragma message("has ext_vector_type attribute")\n#else\n#pragma message("NO ext_vector_type_attribute")\n#endif\n#endif\n' | g++ -x c++ -c - -o /dev/null

A possible alternative to the #ifdef I mentioned, as a solution, would be to use something like __attribute__((vector_size (sizeof(T)*N))) for GCC.

ppanchad-amd commented 2 months ago

Hi @Oblomov, internal ticket has been created to investigate your issue. Thanks!

tcgu-amd commented 2 months ago

Hi @Oblomov thank you for reaching out! Is there a specific use case you have in mind that would require g++ instead of clang? The solution you provided would work, but it would treat the vectors as simple vectors instead of SIMD vectors, which can result in significant performance gaps. Hence, it is hard for us to justify supporting g++ at all.

Oblomov commented 2 months ago

Hello @tcgu-amd, thanks for getting back to me on this issue.

I first came across the issue while adding support for AMD HIP to GPUSPH (you can find some of the preliminary work in that direction in the spheric2022 branch; there has been a lot of work done since, but the branch hasn't been published yet; I can give you access to the private repository where the development is ongoing if necessary; at the moment, we're forced to ship a patched version of the vector math header because of these and other issues).

GPUSH is a very complex project with several optional dependencies that aren't always well-supported by device compilers, so the host and device side of the software are normally built with separate compilers (the default host compiler, which is usually g++, is normally used for the host part, and nvcc or hipcc for the device part, although both these options can be overridden).

Some vector types are used on host too (even if not for computationally intensive operations), so the header needs to compile successfully regardless of compiler usage. I would argue that this should be the case regardless of my specific use case, and even if this leads to suboptimal performance on host. I would like to stress that I'm referring here specifically to the host side, since the device side would still be compiled with clang-based hipcc and thus still use the ext_vector_type definition.

That being said, I have just run some simple tests and gcc does autovectorize the loops, thereby treating the vector types as SIMD types. This is probably due to the extrema being constexpr on an array of correctly-aligned data.

A trivial example is this:

#include <hip/hip_vector_types.h>
#include <iostream>

__attribute__((noinline))
float4 doit(float4 const& v1, float4 const& v2)
{
    return v1*v2;
}

int main(int argc, char* argv[]) {
    const float c1 = argc > 1 ? atof(argv[1]) : 1.0f;
    const float c2 = argc > 2 ? atof(argv[2]) : 2.0f;
    const float c3 = argc > 3 ? atof(argv[3]) : 3.0f;
    const float c4 = argc > 4 ? atof(argv[4]) : 4.0f;
    const float4 v1{c1, c2, c3, c4};
    const float4 v2{c4, c3, c2, c1};
    float4 mul = doit(v1,v2);
    std::cout << mul.x << " " << mul.y << " " << mul.z << " " << mul.w << std::endl;
}

with the header patched as suggested in the issue description. Looking at the disassembly of the doit function, it is something like:

00000000000013e0 <_Z4doitRK15HIP_vector_typeIfLj4EES2_>:
    13e0:       66 0f 6f 07             movdqa (%rdi),%xmm0
    13e4:       0f 28 16                movaps (%rsi),%xmm2
    13e7:       0f 59 d0                mulps  %xmm0,%xmm2
    13ea:       0f 29 54 24 e8          movaps %xmm2,-0x18(%rsp)
    13ef:       f3 0f 7e 4c 24 f0       movq   -0x10(%rsp),%xmm1
    13f5:       f3 0f 7e 44 24 e8       movq   -0x18(%rsp),%xmm0
    13fb:       c3                      ret
    13fc:       0f 1f 40 00             nopl   0x0(%rax)

tcgu-amd commented 2 months ago

@Oblomov Thanks for adding the context! I was not aware that gcc autovectorizes loops; this is neat. I will create an internal ticket to investigate adding this patch to the source code.

tcgu-amd commented 2 months ago

@Oblomov Good news! Seems like the functionality you suggested is already implemented! It was part of this commit https://github.com/ROCm/clr/commit/d4799b2a3f07d61d05ca7c4267a92b83b4421b06 and has been present since rocm 6.1.x https://github.com/ROCm/clr/blob/rocm-6.1.x/hipamd/include/hip/amd_detail/amd_hip_vector_types.h#L553. I just ran your reproducer on 6.2 and it was working. Would you mind trying on 6.2 (or 6.1) and let us know the result? Thanks!

Oblomov commented 2 months ago

Hello @tcgu-amd thank you very much, I have tested 6.1.0 and 6.2.1 and the issue seems to be fixed indeed. Thank you very much, I'm closing the issue since I consider it solved for me.

ROCm / clr