ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Align > 32 bytes #1563

Open ax3l opened 8 years ago

ax3l commented 8 years ago

reported by @Flamefire here:

Just found an interesting issue while testing. Executing the following code yields semi-random results:

#include <curand_kernel.h>
#include <cstdio> // for printf

#define __optimal_align__(byte)   \
        __align__(                \
        ((byte)==1?1:             \
        ((byte)<=2?2:             \
        ((byte)<=4?4:             \
        ((byte)<=8?8:             \
        ((byte)<=16?16:           \
        ((byte)<=32?32:           \
        ((byte)<=64?64:128        \
        ))))))))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var
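// PMACC_ALIGN(var, T...) declares a member "var" of type T, aligned to
// sizeof(T) rounded up to the next power of two (capped at 128 bytes).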

struct Foo{
PMACC_ALIGN(rng, curandStateXORWOW_t);
};

    template<class T_RNG, class T_Mapper>
    __global__ void
    testRNG(T_RNG rng, T_Mapper mapper)
    {
        if(!threadIdx.x)
            printf("value %i\n", mapper);
    }

int main(){
    int value=1;
    Foo foo;
    testRNG<<<1,1>>>(foo, value);
    cudaDeviceSynchronize();
}

This is basically a proof-of-concept of a bug caused by alignment. I get the same ABI-change warning when compiling this with nvcc (7.0, 7.5) and g++ 4.8. In my more complex case I don't get that warning, although the behaviour is the same (semi-random results, big struct somewhere).

This does not happen if the maximum alignment is set to 32 bytes. It also does not happen if there is another 32-byte aligned parameter before value. If that parameter is aligned to more or less than 32 bytes, the bug happens again.
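For illustration, a minimal sketch of the workaround just described, i.e. capping the macro from the snippet above at 32 bytes (the name __optimal_align_max32__ is made up here):

#define __optimal_align_max32__(byte) \
        __align__(                    \
        ((byte)==1?1:                 \
        ((byte)<=2?2:                 \
        ((byte)<=4?4:                 \
        ((byte)<=8?8:                 \
        ((byte)<=16?16:32             \
        ))))))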

psychocoderHPC commented 8 years ago

@Flamefire I can't reproduce any errors with the given example. Could you please try it again?

I used the following modules on a K80 and a K20:

  1) mpfr/3.1.2                    5) cmake/3.3.0                   9) hdf5-parallel/1.8.14
  2) mpc/1.0.1                     6) cuda/7.0                     10) libsplash/1.4.0
  3) gmp/5.1.1                     7) openmpi/1.8.4.kepler.cuda70  11) numactl/2.0.7
  4) gcc/4.8.2                     8) boost/1.56.0                 12) valgrind/3.8.1
Flamefire commented 8 years ago

I can't get onto the cluster queues, so I verified this locally:

Cuda compilation tools, release 7.0, V7.0.27
Cuda compilation tools, release 7.5, V7.5.17
g++ (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5

Result:

nvcc paramOverwrite2.cu 
paramOverwrite2.cu(31): warning: variable "foo" is used before its value is set

paramOverwrite2.cu(31): warning: variable "foo" is used before its value is set

paramOverwrite2.cu: In function ‘void testRNG(T_RNG, T_Mapper) [with T_RNG = Foo; T_Mapper = int]’:
paramOverwrite2.cu:22:1: note: The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6
     testRNG(T_RNG rng, T_Mapper mapper)
 ^

 ./a.out 
value -71320066
psychocoderHPC commented 8 years ago

Could it be a driver issue?

Flamefire commented 8 years ago

OK, I was able to test it on the K80 now:

Cuda compilation tools, release 7.0, V7.0.27
g++-4.8.2 (GCC) 4.8.2
nvcc paramAlign.cu 
paramAlign.cu(31): warning: variable "foo" is used before its value is set

paramAlign.cu(31): warning: variable "foo" is used before its value is set

paramAlign.cu: In function ‘void testRNG(T_RNG, T_Mapper) [with T_RNG = Foo; T_Mapper = int]’:
paramAlign.cu:22:1: note: The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6
     testRNG(T_RNG rng, T_Mapper mapper)
 ^
grund59@kepler028-gsi:~$ ./a.out 
/// NO OUTPUT! Check says "Invalid device ordinal"

CUDA 7.5 shows no error for this case. However, I was able to reproduce this behaviour with 7.5 as well by using struct Bar{char v[65];}; instead of curandStateXORWOW_t, which results in 128-byte alignment.

Made an example that tests this for all sizes:

#include <iostream>
#include <cstdio>
#include <cstddef> // size_t

#define __optimal_align__(byte)   \
        __align__(                \
        ((byte)==1?1:             \
        ((byte)<=2?2:             \
        ((byte)<=4?4:             \
        ((byte)<=8?8:             \
        ((byte)<=16?16:           \
        ((byte)<=32?32:           \
        ((byte)<=64?64:128        \
        ))))))))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

template<size_t N>
struct array{ char v[N];};

template<size_t T_N>
struct Foo{
static const size_t N = T_N;
PMACC_ALIGN(dummy,array<N>);
};

template<class T_RNG, class T_Mapper>
__global__ void
testKernel(T_RNG rng, T_Mapper mapper)
{
    unsigned N = T_RNG::N;
    printf("%u value %i\n", N, mapper);
}

template<size_t N>
struct CallKernel{
    void operator()(int value){
        CallKernel<N-1>()(value);
        testKernel<<<1,1>>>(Foo<N>(), value);
    }
};
template<>
struct CallKernel<0>
{
    void operator()(int){}
};

int main(){
    int value=1;
    CallKernel<129>()(value);
    cudaError_t code = cudaDeviceSynchronize();
    if(code != cudaSuccess)
        std::cout << "ERROR: " << cudaGetErrorString(code) << std::endl;
}
psychocoderHPC commented 8 years ago

It looks like it is not allowed to align beyond the cache line size.

Get the cache line size on Linux:

getconf LEVEL1_DCACHE_LINESIZE
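For reference, a sketch of how the same value can be queried from a program (assumes glibc; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may not exist on every platform):

#include <unistd.h>
#include <cstdio>

int main(){
    // L1 data cache line size in bytes; 0 or -1 means unknown/unsupported.
    long lineSize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 dcache line size: %ld\n", lineSize);
}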
Flamefire commented 8 years ago

I don't think so. The output on my laptop is 64, but failures already happen at 64-byte alignment. It also does not explain why CUDA 7.5 works for 64 but 7.0 does not.

psychocoderHPC commented 8 years ago

We can create a bug report for NVIDIA and/or post this small example in the NVIDIA forum.

I searched for alignment restrictions for the stack but only found what I also posted in the other issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44948

psychocoderHPC commented 8 years ago

@Flamefire

What I found in section 5.1.1 of the PTX ISA document from NVIDIA is:

"Registers differ from the other state spaces in that they are not fully addressable, i.e., it is not possible to refer to the address of a register. When compiling to use the Application Binary Interface (ABI), register variables are restricted to function scope and may not be declared at module scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg variables, the compiler silently disables use of the ABI. Registers may have alignment boundaries required by multi-word loads and stores." (http://docs.nvidia.com/cuda/parallel-thread-execution/index.html)

Your laptop device is sm_20; maybe that is why you also see problems at 64 bytes.

Flamefire commented 8 years ago

It is possible that the arch version has something to do with this issue. From the document you posted, I don't find any alignment requirement in general, as all of that refers to PTX code, which is generated by nvcc, so nvcc should also handle alignment issues. If anything, one could say that large structs that should be moved at once might benefit from (manual) alignment. But I think any benefits there are outweighed by the additional memory usage (and transfer) incurred by e.g. aligning a 44-byte struct to a 64-byte boundary. Also, the maximum vector size for a load operation is 128bit=32 byte. So I think any alignment above 32 bytes is counterproductive.

psychocoderHPC commented 8 years ago

I found this for gcc 4.3^^ https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html

It says that we can ask the compiler for the maximal useful alignment:

struct Foo
{
   short array[256] __attribute__ ((aligned));
};

And in C++11 we can check the alignment with

//...
std::cout<<alignof(Foo)<<std::endl;

If I check it, I always get 16 as the result.
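Put together as a complete check, a minimal sketch of the two snippets above (needs C++11 for alignof):

#include <iostream>

struct Foo
{
    // "aligned" without an argument requests the maximum useful alignment.
    short array[256] __attribute__ ((aligned));
};

int main(){
    std::cout << alignof(Foo) << std::endl; // prints 16 on the systems tested here
}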

psychocoderHPC commented 8 years ago

My solution for a fix in PMacc is:

#include <boost/align/alignment_of.hpp>

namespace pmacc
{
    struct MaxAlignTestObject
    {
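        // "aligned" without an argument requests the maximum useful alignment
        // for the target machine, so alignment_of<> below yields that maximum.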
        char x[128] __attribute__ ((aligned));
    };

    typedef boost::alignment::alignment_of<MaxAlignTestObject> max_align_t;
}

#define PMACC_POW2_ALIGNMENT(byte) \
    ((byte)==1?1:             \
    ((byte)<=2?2:             \
    ((byte)<=4?4:             \
    ((byte)<=8?8:             \
    ((byte)<=16?16:           \
    ((byte)<=32?32:           \
    ((byte)<=64?64:128        \
    )))))))

#define __optimal_align__(byte)                                   \
        __align__(                                                \
        PMACC_POW2_ALIGNMENT(byte) <= pmacc::max_align_t::value ? \
            PMACC_POW2_ALIGNMENT(byte) :                          \
            pmacc::max_align_t::value                             \
        )

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var
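To see what the capped macro produces, a sketch of a host-side check building on the definitions above (Wrapper and BigPayload are hypothetical example types; nvcc or the CUDA headers are assumed so that __align__ is defined, plus C++11 for alignof):

#include <iostream>

struct BigPayload { char raw[128]; };

struct Wrapper
{
    // Requests min(pow2(sizeof(BigPayload)), pmacc::max_align_t::value).
    PMACC_ALIGN(payload, BigPayload);
};

int main(){
    std::cout << alignof(Wrapper) << std::endl; // expected: 16 on the systems above
}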
psychocoderHPC commented 8 years ago

I checked my solution and I get the same alignments on host and device. The maximal alignment on hypnos K20 (a 64-bit system) is 16 bytes.

Flamefire commented 8 years ago

That does not seem optimal. This would mean a maximum alignment of 16, but CUDA supports 32-byte vector loads/stores. I'd just reduce the max alignment to 32 bytes and reference the CUDA guide.

psychocoderHPC commented 8 years ago

But the problem is that it is not safe to pass a type which is aligned to >= 32 bytes to a function. Therefore I would prefer a "safe" alignment for the auto-align method, and if the user knows that the type is never used as a parameter, he/she can align it by hand.

Flamefire commented 8 years ago

I think this is not correct. The documentation for the attribute you used states: "align a variable or field to the maximum useful alignment for the target machine you are compiling for". Note the "useful" here. It does not state that it is the maximum allowed value. You can still use higher alignments; you just have to make sure that interfaces are ABI compatible, which they are if they are compiled from the same source with the same compilers and switches. Our problem here is that nvcc expects a different alignment strategy than g++ uses for >=64-byte alignment. Your approach would even fail in this scenario: compiling for a 512-bit SIMD host (looking at Xeon Phis) will lead to 64-byte alignment (which is the maximum useful value there) and break when passing that to the GPU, which expects a different alignment.

Side note: as of my experiments, >= 64 byte is unsafe (currently), not >= 32 byte.

psychocoderHPC commented 8 years ago

I agree with you, but I can't find out how we can get the maximal alignment defined by the ABI.

psychocoderHPC commented 8 years ago

I found here that the stack for x86-64 is aligned to 16 bytes.

Flamefire commented 8 years ago

I'd say: testing. We know that >32-byte alignment is not advantageous on the GPU due to the maximum vector size, so we can use that. If we wanted to use more, we'd need to work around or avoid issues like "nvcc with g++>=3.4". If we want to be conservative, we could use min(32, pmacc::max_align_t::value), as we are unaware of any problems with that (see the sketch below).

Yes, the stack is 16-byte aligned. That does not mean that parameters passed on the stack need to be 16-byte aligned: pass a 256-byte aligned struct on the stack and the stack itself is still only aligned to 16 bytes.
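A sketch of that conservative cap, reusing PMACC_POW2_ALIGNMENT and pmacc::max_align_t from the fix above, i.e. replacing the cap in __optimal_align__ (the PMACC_MAX_SAFE_ALIGN name is made up here):

// Clamp the requested alignment to min(32, pmacc::max_align_t::value).
#define PMACC_MAX_SAFE_ALIGN \
    (32 <= pmacc::max_align_t::value ? 32 : pmacc::max_align_t::value)

#define __optimal_align__(byte)                              \
        __align__(                                           \
        PMACC_POW2_ALIGNMENT(byte) <= PMACC_MAX_SAFE_ALIGN ? \
            PMACC_POW2_ALIGNMENT(byte) :                     \
            PMACC_MAX_SAFE_ALIGN                             \
        )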

ax3l commented 8 years ago

This needs an upstream (NVIDIA CUDA bugtracker) report to get more information, e.g., whether 32-byte alignment is safe or not.

ax3l commented 8 years ago

Note from @psychocoderHPC: we should make the question we ask more precise (and check again with CUDA 8.0):

Is a copy-by-value struct parameter to a `__global__` kernel call, aligned larger (>) than 16 bytes, safe/well-defined?
Are there any ABI restrictions regarding the fact that parameter stacks on most (host) systems are 16-byte aligned?

We see problems with 128-byte alignment on our cluster (K80) and even with 64-byte alignment on an sm_20 device (mini-example above; icc & gcc).

Update: submitted as bug ID 1809741

Flamefire commented 8 years ago

I could even see this with 64 bytes on the K80 with CUDA 7.0, although this seems unreliable to reproduce.

ax3l commented 8 years ago

Good information, thanks! But probably only CUDA 7.5+ will be relevant for upstream requests.

ax3l commented 8 years ago

Let us leave this issue open for now to collect further feedback from NVIDIA, so we know what kind of "above 16 byte" tunings we can apply again.

ax3l commented 7 years ago

With the example from Alex:

// expected output:
//   value 1 2
//   no error
#include <cstdio>
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 65
#endif

struct Bar{char v[ARRAY_SIZE];};

struct Foo{
    Bar bar __align__(ALIGN);
};

__global__ void test(int i, Foo foo, int value){
    printf("value %i %i\n", i, value);
}

int main(){
    int value=2;
    Foo foo;
    test<<<1,1>>>(1, foo, value);
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
}
/*
Run on SM 3.X devices: nvcc -DALIGN=32 ok! nvcc -DALIGN=128 wrong! (the "value 1 2" line is missing)
On SM 2.0 devices you already get the unexpected behaviour with -DALIGN=64. The PTX code is the same for both SM targets:

ALIGN=64 (both systems):
.visible .entry _Z4testi3Fooi(
        .param .u32 _Z4testi3Fooi_param_0,
        .param .align 64 .b8 _Z4testi3Fooi_param_1[128],
        .param .u32 _Z4testi3Fooi_param_2
)

ALIGN=128:
.visible .entry _Z4testi3Fooi(
        .param .u32 _Z4testi3Fooi_param_0,
        .param .align 128 .b8 _Z4testi3Fooi_param_1[128],
        .param .u32 _Z4testi3Fooi_param_2
)
*/

The issue was reproduced locally by the support team and assigned to a developer in 09/2016.

Today I pinged the support team again. Response:

[..] By the way, could you please tell us why you need such large alignment (>64, or even >16)
for ByVal struct kernel parameters so that we can understand?
ax3l commented 7 years ago

Proposed answer with @psychocoderHPC:

As soon as one uses an array of structs, SIMD access to it should be aligned for optimal access. Our objects are generic (in number and size of struct arguments) and used both on host and device. The same (struct) object that might be stored (aligned) in an array can also be used as a scalar and passed to a kernel.

ax3l commented 7 years ago

We are currently aligning up to 32 bytes, as we saw no problems up to that size and think that is conservative enough to run on all platforms (host & device).

Nevertheless, from the support answer it looks like actually only up to 16 bytes is guaranteed to work on NVIDIA GPUs...

ax3l commented 6 years ago

Got news on our ticket (Bug ID 1809741) today: the problem will be fixed in the next CUDA release, hurray!

Note: at the time of writing, CUDA 9.1 is the latest release.

ax3l commented 3 years ago

There should be no upper limit anymore in CUDA, and the PTX errors are fixed in CUDA 11.0+. Please report if there are still errors/issues with this.
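For anyone who wants to re-check this on a current toolkit, a minimal sketch along the lines of the mini-example above (alignas is the standard C++11 spelling of the alignment request; expected output is "value 1" followed by "no error"):

#include <cstdio>

// A by-value kernel parameter with 128-byte alignment, as in the repro above.
struct alignas(128) Big { char v[65]; };

__global__ void check(Big, int value){
    printf("value %i\n", value);
}

int main(){
    check<<<1,1>>>(Big{}, 1);
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
}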