alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0
353 stars 72 forks source link

OpenMP 5 accelerator does not compile with ICC #62

Closed BenjaminW3 closed 2 years ago

BenjaminW3 commented 9 years ago

The ExecCpuOmp4 / AccCpuOmp4 does not compile with ICC because: ... called in "omp target" region must appear in prior "omp declare target" pragma. This affects at least core::mapIdx, block::sync::syncBlockThreads and all the accelerator functions that can be called within a kernel. GCC compiles this correctly and does not even issue a warning (vioalting the standard). This is reasonable because it currently only supports CPU targets which does not require to build for additional back-ends.

The problem is, that the omp declare target syntax is so verbose that it can not be represented by a simple annotation in front of the functions but requires #pragma omp declare target and #pragma omp end declare target after the functions.

bussmann commented 9 years ago

How would one take care of that syntax in alpaka?

BenjaminW3 commented 9 years ago

It is hard to unify the syntax requirements for OpenMP 4 and CUDA. A possible accelerator function with template that could be called on a CUDA device, an OpenMP 4 device and on the host would look like:

#pragma omp declare target
template<
   typename TAcc>
__host__ __device__ auto doSomething(
    TAcc const & acc)
-> void
{
    // ...
}
#pragma omp end declare target

So there are 3 places where a macro would be required. It is even questionable if both OpenMP 4 and CUDA can be used together (this is a question of compiler support).

ax3l commented 9 years ago

talking about intel, do we have plans for a TBB backend (open an independent issue?)

BenjaminW3 commented 9 years ago

At least not in the near future ;-) Please open a seperate issue for it, because TBB is compiler independent.

bussmann commented 9 years ago

I think in the near future offloading will still be required, so that this is at least annoying. As for TBB - does it provide the performance on Phi?

ax3l commented 9 years ago

Afaik, TBB is the recommendation from Intel when parallelizing with C++ template meta programming. They even recommend it over OpenMP (for C++ projects).

ax3l commented 9 years ago

moved to #65

BenjaminW3 commented 9 years ago

For now, the OpenMP 4 back-end is disabled by default and has to be enabled explicitly to allow compilation with ICC.

j-stephan commented 2 years ago

@jkelling Do you know whether this is still a problem?

jkelling commented 2 years ago

@jkelling Do you know whether this is still a problem?

No idea if the new OpenMP 5 backend works with ICC. I never tested that.

However: The initial issue would be a compiler bug at least since OpenMP 4.5 as these functions are inline. Separable compilation is not supported by the new backend, which kinda addresses a potential related issue as "resolved---won't fix".

j-stephan commented 2 years ago

It appears we would need to set the appropriate compiler flags for ICPC: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/adding-openmp-support-to-your-application.html

But it mentions that it has (partial) support for OpenMP 5.

jkelling commented 2 years ago

It appears we would need to set the appropriate compiler flags [...]

If it is only about compiler flags, then this is technically not an issue: Alpaka's cmake configuration does ATM not set the compiler flags for OpenMP5/OpenACC for any compiler completely (it does add -fopenmp for GCC/Clang, which should also be part of with Intel). README_OMP5.md has some information about the required flags for GCC, Clang and some others. Currently, these need to be added manually by the user. This also applies to Intel, i.e. if this is regarded as an issue, the fix would be to add the Intel flags to that file.

Adding complete cmake support for offloading, which entails an additional option to select the target, would be a different issue. Is there any work being done on adding support for this directly to cmake?

j-stephan commented 2 years ago

Is there any work being done on adding support for this directly to cmake?

I'm not aware of any efforts in this direction. This would likely also depend on the compiler because I imagine that each compiler will support offloading for different hardware targets.

j-stephan commented 2 years ago

I played around a bit with the current ICPC:

$ icpc --version
icpc (ICC) 2021.2.0 20210228
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

The relevant compiler flags are -qopenmp -qopenmp-offload (or -qopenmp-offload=host for disabling the offload). Leaving out -qopenmp-offload will make ICPC complain about unknown OpenMP pragmas (everything related to #pragma omp target).

However, even with -qopenmp-offload it refuses to compile/link the current alpaka HEAD. Configuration:

$ cmake -DOpenMP_CXX_VERSION=5 \
-DALPAKA_ACC_ANY_BT_OMP5_ENABLE=ON \
-DBUILD_TESTING=ON \
-Dalpaka_BUILD_EXAMPLES=ON \
-DCMAKE_CXX_FLAGS="-shared-intel -qopenmp -qopenmp-offload" \     
-DCMAKE_CXX_COMPILER=icpc ..
-- The CXX compiler identification is Intel 2021.2.0.20210228
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/icpc - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found Boost: /usr/lib64/cmake/Boost-1.76.0/BoostConfig.cmake (found suitable version "1.76.0", minimum required is "1.65.1") found components: fiber 
-- Found OpenMP_CXX: -qopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- ALPAKA_ACC_ANY_BT_OMP5_ENABLED
-- The C compiler identification is GNU 11.1.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Catch2: Using INTERNAL version 2.13.3
-- ALPAKA_DEV_COMPILE_OPTIONS: -Wall
-- Configuring done
-- Generating done
-- Build files have been written to: /home/stepha27/workspace/caravan/test/alpaka/build

Build:

$ cmake --build . --target vectorAdd
Building CXX object example/vectorAdd/CMakeFiles/vectorAdd.dir/src/vectorAdd.cpp.o
/home/stepha27/workspace/caravan/test/alpaka/include/alpaka/kernel/TaskKernelOmp5.hpp(185): warning #2570: function has not been declared with compatible "target" attribute
                              { kernelFnObj(acc, args...); },
                                ^
          detected during:
            instantiation of "auto alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...>::operator()(const alpaka::DevOmp5 &) const->void [with TDim=alpaka::DimInt<1UL>, TIdx=std::size_t={unsigned long}, TKernelFnObj=VectorAddKernel, TArgs=<alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned
                      int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>,
                      std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &>]" at line 251
            instantiation of "auto alpaka::traits::Enqueue<alpaka::QueueOmp5Blocking, alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...>, void>::enqueue(alpaka::QueueOmp5Blocking &, const alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...> &)->void [with TDim=alpaka::DimInt<1UL>, TIdx=std::size_t={unsigned long}, TKernelFnObj=VectorAddKernel, TArgs=<alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned
                      int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *,
                      alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &>]" at line 48 of "/home/stepha27/workspace/caravan/test/alpaka/include/alpaka/queue/Traits.hpp"
            instantiation of "auto alpaka::enqueue(TQueue &, TTask &&)->void [with TQueue=alpaka::Queue<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, alpaka::property::Blocking>, TTask=const alpaka::TaskKernelOmp5<alpaka::DimInt<1UL>, std::size_t={unsigned long}, VectorAddKernel, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned
                      long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>,
                      uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &> &]" at line 177 of "/home/stepha27/workspace/caravan/test/alpaka/example/vectorAdd/src/vectorAdd.cpp"

/home/stepha27/workspace/caravan/test/alpaka/include/alpaka/kernel/TaskKernelOmp5.hpp(185): warning #2570: *MIC* function has not been declared with compatible "target" attribute
                              { kernelFnObj(acc, args...); },
                                ^
          detected during:
            instantiation of "auto alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...>::operator()(const alpaka::DevOmp5 &) const->void [with TDim=alpaka::DimInt<1UL>, TIdx=std::size_t={unsigned long}, TKernelFnObj=VectorAddKernel, TArgs=<alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned
                      int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>,
                      std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &>]" at line 251
            instantiation of "auto alpaka::traits::Enqueue<alpaka::QueueOmp5Blocking, alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...>, void>::enqueue(alpaka::QueueOmp5Blocking &, const alpaka::TaskKernelOmp5<TDim, TIdx, TKernelFnObj, TArgs...> &)->void [with TDim=alpaka::DimInt<1UL>, TIdx=std::size_t={unsigned long}, TKernelFnObj=VectorAddKernel, TArgs=<alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned
                      int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *,
                      alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &>]" at line 48 of "/home/stepha27/workspace/caravan/test/alpaka/include/alpaka/queue/Traits.hpp"
            instantiation of "auto alpaka::enqueue(TQueue &, TTask &&)->void [with TQueue=alpaka::Queue<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, alpaka::property::Blocking>, TTask=const alpaka::TaskKernelOmp5<alpaka::DimInt<1UL>, std::size_t={unsigned long}, VectorAddKernel, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned
                      long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>, uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, alpaka::Elem<alpaka::Buf<alpaka::ExampleDefaultAcc<alpaka::DimInt<1UL>, std::size_t={unsigned long}>,
                      uint32_t={__uint32_t={unsigned int}}, alpaka::DimInt<1UL>, std::size_t={unsigned long}>>={std::remove_volatile_t<uint32_t={__uint32_t={unsigned int}}>={unsigned int}} *, const std::size_t={unsigned long} &> &]" at line 177 of "/home/stepha27/workspace/caravan/test/alpaka/example/vectorAdd/src/vectorAdd.cpp"

Linking CXX executable vectorAdd
ld: cannot find /opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/../../compiler/lib/intel64_lin/i_ofldbegin_target.o: No such file or directory
make[3]: *** [example/vectorAdd/CMakeFiles/vectorAdd.dir/build.make:98: example/vectorAdd/vectorAdd] Error 1
make[2]: *** [CMakeFiles/Makefile2:1884: example/vectorAdd/CMakeFiles/vectorAdd.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:1891: example/vectorAdd/CMakeFiles/vectorAdd.dir/rule] Error 2
make: *** [Makefile:660: vectorAdd] Error 2

Adding __attribute__((target("avx2"))) to the kernel does not change the output. I guess there is a missing pragma somewhere or something...

jkelling commented 2 years ago

Thanks for testing.

Is that the new one, based on clang? (I suggest to ignore the old one)

The warnings can be ignored: The standard states, that the declare target is optional for functions defined in the same compilation unit as the target code calling them.

The wrror:

ld: cannot find /opt/intel/oneapi/compiler/2021.2.0/linux/bin/intel64/../../compiler/lib/intel64_lin/i_ofldbegin_target.o: No such file or directory

Does not look like something that could be caused by anything in the code. it looks more like a broken installation.

j-stephan commented 2 years ago

Is that the new one, based on clang? (I suggest to ignore the old one)

Nope, that is the "classic" Intel compiler. The other one is called dpcpp or icpx (seem to be synonyms; haven't tested that one yet).

Does not look like something that could be caused by anything in the code. it looks more like a broken installation.

From what I could gather through a quick search this seems to be related to MIC (or the lack thereof on my system). My assumption is that the compiler tries to offload to a MIC target (see the warnings) and fails to find the precompiled file needed to do so.

j-stephan commented 2 years ago

Using dpcpp / icpx works out of the box.

Configuration:

$ cmake -DOpenMP_CXX_VERSION=5 \
-DALPAKA_ACC_ANY_BT_OMP5_ENABLE=ON \
-DBUILD_TESTING=ON \
-Dalpaka_BUILD_EXAMPLES=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_C_FLAGS="-shared-intel -fiopenmp -fopenmp-targets=spir64" \
-DCMAKE_CXX_COMPILER=icpx \
-DCMAKE_CXX_FLAGS="-shared-intel -fiopenmp -fopenmp-targets=spir64" \
-DCMAKE_BUILD_TYPE=Release \
..
-- The CXX compiler identification is IntelLLVM 2021.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Boost toolset is unknown (compiler IntelLLVM 2021.2.0)
-- Boost toolset is unknown (compiler IntelLLVM 2021.2.0)
-- Boost toolset is unknown (compiler IntelLLVM 2021.2.0)
-- Found Boost: /usr/lib64/cmake/Boost-1.76.0/BoostConfig.cmake (found suitable version "1.76.0", minimum required is "1.65.1") found components: fiber 
-- Found OpenMP_CXX: -fiopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- ALPAKA_ACC_ANY_BT_OMP5_ENABLED
-- The C compiler identification is IntelLLVM 2021.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/icx - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Catch2: Using INTERNAL version 2.13.3
-- ALPAKA_DEV_COMPILE_OPTIONS: 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/stepha27/workspace/caravan/test/alpaka/build

Build (the "unused argument" is required for linking, not compiling):

$ cmake --build . --target vectorAdd
Building CXX object example/vectorAdd/CMakeFiles/vectorAdd.dir/src/vectorAdd.cpp.o
clang++: warning: argument unused during compilation: '-shared-intel' [-Wunused-command-line-argument]
Linking CXX executable vectorAdd
Built target vectorAdd

Execution:

$ ./vectorAdd     
Using alpaka accelerator: AccOmp5<1,m>
Time for kernel execution: 0.837207s
Time for HtoD copy: 3.9639e-05s
Execution results correct!

:tada: