UoB-HPC / stdpar-nbody

https://research-information.bris.ac.uk/en/publications/efficient-tree-based-parallel-algorithms-for-n-body-simulations-u
MIT License

Cannot compile using AdaptiveCpp #13

Open limefax opened 5 months ago

limefax commented 5 months ago

Current Situation of compiling Barnes-Hut with AdaptiveCpp

Latest commit

Compilation:
AdaptiveCpp commit: b15cdcfe355be6a5f79d70a0703e67fe0afaa363 (Wed Jun 19 17:53:10 2024 +0200)
Barnes-Hut commit: 58fe163b7054cf22d600b507ea03ac946982f820 (Thu Jun 27 11:13:05 2024 +0100)

/opt/adaptivecpp/bin/acpp -std=c++20 -march=native -Ofast ./src/main.cpp -o ./nbody_acpp -I/opt/cuda/targets/x86_64-linux/include --acpp-targets="generic" --acpp-stdpar

Error:

In file included from ./src/main.cpp:8:
In file included from ./src/all_pairs.h:4:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/execution:34:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/pstl.hpp:31:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:39:
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:117:27: error: no matching function for call to object of type 'const (lambda at ./src/kernels.h:214:9)'
  117 |                           f(*it);
      |                           ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:114:25: note: while substituting into a lambda expression here
  114 |                         [=](sycl::id<1> id) {
      |                         ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:66:26: note: in instantiation of function template specialization 'hipsycl::algorithms::for_each_n<std::ranges::iota_view<unsigned int, unsigned int>::_Iterator, unsigned int, (lambda at ./src/kernels.h:214:9)>' requested here
   66 |     hipsycl::algorithms::for_each_n(queue, first, n, f);
      |                          ^
./src/kernels.h:211:10: note: in instantiation of function template specialization 'std::for_each_n<std::ranges::iota_view<unsigned int, unsigned int>::_Iterator, unsigned int, (lambda at ./src/kernels.h:214:9)>' requested here
  211 |     std::for_each_n(
      |          ^
./src/barnes_hut.h:60:9: note: in instantiation of function template specialization 'barnes_hut_step<float, unsigned int>' requested here
   60 |         barnes_hut_step<T, Index_t>(system, arguments, tree, step == 0);
      |         ^
./src/main.cpp:52:51: note: in instantiation of function template specialization 'run_barnes_hut<float>' requested here
   52 |       return run_simulation<T>(arguments, system, run_barnes_hut<T>);
      |                                                   ^
./src/main.cpp:65:9: note: in instantiation of function template specialization 'run_precision<float>' requested here
   65 |         run_precision<float>(arguments);
      |         ^
./src/kernels.h:214:9: note: candidate function template not viable: 'this' argument has type 'const (lambda at ./src/kernels.h:214:9)', but method is not marked const
  214 |         [tree] (auto tree_index) mutable { tree.clear(tree_index); });
      |         ^
In file included from <built-in>:3:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/detail/sycl_glue.hpp:40:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/util/allocation_cache.hpp:33:
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/vector:62:
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_algobase.h:66:
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:224:7: error: call to deleted function '__advance'
  224 |       std::__advance(__i, __d, std::__iterator_category(__i));
      |       ^~~~~~~~~~~~~~
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:65:10: note: in instantiation of function template specialization 'std::advance<std::ranges::iota_view<unsigned long>::_Iterator, unsigned int>' requested here
   65 |     std::advance(last, std::max(n, Size{0}));
      |          ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:63:20: note: while substituting into a lambda expression here
   63 |   auto offloader = [&](auto& queue) {
      |                    ^
./src/all_pairs.h:52:14: note: in instantiation of function template specialization 'std::for_each_n<std::ranges::iota_view<unsigned long>::_Iterator, unsigned int, (lambda at ./src/all_pairs.h:55:13)>' requested here
   52 |         std::for_each_n(
      |              ^
./src/main.cpp:56:51: note: in instantiation of function template specialization 'run_all_pairs_collapsed_step<float>' requested here
   56 |       return run_simulation<T>(arguments, system, run_all_pairs_collapsed_step<T>);
      |                                                   ^
./src/main.cpp:65:9: note: in instantiation of function template specialization 'run_precision<float>' requested here
   65 |         run_precision<float>(arguments);
      |         ^
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:202:5: note: candidate function [with _OutputIterator = std::ranges::iota_view<unsigned long>::_Iterator, _Distance = __int128] has been explicitly deleted
  202 |     __advance(_OutputIterator&, _Distance, output_iterator_tag) = delete;
      |     ^
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:157:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'input_iterator_tag' for 3rd argument
  157 |     __advance(_InputIterator& __i, _Distance __n, input_iterator_tag)
      |     ^                                             ~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:168:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'bidirectional_iterator_tag' for 3rd argument
  168 |     __advance(_BidirectionalIterator& __i, _Distance __n,
      |     ^
  169 |               bidirectional_iterator_tag)
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:184:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'random_access_iterator_tag' for 3rd argument
  184 |     __advance(_RandomAccessIterator& __i, _Distance __n,
      |     ^
  185 |               random_access_iterator_tag)
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:224:7: error: call to deleted function '__advance'
  224 |       std::__advance(__i, __d, std::__iterator_category(__i));
      |       ^~~~~~~~~~~~~~
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:116:32: note: in instantiation of function template specialization 'std::advance<std::ranges::iota_view<unsigned long>::_Iterator, unsigned long>' requested here
  116 |                           std::advance(it, id[0]);
      |                                ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:114:25: note: while substituting into a lambda expression here
  114 |                         [=](sycl::id<1> id) {
      |                         ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:66:26: note: in instantiation of function template specialization 'hipsycl::algorithms::for_each_n<std::ranges::iota_view<unsigned long>::_Iterator, unsigned int, (lambda at ./src/all_pairs.h:55:13)>' requested here
   66 |     hipsycl::algorithms::for_each_n(queue, first, n, f);
      |                          ^
./src/all_pairs.h:52:14: note: in instantiation of function template specialization 'std::for_each_n<std::ranges::iota_view<unsigned long>::_Iterator, unsigned int, (lambda at ./src/all_pairs.h:55:13)>' requested here
   52 |         std::for_each_n(
      |              ^
./src/main.cpp:56:51: note: in instantiation of function template specialization 'run_all_pairs_collapsed_step<float>' requested here
   56 |       return run_simulation<T>(arguments, system, run_all_pairs_collapsed_step<T>);
      |                                                   ^
./src/main.cpp:65:9: note: in instantiation of function template specialization 'run_precision<float>' requested here
   65 |         run_precision<float>(arguments);
      |         ^
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:202:5: note: candidate function [with _OutputIterator = std::ranges::iota_view<unsigned long>::_Iterator, _Distance = __int128] has been explicitly deleted
  202 |     __advance(_OutputIterator&, _Distance, output_iterator_tag) = delete;
      |     ^
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:157:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'input_iterator_tag' for 3rd argument
  157 |     __advance(_InputIterator& __i, _Distance __n, input_iterator_tag)
      |     ^                                             ~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:168:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'bidirectional_iterator_tag' for 3rd argument
  168 |     __advance(_BidirectionalIterator& __i, _Distance __n,
      |     ^
  169 |               bidirectional_iterator_tag)
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/stl_iterator_base_funcs.h:184:5: note: candidate function template not viable: no known conversion from 'typename iterator_traits<_Iterator>::iterator_category' (aka 'std::output_iterator_tag') to 'random_access_iterator_tag' for 3rd argument
  184 |     __advance(_RandomAccessIterator& __i, _Distance __n,
      |     ^
  185 |               random_access_iterator_tag)
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ./src/main.cpp:8:
In file included from ./src/all_pairs.h:4:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/execution:34:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/pstl.hpp:31:
In file included from /opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:39:
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:117:27: error: no matching function for call to object of type 'const (lambda at ./src/kernels.h:214:9)'
  117 |                           f(*it);
      |                           ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/algorithms/algorithm.hpp:114:25: note: while substituting into a lambda expression here
  114 |                         [=](sycl::id<1> id) {
      |                         ^
/opt/adaptivecpp/bin/../include/AdaptiveCpp/hipSYCL/std/stdpar/pstl-impl/algorithm.hpp:66:26: note: in instantiation of function template specialization 'hipsycl::algorithms::for_each_n<std::ranges::iota_view<unsigned int, unsigned int>::_Iterator, unsigned int, (lambda at ./src/kernels.h:214:9)>' requested here
   66 |     hipsycl::algorithms::for_each_n(queue, first, n, f);
      |                          ^
./src/kernels.h:211:10: note: in instantiation of function template specialization 'std::for_each_n<std::ranges::iota_view<unsigned int, unsigned int>::_Iterator, unsigned int, (lambda at ./src/kernels.h:214:9)>' requested here
  211 |     std::for_each_n(
      |          ^
./src/barnes_hut.h:60:9: note: in instantiation of function template specialization 'barnes_hut_step<double, unsigned int>' requested here
   60 |         barnes_hut_step<T, Index_t>(system, arguments, tree, step == 0);
      |         ^
./src/main.cpp:52:51: note: in instantiation of function template specialization 'run_barnes_hut<double>' requested here
   52 |       return run_simulation<T>(arguments, system, run_barnes_hut<T>);
      |                                                   ^
./src/main.cpp:67:9: note: in instantiation of function template specialization 'run_precision<double>' requested here
   67 |         run_precision<double>(arguments);
      |         ^
./src/kernels.h:214:9: note: candidate function template not viable: 'this' argument has type 'const (lambda at ./src/kernels.h:214:9)', but method is not marked const
  214 |         [tree] (auto tree_index) mutable { tree.clear(tree_index); });
      |         ^
1 warning and 4 errors generated.

Note that there is some duplication of errors due to float and double template instantiation.

Old commit

The code can work with a slight modification on the first commit. The code compiles but does not run.

Barnes-Hut commit: 6ee611213ad148f3a1c5ae523ff377c698d65ce2

$ ./nbody_acpp -s 5 -n 1000000 --print-info
Starting simulation

Tree init complete
fatal error: error in backend: Cannot select: 0x561e5252a200: ch = AtomicStore<(store release (s32) into %ir.34, addrspace 1)> 0x561e5252a660, 0x561e5252a3c0, Constant:i32<1>
  0x561e5252a3c0: i64 = add 0x561e5252a4a0, 0x561e525257b0
    0x561e5252a4a0: i64,ch = CopyFromReg 0x561e524c7580, Register:i64 %8
      0x561e5252a580: i64 = Register %8
    0x561e525257b0: i64 = shl 0x561e5252a820, Constant:i32<2>
      0x561e5252a820: i64 = AssertZext 0x561e52525820, ValueType:ch:i32
        0x561e52525820: i64,ch = CopyFromReg 0x561e524c7580, Register:i64 %1
          0x561e5252cef0: i64 = Register %1
      0x561e5252a740: i32 = Constant<2>
  0x561e5252a2e0: i32 = Constant<1>
In function: _Z18__acpp_sscp_kernelIN7hipsycl4glue15__sscp_dispatch18basic_parallel_forIZNS0_10algorithms10for_each_nIN9__gnu_cxx17__normal_iteratorIPjSt6vectorIjSaIjEEEEjZ10clear_treeIfjEDaR6SystemIT_E23AtomicQuadTreeContainerISF_T0_EEUlSF_E_EENS0_4sycl5eventERNSM_5queueESF_SJ_T1_EUlNSM_2idILi1EEEE_Li1EEEEvRKSF_
[AdaptiveCpp Error] from /home/tlc/repos/AdaptiveCpp/include/hipSYCL/glue/llvm-sscp/jit.hpp:277 @ compile(): jit::compile: Encountered errors:
0: LLVMToPtx: clang invocation failed with exit code 70

[AdaptiveCpp Error] from /home/tlc/repos/AdaptiveCpp/src/runtime/cuda/cuda_queue.cpp:705 @ submit_sscp_kernel_from_code_object(): cuda_queue: Code object construction failed
Timings:
- Build Tree 613.10 ms
- Calc mass 637.99 ms
- Calc force 913.72 ms
- Calc acceleration 3.77 ms
Tree size: 2883061
Total mass:  1.00000
fatal error: error in backend: Cannot select: 0x562496df2d90: ch = AtomicStore<(store release (s32) into %ir.34, addrspace 1)> 0x562496df31f0, 0x562496df2f50, Constant:i32<1>
  0x562496df2f50: i64 = add 0x562496df3030, 0x562496dee3a0
    0x562496df3030: i64,ch = CopyFromReg 0x562496d90580, Register:i64 %8
      0x562496df3110: i64 = Register %8
    0x562496dee3a0: i64 = shl 0x562496df33b0, Constant:i32<2>
      0x562496df33b0: i64 = AssertZext 0x562496dee410, ValueType:ch:i32
        0x562496dee410: i64,ch = CopyFromReg 0x562496d90580, Register:i64 %1
          0x562496df5a80: i64 = Register %1
      0x562496df32d0: i32 = Constant<2>
  0x562496df2e70: i32 = Constant<1>
In function: _Z18__acpp_sscp_kernelIN7hipsycl4glue15__sscp_dispatch18basic_parallel_forIZNS0_10algorithms10for_each_nIN9__gnu_cxx17__normal_iteratorIPjSt6vectorIjSaIjEEEEjZ10clear_treeIfjEDaR6SystemIT_E23AtomicQuadTreeContainerISF_T0_EEUlSF_E_EENS0_4sycl5eventERNSM_5queueESF_SJ_T1_EUlNSM_2idILi1EEEE_Li1EEEEvRKSF_
[AdaptiveCpp Error] from /home/tlc/repos/AdaptiveCpp/include/hipSYCL/glue/llvm-sscp/jit.hpp:277 @ compile(): jit::compile: Encountered errors:
0: LLVMToPtx: clang invocation failed with exit code 70

[AdaptiveCpp Error] from /home/tlc/repos/AdaptiveCpp/src/runtime/cuda/cuda_queue.cpp:705 @ submit_sscp_kernel_from_code_object(): cuda_queue: Code object construction failed
^C

Changing the last atomic in clear_tree in src/kernels.h from memory_order_release to memory_order_relaxed fixes this issue. When running it, we can see that the force is parallelised but build tree and calc mass are not.

$ ./nbody_acpp -s 5 -n 1000000 --print-info
Starting simulation

Tree init complete
Timings:
- Build Tree 559.06 ms
- Calc mass 637.86 ms
- Calc force 914.19 ms
- Calc acceleration 3.73 ms
Tree size: 2883061
Total mass:  1.00000
Timings:
- Build Tree 495.41 ms
- Calc mass 538.83 ms
- Calc force 911.42 ms
- Calc acceleration 0.39 ms
Tree size: 2882085
Total mass:  1.00000
Timings:
- Build Tree 483.94 ms
- Calc mass 535.91 ms
- Calc force 912.28 ms
- Calc acceleration 0.39 ms
Tree size: 2885153
Total mass:  1.00000
Timings:
- Build Tree 542.24 ms
- Calc mass 536.71 ms
- Calc force 910.77 ms
- Calc acceleration 0.39 ms
Tree size: 2884069
Total mass:  1.00000
Timings:
- Build Tree 487.63 ms
- Calc mass 544.22 ms
- Calc force 911.68 ms
- Calc acceleration 0.46 ms
Tree size: 2885065
Total mass:  1.00000
Done simulation
Total time: 9955.12 ms
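
For readers unfamiliar with the workaround described above (switching the last atomic store in clear_tree from memory_order_release to memory_order_relaxed), here is a minimal sketch of what such a change looks like. The actual clear_tree code is not shown in this thread, so the `Flags` type and the `mark_all` and `count_marked` helpers are hypothetical names used only for illustration; the 32-bit store of the value 1 mirrors the AtomicStore node in the backend error.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag array; the real tree layout in src/kernels.h may differ.
using Flags = std::vector<std::atomic<std::uint32_t>>;

// Stores 1 into every slot. Relaxed ordering keeps each store atomic but
// does not order it against surrounding memory operations.
void mark_all(Flags& flags) {
    for (auto& f : flags) {
        // The workaround in this thread replaces std::memory_order_release
        // with std::memory_order_relaxed on a store of this kind.
        f.store(1u, std::memory_order_relaxed);
    }
}

// Counts how many slots currently hold the value 1.
std::size_t count_marked(const Flags& flags) {
    std::size_t n = 0;
    for (const auto& f : flags) {
        if (f.load(std::memory_order_relaxed) == 1u) ++n;
    }
    return n;
}
```

Calling `mark_all` on a `Flags` of size 5 and then `count_marked` should report 5. Note that relaxed ordering is a weaker guarantee than release, so this is only a safe substitute if the algorithm does not rely on that store to publish earlier writes to other threads; whether that holds for clear_tree is not established here.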

Other things

During this process I also came across this bug: https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1087

illuhad commented 4 months ago

The code can work with a slight modification on the first commit.

The problem is most likely that AdaptiveCpp does not currently support mutable lambdas because SYCL does not allow mutable lambdas. I'm not sure this is something I can fix easily.
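
One possible way around the mutable-lambda restriction, sketched under assumptions: if the captured tree object is a cheap non-owning view that writes through a pointer, its `clear` method can be const-qualified and the lambda no longer needs `mutable`. `TreeView`, `for_each_index` and `cleared_nodes` below are hypothetical names; `for_each_index` only imitates a backend that invokes the functor through a const object, as the `f(*it)` call in the error output does, and is not AdaptiveCpp code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the captured tree container: a non-owning view
// whose methods write through a pointer.
struct TreeView {
    unsigned* nodes;
    // const-qualified: it modifies the pointed-to storage, not the view itself.
    void clear(std::size_t i) const { nodes[i] = 0u; }
};

// Imitates a backend that holds the user functor in a const object, so the
// functor's call operator must be const (assumption for illustration).
template <class F>
void for_each_index(std::size_t n, const F& f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

// Clears every node without using a mutable lambda.
std::vector<unsigned> cleared_nodes(std::size_t n) {
    std::vector<unsigned> storage(n, 7u);
    TreeView tree{storage.data()};
    // Marking this lambda `mutable` would make its operator() non-const, and
    // the call inside for_each_index would then fail to compile.
    for_each_index(n, [tree](std::size_t i) { tree.clear(i); });
    assert(storage.empty() || storage.front() == 0u);
    return storage;
}
```

Whether this applies to the real container depends on how it stores its data; a type that owns its storage by value could not be handled this way.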

The code compiles but does not run.

Is this with https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1481 merged in? The output indicates that std::atomic calls were not correctly remapped to AdaptiveCpp builtins as they should be.

Even with that PR, atomics with ordering other than relaxed might not be handled correctly because LLVM does not handle those. I need to plug in the atomic mappings as PTX inline assembly that @gonzalobg has provided.

EDIT: Support for the needed atomic orderings (but only for the specific operations that we need) is in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1529

we can see that the force is parallelised but build tree and calc mass are not.

This is likely because you don't have https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1518, which enables std::execution::par offloading if the hardware provides the necessary forward progress guarantees (which basically means only recent NVIDIA GPUs). Without that PR, only std::execution::par_unseq is offloaded.