diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

Non-deterministic results with reduce and hard-coded parameter #790

Closed owickstrom closed 3 years ago

owickstrom commented 5 years ago

Hey! First, thanks for working on such a nice language. I'd love to use this instead of writing OpenCL C and host code by hand.

Problem

The following avg.fut program, using reduce,

let avg (n: i64): i64 =
  reduce (+) 0 (1...n) // n

let main (n: i64): i64 = avg n

when provided with numbers on stdin, produces correct results:

$ ./avg
1000
500i64
$ ./avg
300
150i64
$ ./avg
20
10i64

However, if I accept no argument in main, and hard-code n to 1000,

let main: i64 = avg 1000

it returns incorrect and non-deterministic results:

$ ./avg
8143760004829923i64
$ ./avg
7310159008544476i64
$ ./avg
-5411122998131223i64
$ ./avg
-536574746088645i64
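
For reference, the expected value is easy to check by hand: the sum of 1..n is n(n+1)/2, so for n = 1000 the integer average should be 500500 / 1000, rounded down to 500. Below is a minimal sequential sketch of that check in Python; the name `avg_reference` is made up for illustration and is not part of Futhark.

```python
def avg_reference(n: int) -> int:
    """Sequential reference for `reduce (+) 0 (1...n) // n`, for positive n."""
    total = sum(range(1, n + 1))  # equals n * (n + 1) // 2
    # For positive operands, floor division and truncating division agree.
    return total // n


if __name__ == "__main__":
    for n in (20, 300, 1000):
        print(n, avg_reference(n))
```

This reproduces the three correct results listed above (10, 150 and 500), so any other output for a hard-coded 1000 is wrong.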

I've tried messing around with --default-num-groups, thinking that it could be related to #252, but it didn't help.

Setup

Compiled with the latest binary release, using the following commands:

$ futhark --version
Futhark 0.12.0
git: 3f7a59f (Sun Aug 4 20:57:07 2019 +0200)
Copyright (C) DIKU, University of Copenhagen, released under the ISC license.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ futhark opencl avg.fut

OS: Fedora 30
Packages: system ocl-icp and opencl-headers
Device: pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (as reported by clinfo)

Thankful for any help!

athas commented 5 years ago

Strange, I cannot reproduce this locally, and the intermediate code looks fine. Does it also fail when using futhark c? Does it fail with types other than i64? Does the program map (+1) (1...1000) (with i64s) also produce bogus results? I suspect that this is actually a driver issue, since nothing Futhark does that could cause race conditions has anything to do with the element size. However, I don't think parallel processing of 64-bit integers is particularly prevalent in most programs, so it is certainly possible that a Futhark compiler bug is lurking.

I cannot find any information on what driver "ocl-icp" may be. Do you have more information?

athas commented 5 years ago

What platform does clinfo report?

owickstrom commented 5 years ago

Thanks for the quick feedback!

Sorry for the typo, should be ocl-icd. Here's the package manager info:

dnf info ocl-icd-devel

```
$ dnf info ocl-icd-devel
Last metadata expiration check: 0:02:12 ago on Mon Aug 5 08:50:47 2019.
Installed Packages
Name         : ocl-icd-devel
Version      : 2.2.12
Release      : 3.fc30
Architecture : x86_64
Size         : 69 k
Source       : ocl-icd-2.2.12-3.fc30.src.rpm
Repository   : @System
From repo    : fedora
Summary      : Development files for ocl-icd
URL          : https://forge.imag.fr/projects/ocl-icd/
License      : BSD
Description  : This package contains the development files for ocl-icd.
```

Here's the output of clinfo:

clinfo output

```
Number of platforms                               2
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 1.2 pocl 1.3-pre RelWithDebInfo, LLVM 8.0.0, SLEEF, DISTRO, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             POCL

  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 19.1.3
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   Portable Computing Language
Number of devices                                 1
  Device Name                                     pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
  Device Vendor                                   GenuineIntel
  Device Vendor ID                                0x6c636f70
  Device Version                                  OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-skylake
  Driver Version                                  1.3-pre
  Device OpenCL C Version                         OpenCL C 1.2 pocl
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             3500MHz
  Device Partition                                (core)
    Max number of sub-devices                     4
    Supported partition types                     equally, by counts
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             4096x4096x4096
  Max work group size                             4096
  Preferred work group size multiple              8
  Preferred / native vector sizes
    char                                          16 / 16
    short                                         16 / 16
    int                                           8 / 8
    long                                          4 / 4
    half                                          0 / 0 (n/a)
    float                                         8 / 8
    double                                        4 / 4 (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              14422618112 (13.43GiB)
  Error Correction support                        No
  Max memory allocation                           4294967296 (4GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        4194304 (4MiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                128
  Local memory type                               Global
  Local memory size                               2097152 (2MiB)
  Max number of constant args                     8
  Max constant buffer size                        2097152 (2MiB)
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
  printf() buffer size                            16777216 (16MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64

  Platform Name                                   Clover
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Portable Computing Language
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [POCL]
  clCreateContext(NULL, ...) [default]            Success [POCL]
  clCreateContext(NULL, ...) [other]              ”V
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2
```

Gotta go to work, but I'll get back to you ASAP if you need more info.

owickstrom commented 5 years ago

Maybe it's something in pocl?

athas commented 5 years ago

Oh, pocl. I never did manage to get that to work - last time I tried, it just segfaulted. I'll see if I can build it and try again.

Futhark does actually do a slightly naughty thing with reductions to communicate between different groups, but it's done in a way that I'm surprised wouldn't work (and has worked fine on other CPU devices, too).
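
The generated kernel quoted later in this thread suggests what this inter-group communication looks like: each group stores its partial result in a global array, atomically increments a counter, and the group that observes `num_groups - 1` as the old counter value performs the final reduction. Below is a rough Python emulation of that pattern, with threads standing in for work-groups and a lock standing in for `atomic_add`. All names are illustrative and not Futhark's, and the sketch ignores the intra-group reduction entirely.

```python
import threading


def two_stage_sum(n: int, num_groups: int) -> int:
    """Sum 1..n with `num_groups` threads; the last group to finish combines the results."""
    group_results = [0] * num_groups
    counter = 0
    counter_lock = threading.Lock()  # emulates an atomic fetch-and-add
    final = []

    def group(gid: int) -> None:
        nonlocal counter
        # Stage 1: each group sums a strided slice of 1..n.
        partial = sum(range(1 + gid, n + 1, num_groups))
        group_results[gid] = partial  # publish before touching the counter
        with counter_lock:
            old = counter
            counter += 1
        # Stage 2: only the group that arrived last combines the partial results.
        if old == num_groups - 1:
            final.append(sum(group_results))

    threads = [threading.Thread(target=group, args=(g,)) for g in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final[0]


if __name__ == "__main__":
    print(two_stage_sum(1000, 4))
```

In this emulation the lock guarantees that every partial result is visible before the last group reads it. On a real device that ordering depends on the memory fences and atomics behaving as expected.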

athas commented 5 years ago

Although I cannot possibly imagine why a communications issue would only take effect when the input size is known. That is very mysterious.

Does it also produce the wrong result for other constants?

owickstrom commented 5 years ago

> Does it also produce the wrong result for other constants?

Yes, same thing.

athas commented 5 years ago

Could you try passing the option --default-num-groups=1 to the program?

athas commented 5 years ago

Oh, and does it also fail for large constants, say, one billion?

owickstrom commented 5 years ago

> Could you try passing the option --default-num-groups=1 to the program?

No change.

> Oh, and does it also fail for large constants, say, one billion?

Yeah, same thing.

I tried dumping the OpenCL code out from the different versions. There are a few extra arguments passed to a kernel prefixed with segred_nonseg_ in the dynamic version, but I couldn't find anything that was immediately suspicious. Maybe you want to take a look. Full files are listed below the diff.
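
The main structural difference visible in the diff that follows is that the hard-coded version computes `num_groups` inside the kernel instead of receiving it as an argument. The arithmetic can be replayed on the host. The sketch below does so in Python; the group size of 256, the maximum of 128 groups, and the relation `num_threads = group_size * num_groups` are assumptions chosen for illustration, since the real values come from `mainzisegred_group_sizze_*`, `mainzisegred_max_num_groups_*` and the kernel argument at run time.

```python
def num_groups(n: int, group_size: int, max_num_groups: int) -> int:
    """max(1, min(max_num_groups, ceil(n / group_size)))."""
    return max(1, min(max_num_groups, (n + group_size - 1) // group_size))


def chunk_size(n: int, tid: int, group_size: int, groups: int, num_threads: int) -> int:
    """Elements handled by thread `tid`.

    Assumes 0 <= tid < num_threads, so Python's // agrees with a truncating quotient.
    """
    per_thread = (n + group_size * groups - 1) // (group_size * groups)
    remaining = (n - tid + num_threads - 1) // num_threads
    return min(per_thread, remaining)


if __name__ == "__main__":
    GROUP_SIZE = 256      # assumed, not taken from the thread
    MAX_NUM_GROUPS = 128  # assumed, not taken from the thread
    n = 1000
    groups = num_groups(n, GROUP_SIZE, MAX_NUM_GROUPS)
    num_threads = GROUP_SIZE * groups  # assumption about how the kernel is launched
    covered = sum(
        chunk_size(n, t, GROUP_SIZE, groups, num_threads) for t in range(num_threads)
    )
    print(groups, num_threads, covered)
```

With these assumed values the sketch gives 4 groups and 1024 threads, and the per-thread chunks cover exactly 1000 elements. That is consistent with the observation that the arithmetic itself does not look suspicious.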

diff avg.cl avg.1000.cl:

724,729c724,727
< __kernel void segred_nonseg_3694(int32_t num_elems_3668,
<                                  int32_t num_groups_3690, __global
<                                  unsigned char *mem_3701, __global
<                                  unsigned char *counter_mem_3705, __global
<                                  unsigned char *group_res_arr_mem_3707,
<                                  int32_t num_threads_3709)
---
> __kernel void segred_nonseg_3665(__global unsigned char *mem_3672, __global
>                                  unsigned char *counter_mem_3676, __global
>                                  unsigned char *group_res_arr_mem_3678,
>                                  int32_t num_threads_3680)
731c729,735
<     const int32_t segred_group_sizze_3678 = mainzisegred_group_sizze_3677;
---
>     const int32_t segred_group_sizze_3649 = mainzisegred_group_sizze_3648;
>     const int32_t num_groups_3661 = sext_i64_i32(smax64(1,
>                                                         smin64(sext_i32_i64(mainzisegred_max_num_groups_3651),
>                                                                squot64(1000 +
>                                                                        (sext_i32_i64(mainzisegred_group_sizze_3648) -
>                                                                         1),
>                                                                        sext_i32_i64(mainzisegred_group_sizze_3648)))));
736,776c740,780
<     ALIGNED_LOCAL_MEMORY(sync_arr_mem_3715_backing_0, 1);
<     ALIGNED_LOCAL_MEMORY(red_arr_mem_3717_backing_1, 8 *
<                          mainzisegred_group_sizze_3677);
<     
<     int32_t global_tid_3710;
<     int32_t local_tid_3711;
<     int32_t group_sizze_3714;
<     int32_t wave_sizze_3713;
<     int32_t group_tid_3712;
<     
<     global_tid_3710 = get_global_id(0);
<     local_tid_3711 = get_local_id(0);
<     group_sizze_3714 = get_local_size(0);
<     wave_sizze_3713 = LOCKSTEP_WIDTH;
<     group_tid_3712 = get_group_id(0);
<     
<     int32_t phys_tid_3694 = global_tid_3710;
<     __local char *sync_arr_mem_3715;
<     
<     sync_arr_mem_3715 = (__local char *) sync_arr_mem_3715_backing_0;
<     
<     __local char *red_arr_mem_3717;
<     
<     red_arr_mem_3717 = (__local char *) red_arr_mem_3717_backing_1;
<     
<     int32_t dummy_3692 = 0;
<     int32_t gtid_3693;
<     
<     gtid_3693 = 0;
<     
<     int64_t x_acc_3719;
<     int32_t chunk_sizze_3720 = smin32(squot32(num_elems_3668 +
<                                               segred_group_sizze_3678 *
<                                               num_groups_3690 - 1,
<                                               segred_group_sizze_3678 *
<                                               num_groups_3690),
<                                       squot32(num_elems_3668 - phys_tid_3694 +
<                                               num_threads_3709 - 1,
<                                               num_threads_3709));
<     int64_t x_3671;
<     int64_t x_3672;
---
>     ALIGNED_LOCAL_MEMORY(sync_arr_mem_3686_backing_0, 1);
>     ALIGNED_LOCAL_MEMORY(red_arr_mem_3688_backing_1, 8 *
>                          mainzisegred_group_sizze_3648);
>     
>     int32_t global_tid_3681;
>     int32_t local_tid_3682;
>     int32_t group_sizze_3685;
>     int32_t wave_sizze_3684;
>     int32_t group_tid_3683;
>     
>     global_tid_3681 = get_global_id(0);
>     local_tid_3682 = get_local_id(0);
>     group_sizze_3685 = get_local_size(0);
>     wave_sizze_3684 = LOCKSTEP_WIDTH;
>     group_tid_3683 = get_group_id(0);
>     
>     int32_t phys_tid_3665 = global_tid_3681;
>     __local char *sync_arr_mem_3686;
>     
>     sync_arr_mem_3686 = (__local char *) sync_arr_mem_3686_backing_0;
>     
>     __local char *red_arr_mem_3688;
>     
>     red_arr_mem_3688 = (__local char *) red_arr_mem_3688_backing_1;
>     
>     int32_t dummy_3663 = 0;
>     int32_t gtid_3664;
>     
>     gtid_3664 = 0;
>     
>     int64_t x_acc_3690;
>     int32_t chunk_sizze_3691 = smin32(squot32(1000 + segred_group_sizze_3649 *
>                                               num_groups_3661 - 1,
>                                               segred_group_sizze_3649 *
>                                               num_groups_3661), squot32(1000 -
>                                                                         phys_tid_3665 +
>                                                                         num_threads_3680 -
>                                                                         1,
>                                                                         num_threads_3680));
>     int64_t x_3642;
>     int64_t x_3643;
780c784
<         x_acc_3719 = 0;
---
>         x_acc_3690 = 0;
782,783c786,787
<     for (int32_t i_3724 = 0; i_3724 < chunk_sizze_3720; i_3724++) {
<         gtid_3693 = phys_tid_3694 + num_threads_3709 * i_3724;
---
>     for (int32_t i_3695 = 0; i_3695 < chunk_sizze_3691; i_3695++) {
>         gtid_3664 = phys_tid_3665 + num_threads_3680 * i_3695;
786,787c790,791
<             int64_t binop_x_3695 = sext_i32_i64(gtid_3693);
<             int64_t index_primexp_3696 = 1 + binop_x_3695;
---
>             int64_t binop_x_3666 = sext_i32_i64(gtid_3664);
>             int64_t index_primexp_3667 = 1 + binop_x_3666;
793c797
<                 x_3671 = x_acc_3719;
---
>                 x_3642 = x_acc_3690;
797c801
<                 x_3672 = index_primexp_3696;
---
>                 x_3643 = index_primexp_3667;
801c805
<                 int64_t res_3673 = x_3671 + x_3672;
---
>                 int64_t res_3644 = x_3642 + x_3643;
805c809
<                     x_acc_3719 = res_3673;
---
>                     x_acc_3690 = res_3644;
812,813c816,817
<         x_3671 = x_acc_3719;
<         ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
---
>         x_3642 = x_acc_3690;
>         ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
817,820c821,824
<     int32_t offset_3725;
<     int32_t skip_waves_3726;
<     int64_t x_3721;
<     int64_t x_3722;
---
>     int32_t offset_3696;
>     int32_t skip_waves_3697;
>     int64_t x_3692;
>     int64_t x_3693;
822c826
<     offset_3725 = 0;
---
>     offset_3696 = 0;
825,827c829,831
<         if (slt32(local_tid_3711, segred_group_sizze_3678)) {
<             x_3721 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                             offset_3725];
---
>         if (slt32(local_tid_3682, segred_group_sizze_3649)) {
>             x_3692 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                             offset_3696];
830,834c834,838
<     offset_3725 = 1;
<     while (slt32(offset_3725, wave_sizze_3713)) {
<         if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
<             ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) *
<               wave_sizze_3713) & (2 * offset_3725 - 1)) == 0) {
---
>     offset_3696 = 1;
>     while (slt32(offset_3696, wave_sizze_3684)) {
>         if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
>             ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) *
>               wave_sizze_3684) & (2 * offset_3696 - 1)) == 0) {
837,839c841,843
<                 x_3722 = ((volatile __local
<                            int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                         offset_3725];
---
>                 x_3693 = ((volatile __local
>                            int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                         offset_3696];
843c847
<                 int64_t res_3723 = x_3721 + x_3722;
---
>                 int64_t res_3694 = x_3692 + x_3693;
845c849
<                 x_3721 = res_3723;
---
>                 x_3692 = res_3694;
850c854
<                   int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
>                   int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
853c857
<         offset_3725 *= 2;
---
>         offset_3696 *= 2;
855,858c859,862
<     skip_waves_3726 = 1;
<     while (slt32(skip_waves_3726, squot32(segred_group_sizze_3678 +
<                                           wave_sizze_3713 - 1,
<                                           wave_sizze_3713))) {
---
>     skip_waves_3697 = 1;
>     while (slt32(skip_waves_3697, squot32(segred_group_sizze_3649 +
>                                           wave_sizze_3684 - 1,
>                                           wave_sizze_3684))) {
860,865c864,869
<         offset_3725 = skip_waves_3726 * wave_sizze_3713;
<         if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
<             ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) *
<               wave_sizze_3713) == 0 && (squot32(local_tid_3711,
<                                                 wave_sizze_3713) & (2 *
<                                                                     skip_waves_3726 -
---
>         offset_3696 = skip_waves_3697 * wave_sizze_3684;
>         if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
>             ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) *
>               wave_sizze_3684) == 0 && (squot32(local_tid_3682,
>                                                 wave_sizze_3684) & (2 *
>                                                                     skip_waves_3697 -
869,870c873,874
<                 x_3722 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                                 offset_3725];
---
>                 x_3693 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                                 offset_3696];
874c878
<                 int64_t res_3723 = x_3721 + x_3722;
---
>                 int64_t res_3694 = x_3692 + x_3693;
876c880
<                 x_3721 = res_3723;
---
>                 x_3692 = res_3694;
880c884
<                 ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
>                 ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
883c887
<         skip_waves_3726 *= 2;
---
>         skip_waves_3697 *= 2;
888,889c892,893
<         if (local_tid_3711 == 0) {
<             x_acc_3719 = x_3721;
---
>         if (local_tid_3682 == 0) {
>             x_acc_3690 = x_3692;
893c897
<     int32_t old_counter_3727;
---
>     int32_t old_counter_3698;
897,900c901,904
<         if (local_tid_3711 == 0) {
<             ((__global int64_t *) group_res_arr_mem_3707)[group_tid_3712 *
<                                                           segred_group_sizze_3678] =
<                 x_acc_3719;
---
>         if (local_tid_3682 == 0) {
>             ((__global int64_t *) group_res_arr_mem_3678)[group_tid_3683 *
>                                                           segred_group_sizze_3649] =
>                 x_acc_3690;
902,903c906,907
<             old_counter_3727 = atomic_add(&((volatile __global
<                                              int *) counter_mem_3705)[0],
---
>             old_counter_3698 = atomic_add(&((volatile __global
>                                              int *) counter_mem_3676)[0],
905,906c909,910
<             ((__local bool *) sync_arr_mem_3715)[0] = old_counter_3727 ==
<                 num_groups_3690 - 1;
---
>             ((__local bool *) sync_arr_mem_3686)[0] = old_counter_3698 ==
>                 num_groups_3661 - 1;
911c915
<     bool is_last_group_3728 = ((__local bool *) sync_arr_mem_3715)[0];
---
>     bool is_last_group_3699 = ((__local bool *) sync_arr_mem_3686)[0];
913,917c917,921
<     if (is_last_group_3728) {
<         if (local_tid_3711 == 0) {
<             old_counter_3727 = atomic_add(&((volatile __global
<                                              int *) counter_mem_3705)[0],
<                                           (int) (0 - num_groups_3690));
---
>     if (is_last_group_3699) {
>         if (local_tid_3682 == 0) {
>             old_counter_3698 = atomic_add(&((volatile __global
>                                              int *) counter_mem_3676)[0],
>                                           (int) (0 - num_groups_3661));
921,924c925,928
<             if (slt32(local_tid_3711, num_groups_3690)) {
<                 x_3671 = ((__global
<                            int64_t *) group_res_arr_mem_3707)[local_tid_3711 *
<                                                               segred_group_sizze_3678];
---
>             if (slt32(local_tid_3682, num_groups_3661)) {
>                 x_3642 = ((__global
>                            int64_t *) group_res_arr_mem_3678)[local_tid_3682 *
>                                                               segred_group_sizze_3649];
926c930
<                 x_3671 = 0;
---
>                 x_3642 = 0;
928c932
<             ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
---
>             ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
933,936c937,940
<             int32_t offset_3729;
<             int32_t skip_waves_3730;
<             int64_t x_3721;
<             int64_t x_3722;
---
>             int32_t offset_3700;
>             int32_t skip_waves_3701;
>             int64_t x_3692;
>             int64_t x_3693;
938c942
<             offset_3729 = 0;
---
>             offset_3700 = 0;
941,944c945,948
<                 if (slt32(local_tid_3711, segred_group_sizze_3678)) {
<                     x_3721 = ((__local
<                                int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                             offset_3729];
---
>                 if (slt32(local_tid_3682, segred_group_sizze_3649)) {
>                     x_3692 = ((__local
>                                int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                             offset_3700];
947,954c951,958
<             offset_3729 = 1;
<             while (slt32(offset_3729, wave_sizze_3713)) {
<                 if (slt32(local_tid_3711 + offset_3729,
<                           segred_group_sizze_3678) && ((local_tid_3711 -
<                                                         squot32(local_tid_3711,
<                                                                 wave_sizze_3713) *
<                                                         wave_sizze_3713) & (2 *
<                                                                             offset_3729 -
---
>             offset_3700 = 1;
>             while (slt32(offset_3700, wave_sizze_3684)) {
>                 if (slt32(local_tid_3682 + offset_3700,
>                           segred_group_sizze_3649) && ((local_tid_3682 -
>                                                         squot32(local_tid_3682,
>                                                                 wave_sizze_3684) *
>                                                         wave_sizze_3684) & (2 *
>                                                                             offset_3700 -
959,961c963,965
<                         x_3722 = ((volatile __local
<                                    int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                                 offset_3729];
---
>                         x_3693 = ((volatile __local
>                                    int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                                 offset_3700];
965c969
<                         int64_t res_3723 = x_3721 + x_3722;
---
>                         int64_t res_3694 = x_3692 + x_3693;
967c971
<                         x_3721 = res_3723;
---
>                         x_3692 = res_3694;
972c976
<                           int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
>                           int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
975c979
<                 offset_3729 *= 2;
---
>                 offset_3700 *= 2;
977,980c981,984
<             skip_waves_3730 = 1;
<             while (slt32(skip_waves_3730, squot32(segred_group_sizze_3678 +
<                                                   wave_sizze_3713 - 1,
<                                                   wave_sizze_3713))) {
---
>             skip_waves_3701 = 1;
>             while (slt32(skip_waves_3701, squot32(segred_group_sizze_3649 +
>                                                   wave_sizze_3684 - 1,
>                                                   wave_sizze_3684))) {
982,990c986,994
<                 offset_3729 = skip_waves_3730 * wave_sizze_3713;
<                 if (slt32(local_tid_3711 + offset_3729,
<                           segred_group_sizze_3678) && ((local_tid_3711 -
<                                                         squot32(local_tid_3711,
<                                                                 wave_sizze_3713) *
<                                                         wave_sizze_3713) == 0 &&
<                                                        (squot32(local_tid_3711,
<                                                                 wave_sizze_3713) &
<                                                         (2 * skip_waves_3730 -
---
>                 offset_3700 = skip_waves_3701 * wave_sizze_3684;
>                 if (slt32(local_tid_3682 + offset_3700,
>                           segred_group_sizze_3649) && ((local_tid_3682 -
>                                                         squot32(local_tid_3682,
>                                                                 wave_sizze_3684) *
>                                                         wave_sizze_3684) == 0 &&
>                                                        (squot32(local_tid_3682,
>                                                                 wave_sizze_3684) &
>                                                         (2 * skip_waves_3701 -
994,996c998,1000
<                         x_3722 = ((__local
<                                    int64_t *) red_arr_mem_3717)[local_tid_3711 +
<                                                                 offset_3729];
---
>                         x_3693 = ((__local
>                                    int64_t *) red_arr_mem_3688)[local_tid_3682 +
>                                                                 offset_3700];
1000c1004
<                         int64_t res_3723 = x_3721 + x_3722;
---
>                         int64_t res_3694 = x_3692 + x_3693;
1002c1006
<                         x_3721 = res_3723;
---
>                         x_3692 = res_3694;
1006,1007c1010,1011
<                         ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] =
<                             x_3721;
---
>                         ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] =
>                             x_3692;
1010c1014
<                 skip_waves_3730 *= 2;
---
>                 skip_waves_3701 *= 2;
1014,1015c1018,1019
<                 if (local_tid_3711 == 0) {
<                     ((__global int64_t *) mem_3701)[0] = x_3721;
---
>                 if (local_tid_3682 == 0) {
>                     ((__global int64_t *) mem_3672)[0] = x_3692;
Dynamic version

```opencl
#ifdef cl_clang_storage_class_specifiers
#pragma OPENCL EXTENSION cl_clang_storage_class_specifiers : enable
#endif
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void dummy_kernel(__global unsigned char *dummy, int n)
{
    const int thread_gid = get_global_id(0);
    if (thread_gid >= n)
        return;
}
typedef char int8_t;
typedef short int16_t;
typedef int int32_t;
typedef long int64_t;
typedef uchar uint8_t;
typedef ushort uint16_t;
typedef uint uint32_t;
typedef ulong uint64_t;
#define ALIGNED_LOCAL_MEMORY(m,size) __local int64_t m[((size + 7) & ~7)/8]
#ifdef cl_nv_pragma_unroll
static inline void mem_fence_global() { asm("membar.gl;"); }
#else
static inline void mem_fence_global() { mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); }
#endif
static inline void mem_fence_local() { mem_fence(CLK_LOCAL_MEM_FENCE); }
static inline int8_t add8(int8_t x, int8_t y) { return x + y; }
static inline int16_t add16(int16_t x, int16_t y) { return x + y; }
static inline int32_t add32(int32_t x, int32_t y) { return x + y; }
static inline int64_t add64(int64_t x, int64_t y) { return x + y; }
static inline int8_t sub8(int8_t x, int8_t y) { return x - y; }
static inline int16_t sub16(int16_t x, int16_t y) { return x - y; }
static inline int32_t sub32(int32_t x, int32_t y) { return x - y; }
static inline int64_t sub64(int64_t x, int64_t y) { return x - y; }
static inline int8_t mul8(int8_t x, int8_t y) { return x * y; }
static inline int16_t mul16(int16_t x, int16_t y) { return x * y; }
static inline int32_t mul32(int32_t x, int32_t y) { return x * y; }
static inline int64_t mul64(int64_t x, int64_t y) { return x * y; }
static inline uint8_t udiv8(uint8_t x, uint8_t y) { return x / y; }
static inline uint16_t udiv16(uint16_t x, uint16_t y) { return x / y; }
static inline uint32_t udiv32(uint32_t x, uint32_t y) { return x / y; }
static inline uint64_t udiv64(uint64_t x, uint64_t y) { return x / y; }
static inline uint8_t umod8(uint8_t x, uint8_t y) { return x % y; }
static inline uint16_t umod16(uint16_t x, uint16_t y) { return x % y; }
static inline uint32_t umod32(uint32_t x, uint32_t y) { return x % y; }
static inline uint64_t umod64(uint64_t x, uint64_t y) { return x % y; }
static inline int8_t sdiv8(int8_t x, int8_t y)
{
    int8_t q = x / y;
    int8_t r = x % y;
    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int16_t sdiv16(int16_t x, int16_t y)
{
    int16_t q = x / y;
    int16_t r = x % y;
    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int32_t sdiv32(int32_t x, int32_t y)
{
    int32_t q = x / y;
    int32_t r = x % y;
    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int64_t sdiv64(int64_t x, int64_t y)
{
    int64_t q = x / y;
    int64_t r = x % y;
    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int8_t smod8(int8_t x, int8_t y)
{
    int8_t r = x % y;
    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int16_t smod16(int16_t x, int16_t y)
{
    int16_t r = x % y;
    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int32_t smod32(int32_t x, int32_t y)
{
    int32_t r = x % y;
    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int64_t smod64(int64_t x, int64_t y)
{
    int64_t r = x % y;
    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int8_t squot8(int8_t x, int8_t y) { return x / y; }
static inline int16_t squot16(int16_t x, int16_t y) { return x / y; }
static inline int32_t squot32(int32_t x, int32_t y) { return x / y; }
static inline int64_t squot64(int64_t x, int64_t y) { return x / y; }
static inline int8_t srem8(int8_t x, int8_t y) { return x % y; }
static inline int16_t srem16(int16_t x, int16_t y) { return x % y; }
static inline int32_t srem32(int32_t x, int32_t y) { return x % y; }
static inline int64_t srem64(int64_t x, int64_t y) { return x % y; }
static inline int8_t smin8(int8_t x, int8_t y) { return x < y ? x : y; }
static inline int16_t smin16(int16_t x, int16_t y) { return x < y ? x : y; }
static inline int32_t smin32(int32_t x, int32_t y) { return x < y ? x : y; }
static inline int64_t smin64(int64_t x, int64_t y) { return x < y ? x : y; }
static inline uint8_t umin8(uint8_t x, uint8_t y) { return x < y ? x : y; }
static inline uint16_t umin16(uint16_t x, uint16_t y) { return x < y ? x : y; }
static inline uint32_t umin32(uint32_t x, uint32_t y) { return x < y ? x : y; }
static inline uint64_t umin64(uint64_t x, uint64_t y) { return x < y ? x : y; }
static inline int8_t smax8(int8_t x, int8_t y) { return x < y ? y : x; }
static inline int16_t smax16(int16_t x, int16_t y) { return x < y ? y : x; }
static inline int32_t smax32(int32_t x, int32_t y) { return x < y ? y : x; }
static inline int64_t smax64(int64_t x, int64_t y) { return x < y ? y : x; }
static inline uint8_t umax8(uint8_t x, uint8_t y) { return x < y ? y : x; }
static inline uint16_t umax16(uint16_t x, uint16_t y) { return x < y ? y : x; }
static inline uint32_t umax32(uint32_t x, uint32_t y) { return x < y ? y : x; }
static inline uint64_t umax64(uint64_t x, uint64_t y) { return x < y ? y : x; }
static inline uint8_t shl8(uint8_t x, uint8_t y) { return x << y; }
static inline uint16_t shl16(uint16_t x, uint16_t y) { return x << y; }
static inline uint32_t shl32(uint32_t x, uint32_t y) { return x << y; }
static inline uint64_t shl64(uint64_t x, uint64_t y) { return x << y; }
static inline uint8_t lshr8(uint8_t x, uint8_t y) { return x >> y; }
static inline uint16_t lshr16(uint16_t x, uint16_t y) { return x >> y; }
static inline uint32_t lshr32(uint32_t x, uint32_t y) { return x >> y; }
static inline uint64_t lshr64(uint64_t x, uint64_t y) { return x >> y; }
static inline int8_t ashr8(int8_t x, int8_t y) { return x >> y; }
static inline int16_t ashr16(int16_t x, int16_t y) { return x >> y; }
static inline int32_t ashr32(int32_t x, int32_t y) { return x >> y; }
static inline int64_t ashr64(int64_t x, int64_t y) { return x >> y; }
static inline uint8_t and8(uint8_t x, uint8_t y) { return x & y; }
static inline uint16_t and16(uint16_t x, uint16_t y) { return x & y; }
static inline uint32_t and32(uint32_t x, uint32_t y) { return x & y; }
static inline uint64_t and64(uint64_t x, uint64_t y) { return x & y; }
static inline uint8_t or8(uint8_t x, uint8_t y) { return x | y; }
static inline uint16_t or16(uint16_t x, uint16_t y) { return x | y; }
static inline uint32_t or32(uint32_t x, uint32_t y) { return x | y; }
static inline uint64_t or64(uint64_t x, uint64_t y) { return x | y; }
static inline uint8_t xor8(uint8_t x, uint8_t y) { return x ^ y; }
static inline uint16_t xor16(uint16_t x, uint16_t y) { return x ^ y; }
static inline uint32_t xor32(uint32_t x, uint32_t y) { return x ^ y; }
static inline uint64_t xor64(uint64_t x, uint64_t y) { return x ^ y; }
static inline char ult8(uint8_t x, uint8_t y) { return x < y; }
static inline char ult16(uint16_t x, uint16_t y) { return x < y; }
static inline char ult32(uint32_t x, uint32_t y) { return x < y; }
static inline char ult64(uint64_t x, uint64_t y) { return x < y; }
static inline char ule8(uint8_t x,
uint8_t y) { return x <= y; } static inline char ule16(uint16_t x, uint16_t y) { return x <= y; } static inline char ule32(uint32_t x, uint32_t y) { return x <= y; } static inline char ule64(uint64_t x, uint64_t y) { return x <= y; } static inline char slt8(int8_t x, int8_t y) { return x < y; } static inline char slt16(int16_t x, int16_t y) { return x < y; } static inline char slt32(int32_t x, int32_t y) { return x < y; } static inline char slt64(int64_t x, int64_t y) { return x < y; } static inline char sle8(int8_t x, int8_t y) { return x <= y; } static inline char sle16(int16_t x, int16_t y) { return x <= y; } static inline char sle32(int32_t x, int32_t y) { return x <= y; } static inline char sle64(int64_t x, int64_t y) { return x <= y; } static inline int8_t pow8(int8_t x, int8_t y) { int8_t res = 1, rem = y; while (rem != 0) { if (rem & 1) res *= x; rem >>= 1; x *= x; } return res; } static inline int16_t pow16(int16_t x, int16_t y) { int16_t res = 1, rem = y; while (rem != 0) { if (rem & 1) res *= x; rem >>= 1; x *= x; } return res; } static inline int32_t pow32(int32_t x, int32_t y) { int32_t res = 1, rem = y; while (rem != 0) { if (rem & 1) res *= x; rem >>= 1; x *= x; } return res; } static inline int64_t pow64(int64_t x, int64_t y) { int64_t res = 1, rem = y; while (rem != 0) { if (rem & 1) res *= x; rem >>= 1; x *= x; } return res; } static inline bool itob_i8_bool(int8_t x) { return x; } static inline bool itob_i16_bool(int16_t x) { return x; } static inline bool itob_i32_bool(int32_t x) { return x; } static inline bool itob_i64_bool(int64_t x) { return x; } static inline int8_t btoi_bool_i8(bool x) { return x; } static inline int16_t btoi_bool_i16(bool x) { return x; } static inline int32_t btoi_bool_i32(bool x) { return x; } static inline int64_t btoi_bool_i64(bool x) { return x; } #define sext_i8_i8(x) ((int8_t) (int8_t) x) #define sext_i8_i16(x) ((int16_t) (int8_t) x) #define sext_i8_i32(x) ((int32_t) (int8_t) x) #define sext_i8_i64(x) ((int64_t) 
(int8_t) x) #define sext_i16_i8(x) ((int8_t) (int16_t) x) #define sext_i16_i16(x) ((int16_t) (int16_t) x) #define sext_i16_i32(x) ((int32_t) (int16_t) x) #define sext_i16_i64(x) ((int64_t) (int16_t) x) #define sext_i32_i8(x) ((int8_t) (int32_t) x) #define sext_i32_i16(x) ((int16_t) (int32_t) x) #define sext_i32_i32(x) ((int32_t) (int32_t) x) #define sext_i32_i64(x) ((int64_t) (int32_t) x) #define sext_i64_i8(x) ((int8_t) (int64_t) x) #define sext_i64_i16(x) ((int16_t) (int64_t) x) #define sext_i64_i32(x) ((int32_t) (int64_t) x) #define sext_i64_i64(x) ((int64_t) (int64_t) x) #define zext_i8_i8(x) ((uint8_t) (uint8_t) x) #define zext_i8_i16(x) ((uint16_t) (uint8_t) x) #define zext_i8_i32(x) ((uint32_t) (uint8_t) x) #define zext_i8_i64(x) ((uint64_t) (uint8_t) x) #define zext_i16_i8(x) ((uint8_t) (uint16_t) x) #define zext_i16_i16(x) ((uint16_t) (uint16_t) x) #define zext_i16_i32(x) ((uint32_t) (uint16_t) x) #define zext_i16_i64(x) ((uint64_t) (uint16_t) x) #define zext_i32_i8(x) ((uint8_t) (uint32_t) x) #define zext_i32_i16(x) ((uint16_t) (uint32_t) x) #define zext_i32_i32(x) ((uint32_t) (uint32_t) x) #define zext_i32_i64(x) ((uint64_t) (uint32_t) x) #define zext_i64_i8(x) ((uint8_t) (uint64_t) x) #define zext_i64_i16(x) ((uint16_t) (uint64_t) x) #define zext_i64_i32(x) ((uint32_t) (uint64_t) x) #define zext_i64_i64(x) ((uint64_t) (uint64_t) x) static inline float fdiv32(float x, float y) { return x / y; } static inline float fadd32(float x, float y) { return x + y; } static inline float fsub32(float x, float y) { return x - y; } static inline float fmul32(float x, float y) { return x * y; } static inline float fmin32(float x, float y) { return x < y ? x : y; } static inline float fmax32(float x, float y) { return x < y ? 
y : x; } static inline float fpow32(float x, float y) { return pow(x, y); } static inline char cmplt32(float x, float y) { return x < y; } static inline char cmple32(float x, float y) { return x <= y; } static inline float sitofp_i8_f32(int8_t x) { return x; } static inline float sitofp_i16_f32(int16_t x) { return x; } static inline float sitofp_i32_f32(int32_t x) { return x; } static inline float sitofp_i64_f32(int64_t x) { return x; } static inline float uitofp_i8_f32(uint8_t x) { return x; } static inline float uitofp_i16_f32(uint16_t x) { return x; } static inline float uitofp_i32_f32(uint32_t x) { return x; } static inline float uitofp_i64_f32(uint64_t x) { return x; } static inline int8_t fptosi_f32_i8(float x) { return x; } static inline int16_t fptosi_f32_i16(float x) { return x; } static inline int32_t fptosi_f32_i32(float x) { return x; } static inline int64_t fptosi_f32_i64(float x) { return x; } static inline uint8_t fptoui_f32_i8(float x) { return x; } static inline uint16_t fptoui_f32_i16(float x) { return x; } static inline uint32_t fptoui_f32_i32(float x) { return x; } static inline uint64_t fptoui_f32_i64(float x) { return x; } static inline float futrts_log32(float x) { return log(x); } static inline float futrts_log2_32(float x) { return log2(x); } static inline float futrts_log10_32(float x) { return log10(x); } static inline float futrts_sqrt32(float x) { return sqrt(x); } static inline float futrts_exp32(float x) { return exp(x); } static inline float futrts_cos32(float x) { return cos(x); } static inline float futrts_sin32(float x) { return sin(x); } static inline float futrts_tan32(float x) { return tan(x); } static inline float futrts_acos32(float x) { return acos(x); } static inline float futrts_asin32(float x) { return asin(x); } static inline float futrts_atan32(float x) { return atan(x); } static inline float futrts_atan2_32(float x, float y) { return atan2(x, y); } static inline float futrts_gamma32(float x) { return tgamma(x); } 
static inline float futrts_lgamma32(float x) { return lgamma(x); }
static inline float futrts_round32(float x) { return rint(x); }
static inline char futrts_isnan32(float x) { return isnan(x); }
static inline char futrts_isinf32(float x) { return isinf(x); }
static inline int32_t futrts_to_bits32(float x)
{
    union { float f; int32_t t; } p;

    p.f = x;
    return p.t;
}
static inline float futrts_from_bits32(int32_t x)
{
    union { int32_t f; float t; } p;

    p.f = x;
    return p.t;
}

__kernel void segred_nonseg_3694(int32_t num_elems_3668,
                                 int32_t num_groups_3690,
                                 __global unsigned char *mem_3701,
                                 __global unsigned char *counter_mem_3705,
                                 __global unsigned char *group_res_arr_mem_3707,
                                 int32_t num_threads_3709)
{
    const int32_t segred_group_sizze_3678 = mainzisegred_group_sizze_3677;
    const int block_dim0 = 0;
    const int block_dim1 = 1;
    const int block_dim2 = 2;

    ALIGNED_LOCAL_MEMORY(sync_arr_mem_3715_backing_0, 1);
    ALIGNED_LOCAL_MEMORY(red_arr_mem_3717_backing_1, 8 * mainzisegred_group_sizze_3677);

    int32_t global_tid_3710;
    int32_t local_tid_3711;
    int32_t group_sizze_3714;
    int32_t wave_sizze_3713;
    int32_t group_tid_3712;

    global_tid_3710 = get_global_id(0);
    local_tid_3711 = get_local_id(0);
    group_sizze_3714 = get_local_size(0);
    wave_sizze_3713 = LOCKSTEP_WIDTH;
    group_tid_3712 = get_group_id(0);

    int32_t phys_tid_3694 = global_tid_3710;
    __local char *sync_arr_mem_3715;

    sync_arr_mem_3715 = (__local char *) sync_arr_mem_3715_backing_0;

    __local char *red_arr_mem_3717;

    red_arr_mem_3717 = (__local char *) red_arr_mem_3717_backing_1;

    int32_t dummy_3692 = 0;
    int32_t gtid_3693;

    gtid_3693 = 0;

    int64_t x_acc_3719;
    int32_t chunk_sizze_3720 =
        smin32(squot32(num_elems_3668 + segred_group_sizze_3678 * num_groups_3690 - 1,
                       segred_group_sizze_3678 * num_groups_3690),
               squot32(num_elems_3668 - phys_tid_3694 + num_threads_3709 - 1,
                       num_threads_3709));
    int64_t x_3671;
    int64_t x_3672;

    // neutral-initialise the accumulators
    {
        x_acc_3719 = 0;
    }
    for (int32_t i_3724 = 0; i_3724 < chunk_sizze_3720; i_3724++) {
        gtid_3693 = phys_tid_3694 + num_threads_3709 * i_3724;
        // apply map function
        {
            int64_t binop_x_3695 = sext_i32_i64(gtid_3693);
            int64_t index_primexp_3696 = 1 + binop_x_3695;

            // save map-out results
            { }
            // load accumulator
            {
                x_3671 = x_acc_3719;
            }
            // load new values
            {
                x_3672 = index_primexp_3696;
            }
            // apply reduction operator
            {
                int64_t res_3673 = x_3671 + x_3672;

                // store in accumulator
                {
                    x_acc_3719 = res_3673;
                }
            }
        }
    }
    // to reduce current chunk, first store our result in memory
    {
        x_3671 = x_acc_3719;
        ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int32_t offset_3725;
    int32_t skip_waves_3726;
    int64_t x_3721;
    int64_t x_3722;

    offset_3725 = 0;
    // participating threads read initial accumulator
    {
        if (slt32(local_tid_3711, segred_group_sizze_3678)) {
            x_3721 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3725];
        }
    }
    offset_3725 = 1;
    while (slt32(offset_3725, wave_sizze_3713)) {
        if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
            ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) * wave_sizze_3713) &
             (2 * offset_3725 - 1)) == 0) {
            // read array element
            {
                x_3722 = ((volatile __local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3725];
            }
            // apply reduction operation
            {
                int64_t res_3723 = x_3721 + x_3722;

                x_3721 = res_3723;
            }
            // write result of operation
            {
                ((volatile __local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
            }
        }
        offset_3725 *= 2;
    }
    skip_waves_3726 = 1;
    while (slt32(skip_waves_3726,
                 squot32(segred_group_sizze_3678 + wave_sizze_3713 - 1, wave_sizze_3713))) {
        barrier(CLK_LOCAL_MEM_FENCE);
        offset_3725 = skip_waves_3726 * wave_sizze_3713;
        if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
            ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) * wave_sizze_3713) == 0 &&
             (squot32(local_tid_3711, wave_sizze_3713) & (2 * skip_waves_3726 - 1)) == 0)) {
            // read array element
            {
                x_3722 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3725];
            }
            // apply reduction operation
            {
                int64_t res_3723 = x_3721 + x_3722;

                x_3721 = res_3723;
            }
            // write result of operation
            {
                ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
            }
        }
        skip_waves_3726 *= 2;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // first thread saves the result in accumulator
    {
        if (local_tid_3711 == 0) {
            x_acc_3719 = x_3721;
        }
    }

    int32_t old_counter_3727;

    // first thread in group saves group result to global memory
    {
        if (local_tid_3711 == 0) {
            ((__global int64_t *) group_res_arr_mem_3707)[group_tid_3712 * segred_group_sizze_3678] = x_acc_3719;
            mem_fence_global();
            old_counter_3727 = atomic_add(&((volatile __global int *) counter_mem_3705)[0], (int) 1);
            ((__local bool *) sync_arr_mem_3715)[0] = old_counter_3727 == num_groups_3690 - 1;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    bool is_last_group_3728 = ((__local bool *) sync_arr_mem_3715)[0];

    if (is_last_group_3728) {
        if (local_tid_3711 == 0) {
            old_counter_3727 = atomic_add(&((volatile __global int *) counter_mem_3705)[0],
                                          (int) (0 - num_groups_3690));
        }
        // read in the per-group-results
        {
            if (slt32(local_tid_3711, num_groups_3690)) {
                x_3671 = ((__global int64_t *) group_res_arr_mem_3707)[local_tid_3711 * segred_group_sizze_3678];
            } else {
                x_3671 = 0;
            }
            ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        // reduce the per-group results
        {
            int32_t offset_3729;
            int32_t skip_waves_3730;
            int64_t x_3721;
            int64_t x_3722;

            offset_3729 = 0;
            // participating threads read initial accumulator
            {
                if (slt32(local_tid_3711, segred_group_sizze_3678)) {
                    x_3721 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3729];
                }
            }
            offset_3729 = 1;
            while (slt32(offset_3729, wave_sizze_3713)) {
                if (slt32(local_tid_3711 + offset_3729, segred_group_sizze_3678) &&
                    ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) * wave_sizze_3713) &
                     (2 * offset_3729 - 1)) == 0) {
                    // read array element
                    {
                        x_3722 = ((volatile __local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3729];
                    }
                    // apply reduction operation
                    {
                        int64_t res_3723 = x_3721 + x_3722;

                        x_3721 = res_3723;
                    }
                    // write result of operation
                    {
                        ((volatile __local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
                    }
                }
                offset_3729 *= 2;
            }
            skip_waves_3730 = 1;
            while (slt32(skip_waves_3730,
                         squot32(segred_group_sizze_3678 + wave_sizze_3713 - 1, wave_sizze_3713))) {
                barrier(CLK_LOCAL_MEM_FENCE);
                offset_3729 = skip_waves_3730 * wave_sizze_3713;
                if (slt32(local_tid_3711 + offset_3729, segred_group_sizze_3678) &&
                    ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) * wave_sizze_3713) == 0 &&
                     (squot32(local_tid_3711, wave_sizze_3713) & (2 * skip_waves_3730 - 1)) == 0)) {
                    // read array element
                    {
                        x_3722 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 + offset_3729];
                    }
                    // apply reduction operation
                    {
                        int64_t res_3723 = x_3721 + x_3722;

                        x_3721 = res_3723;
                    }
                    // write result of operation
                    {
                        ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
                    }
                }
                skip_waves_3730 *= 2;
            }
            // and back to memory with the final result
            {
                if (local_tid_3711 == 0) {
                    ((__global int64_t *) mem_3701)[0] = x_3721;
                }
            }
        }
    }
}
```
Hard-coded version

```opencl
#ifdef cl_clang_storage_class_specifiers
#pragma OPENCL EXTENSION cl_clang_storage_class_specifiers : enable
#endif
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void dummy_kernel(__global unsigned char *dummy, int n)
{
    const int thread_gid = get_global_id(0);

    if (thread_gid >= n)
        return;
}

typedef char int8_t;
typedef short int16_t;
typedef int int32_t;
typedef long int64_t;
typedef uchar uint8_t;
typedef ushort uint16_t;
typedef uint uint32_t;
typedef ulong uint64_t;

#define ALIGNED_LOCAL_MEMORY(m,size) __local int64_t m[((size + 7) & ~7)/8]

#ifdef cl_nv_pragma_unroll
static inline void mem_fence_global() { asm("membar.gl;"); }
#else
static inline void mem_fence_global() { mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); }
#endif
static inline void mem_fence_local() { mem_fence(CLK_LOCAL_MEM_FENCE); }

static inline int8_t add8(int8_t x, int8_t y) { return x + y; }
static inline int16_t add16(int16_t x, int16_t y) { return x + y; }
static inline int32_t add32(int32_t x, int32_t y) { return x + y; }
static inline int64_t add64(int64_t x, int64_t y) { return x + y; }
static inline int8_t sub8(int8_t x, int8_t y) { return x - y; }
static inline int16_t sub16(int16_t x, int16_t y) { return x - y; }
static inline int32_t sub32(int32_t x, int32_t y) { return x - y; }
static inline int64_t sub64(int64_t x, int64_t y) { return x - y; }
static inline int8_t mul8(int8_t x, int8_t y) { return x * y; }
static inline int16_t mul16(int16_t x, int16_t y) { return x * y; }
static inline int32_t mul32(int32_t x, int32_t y) { return x * y; }
static inline int64_t mul64(int64_t x, int64_t y) { return x * y; }
static inline uint8_t udiv8(uint8_t x, uint8_t y) { return x / y; }
static inline uint16_t udiv16(uint16_t x, uint16_t y) { return x / y; }
static inline uint32_t udiv32(uint32_t x, uint32_t y) { return x / y; }
static inline uint64_t udiv64(uint64_t x, uint64_t y) { return x / y; }
static inline uint8_t umod8(uint8_t x, uint8_t y) { return x % y; }
static inline uint16_t umod16(uint16_t x, uint16_t y) { return x % y; }
static inline uint32_t umod32(uint32_t x, uint32_t y) { return x % y; }
static inline uint64_t umod64(uint64_t x, uint64_t y) { return x % y; }

static inline int8_t sdiv8(int8_t x, int8_t y)
{
    int8_t q = x / y;
    int8_t r = x % y;

    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int16_t sdiv16(int16_t x, int16_t y)
{
    int16_t q = x / y;
    int16_t r = x % y;

    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int32_t sdiv32(int32_t x, int32_t y)
{
    int32_t q = x / y;
    int32_t r = x % y;

    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int64_t sdiv64(int64_t x, int64_t y)
{
    int64_t q = x / y;
    int64_t r = x % y;

    return q - ((r != 0 && r < 0 != y < 0) ? 1 : 0);
}
static inline int8_t smod8(int8_t x, int8_t y)
{
    int8_t r = x % y;

    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int16_t smod16(int16_t x, int16_t y)
{
    int16_t r = x % y;

    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int32_t smod32(int32_t x, int32_t y)
{
    int32_t r = x % y;

    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}
static inline int64_t smod64(int64_t x, int64_t y)
{
    int64_t r = x % y;

    return r + (r == 0 || (x > 0 && y > 0) || (x < 0 && y < 0) ? 0 : y);
}

static inline int8_t squot8(int8_t x, int8_t y) { return x / y; }
static inline int16_t squot16(int16_t x, int16_t y) { return x / y; }
static inline int32_t squot32(int32_t x, int32_t y) { return x / y; }
static inline int64_t squot64(int64_t x, int64_t y) { return x / y; }
static inline int8_t srem8(int8_t x, int8_t y) { return x % y; }
static inline int16_t srem16(int16_t x, int16_t y) { return x % y; }
static inline int32_t srem32(int32_t x, int32_t y) { return x % y; }
static inline int64_t srem64(int64_t x, int64_t y) { return x % y; }
static inline int8_t smin8(int8_t x, int8_t y) { return x < y ? x : y; }
static inline int16_t smin16(int16_t x, int16_t y) { return x < y ? x : y; }
static inline int32_t smin32(int32_t x, int32_t y) { return x < y ? x : y; }
static inline int64_t smin64(int64_t x, int64_t y) { return x < y ? x : y; }
static inline uint8_t umin8(uint8_t x, uint8_t y) { return x < y ? x : y; }
static inline uint16_t umin16(uint16_t x, uint16_t y) { return x < y ? x : y; }
static inline uint32_t umin32(uint32_t x, uint32_t y) { return x < y ? x : y; }
static inline uint64_t umin64(uint64_t x, uint64_t y) { return x < y ? x : y; }
static inline int8_t smax8(int8_t x, int8_t y) { return x < y ? y : x; }
static inline int16_t smax16(int16_t x, int16_t y) { return x < y ? y : x; }
static inline int32_t smax32(int32_t x, int32_t y) { return x < y ? y : x; }
static inline int64_t smax64(int64_t x, int64_t y) { return x < y ? y : x; }
static inline uint8_t umax8(uint8_t x, uint8_t y) { return x < y ? y : x; }
static inline uint16_t umax16(uint16_t x, uint16_t y) { return x < y ? y : x; }
static inline uint32_t umax32(uint32_t x, uint32_t y) { return x < y ? y : x; }
static inline uint64_t umax64(uint64_t x, uint64_t y) { return x < y ? y : x; }
static inline uint8_t shl8(uint8_t x, uint8_t y) { return x << y; }
static inline uint16_t shl16(uint16_t x, uint16_t y) { return x << y; }
static inline uint32_t shl32(uint32_t x, uint32_t y) { return x << y; }
static inline uint64_t shl64(uint64_t x, uint64_t y) { return x << y; }
static inline uint8_t lshr8(uint8_t x, uint8_t y) { return x >> y; }
static inline uint16_t lshr16(uint16_t x, uint16_t y) { return x >> y; }
static inline uint32_t lshr32(uint32_t x, uint32_t y) { return x >> y; }
static inline uint64_t lshr64(uint64_t x, uint64_t y) { return x >> y; }
static inline int8_t ashr8(int8_t x, int8_t y) { return x >> y; }
static inline int16_t ashr16(int16_t x, int16_t y) { return x >> y; }
static inline int32_t ashr32(int32_t x, int32_t y) { return x >> y; }
static inline int64_t ashr64(int64_t x, int64_t y) { return x >> y; }
static inline uint8_t and8(uint8_t x, uint8_t y) { return x & y; }
static inline uint16_t and16(uint16_t x, uint16_t y) { return x & y; }
static inline uint32_t and32(uint32_t x, uint32_t y) { return x & y; }
static inline uint64_t and64(uint64_t x, uint64_t y) { return x & y; }
static inline uint8_t or8(uint8_t x, uint8_t y) { return x | y; }
static inline uint16_t or16(uint16_t x, uint16_t y) { return x | y; }
static inline uint32_t or32(uint32_t x, uint32_t y) { return x | y; }
static inline uint64_t or64(uint64_t x, uint64_t y) { return x | y; }
static inline uint8_t xor8(uint8_t x, uint8_t y) { return x ^ y; }
static inline uint16_t xor16(uint16_t x, uint16_t y) { return x ^ y; }
static inline uint32_t xor32(uint32_t x, uint32_t y) { return x ^ y; }
static inline uint64_t xor64(uint64_t x, uint64_t y) { return x ^ y; }
static inline char ult8(uint8_t x, uint8_t y) { return x < y; }
static inline char ult16(uint16_t x, uint16_t y) { return x < y; }
static inline char ult32(uint32_t x, uint32_t y) { return x < y; }
static inline char ult64(uint64_t x, uint64_t y) { return x < y; }
static inline char ule8(uint8_t x, uint8_t y) { return x <= y; }
static inline char ule16(uint16_t x, uint16_t y) { return x <= y; }
static inline char ule32(uint32_t x, uint32_t y) { return x <= y; }
static inline char ule64(uint64_t x, uint64_t y) { return x <= y; }
static inline char slt8(int8_t x, int8_t y) { return x < y; }
static inline char slt16(int16_t x, int16_t y) { return x < y; }
static inline char slt32(int32_t x, int32_t y) { return x < y; }
static inline char slt64(int64_t x, int64_t y) { return x < y; }
static inline char sle8(int8_t x, int8_t y) { return x <= y; }
static inline char sle16(int16_t x, int16_t y) { return x <= y; }
static inline char sle32(int32_t x, int32_t y) { return x <= y; }
static inline char sle64(int64_t x, int64_t y) { return x <= y; }

static inline int8_t pow8(int8_t x, int8_t y)
{
    int8_t res = 1, rem = y;

    while (rem != 0) {
        if (rem & 1)
            res *= x;
        rem >>= 1;
        x *= x;
    }
    return res;
}
static inline int16_t pow16(int16_t x, int16_t y)
{
    int16_t res = 1, rem = y;

    while (rem != 0) {
        if (rem & 1)
            res *= x;
        rem >>= 1;
        x *= x;
    }
    return res;
}
static inline int32_t pow32(int32_t x, int32_t y)
{
    int32_t res = 1, rem = y;

    while (rem != 0) {
        if (rem & 1)
            res *= x;
        rem >>= 1;
        x *= x;
    }
    return res;
}
static inline int64_t pow64(int64_t x, int64_t y)
{
    int64_t res = 1, rem = y;

    while (rem != 0) {
        if (rem & 1)
            res *= x;
        rem >>= 1;
        x *= x;
    }
    return res;
}

static inline bool itob_i8_bool(int8_t x) { return x; }
static inline bool itob_i16_bool(int16_t x) { return x; }
static inline bool itob_i32_bool(int32_t x) { return x; }
static inline bool itob_i64_bool(int64_t x) { return x; }
static inline int8_t btoi_bool_i8(bool x) { return x; }
static inline int16_t btoi_bool_i16(bool x) { return x; }
static inline int32_t btoi_bool_i32(bool x) { return x; }
static inline int64_t btoi_bool_i64(bool x) { return x; }

#define sext_i8_i8(x) ((int8_t) (int8_t) x)
#define sext_i8_i16(x) ((int16_t) (int8_t) x)
#define sext_i8_i32(x) ((int32_t) (int8_t) x)
#define sext_i8_i64(x) ((int64_t) (int8_t) x)
#define sext_i16_i8(x) ((int8_t) (int16_t) x)
#define sext_i16_i16(x) ((int16_t) (int16_t) x)
#define sext_i16_i32(x) ((int32_t) (int16_t) x)
#define sext_i16_i64(x) ((int64_t) (int16_t) x)
#define sext_i32_i8(x) ((int8_t) (int32_t) x)
#define sext_i32_i16(x) ((int16_t) (int32_t) x)
#define sext_i32_i32(x) ((int32_t) (int32_t) x)
#define sext_i32_i64(x) ((int64_t) (int32_t) x)
#define sext_i64_i8(x) ((int8_t) (int64_t) x)
#define sext_i64_i16(x) ((int16_t) (int64_t) x)
#define sext_i64_i32(x) ((int32_t) (int64_t) x)
#define sext_i64_i64(x) ((int64_t) (int64_t) x)
#define zext_i8_i8(x) ((uint8_t) (uint8_t) x)
#define zext_i8_i16(x) ((uint16_t) (uint8_t) x)
#define zext_i8_i32(x) ((uint32_t) (uint8_t) x)
#define zext_i8_i64(x) ((uint64_t) (uint8_t) x)
#define zext_i16_i8(x) ((uint8_t) (uint16_t) x)
#define zext_i16_i16(x) ((uint16_t) (uint16_t) x)
#define zext_i16_i32(x) ((uint32_t) (uint16_t) x)
#define zext_i16_i64(x) ((uint64_t) (uint16_t) x)
#define zext_i32_i8(x) ((uint8_t) (uint32_t) x)
#define zext_i32_i16(x) ((uint16_t) (uint32_t) x)
#define zext_i32_i32(x) ((uint32_t) (uint32_t) x)
#define zext_i32_i64(x) ((uint64_t) (uint32_t) x)
#define zext_i64_i8(x) ((uint8_t) (uint64_t) x)
#define zext_i64_i16(x) ((uint16_t) (uint64_t) x)
#define zext_i64_i32(x) ((uint32_t) (uint64_t) x)
#define zext_i64_i64(x) ((uint64_t) (uint64_t) x)

static inline float fdiv32(float x, float y) { return x / y; }
static inline float fadd32(float x, float y) { return x + y; }
static inline float fsub32(float x, float y) { return x - y; }
static inline float fmul32(float x, float y) { return x * y; }
static inline float fmin32(float x, float y) { return x < y ? x : y; }
static inline float fmax32(float x, float y) { return x < y ? y : x; }
static inline float fpow32(float x, float y) { return pow(x, y); }
static inline char cmplt32(float x, float y) { return x < y; }
static inline char cmple32(float x, float y) { return x <= y; }
static inline float sitofp_i8_f32(int8_t x) { return x; }
static inline float sitofp_i16_f32(int16_t x) { return x; }
static inline float sitofp_i32_f32(int32_t x) { return x; }
static inline float sitofp_i64_f32(int64_t x) { return x; }
static inline float uitofp_i8_f32(uint8_t x) { return x; }
static inline float uitofp_i16_f32(uint16_t x) { return x; }
static inline float uitofp_i32_f32(uint32_t x) { return x; }
static inline float uitofp_i64_f32(uint64_t x) { return x; }
static inline int8_t fptosi_f32_i8(float x) { return x; }
static inline int16_t fptosi_f32_i16(float x) { return x; }
static inline int32_t fptosi_f32_i32(float x) { return x; }
static inline int64_t fptosi_f32_i64(float x) { return x; }
static inline uint8_t fptoui_f32_i8(float x) { return x; }
static inline uint16_t fptoui_f32_i16(float x) { return x; }
static inline uint32_t fptoui_f32_i32(float x) { return x; }
static inline uint64_t fptoui_f32_i64(float x) { return x; }
static inline float futrts_log32(float x) { return log(x); }
static inline float futrts_log2_32(float x) { return log2(x); }
static inline float futrts_log10_32(float x) { return log10(x); }
static inline float futrts_sqrt32(float x) { return sqrt(x); }
static inline float futrts_exp32(float x) { return exp(x); }
static inline float futrts_cos32(float x) { return cos(x); }
static inline float futrts_sin32(float x) { return sin(x); }
static inline float futrts_tan32(float x) { return tan(x); }
static inline float futrts_acos32(float x) { return acos(x); }
static inline float futrts_asin32(float x) { return asin(x); }
static inline float futrts_atan32(float x) { return atan(x); }
static inline float futrts_atan2_32(float x, float y) { return atan2(x, y); }
static inline float futrts_gamma32(float x) { return tgamma(x); }
static inline float futrts_lgamma32(float x) { return lgamma(x); }
static inline float futrts_round32(float x) { return rint(x); }
static inline char futrts_isnan32(float x) { return isnan(x); }
static inline char futrts_isinf32(float x) { return isinf(x); }
static inline int32_t futrts_to_bits32(float x)
{
    union { float f; int32_t t; } p;

    p.f = x;
    return p.t;
}
static inline float futrts_from_bits32(int32_t x)
{
    union { int32_t f; float t; } p;

    p.f = x;
    return p.t;
}

__kernel void segred_nonseg_3665(__global unsigned char *mem_3672,
                                 __global unsigned char *counter_mem_3676,
                                 __global unsigned char *group_res_arr_mem_3678,
                                 int32_t num_threads_3680)
{
    const int32_t segred_group_sizze_3649 = mainzisegred_group_sizze_3648;
    const int32_t num_groups_3661 =
        sext_i64_i32(smax64(1,
                            smin64(sext_i32_i64(mainzisegred_max_num_groups_3651),
                                   squot64(1000 + (sext_i32_i64(mainzisegred_group_sizze_3648) - 1),
                                           sext_i32_i64(mainzisegred_group_sizze_3648)))));
    const int block_dim0 = 0;
    const int block_dim1 = 1;
    const int block_dim2 = 2;

    ALIGNED_LOCAL_MEMORY(sync_arr_mem_3686_backing_0, 1);
    ALIGNED_LOCAL_MEMORY(red_arr_mem_3688_backing_1, 8 * mainzisegred_group_sizze_3648);

    int32_t global_tid_3681;
    int32_t local_tid_3682;
    int32_t group_sizze_3685;
    int32_t wave_sizze_3684;
    int32_t group_tid_3683;

    global_tid_3681 = get_global_id(0);
    local_tid_3682 = get_local_id(0);
    group_sizze_3685 = get_local_size(0);
    wave_sizze_3684 = LOCKSTEP_WIDTH;
    group_tid_3683 = get_group_id(0);

    int32_t phys_tid_3665 = global_tid_3681;
    __local char *sync_arr_mem_3686;

    sync_arr_mem_3686 = (__local char *) sync_arr_mem_3686_backing_0;

    __local char *red_arr_mem_3688;

    red_arr_mem_3688 = (__local char *) red_arr_mem_3688_backing_1;

    int32_t dummy_3663 = 0;
    int32_t gtid_3664;

    gtid_3664 = 0;

    int64_t x_acc_3690;
    int32_t chunk_sizze_3691 =
        smin32(squot32(1000 + segred_group_sizze_3649 * num_groups_3661 - 1,
                       segred_group_sizze_3649 * num_groups_3661),
               squot32(1000 - phys_tid_3665 + num_threads_3680 - 1,
                       num_threads_3680));
    int64_t x_3642;
    int64_t x_3643;

    // neutral-initialise the accumulators
    {
        x_acc_3690 = 0;
    }
    for (int32_t i_3695 = 0; i_3695 < chunk_sizze_3691; i_3695++) {
        gtid_3664 = phys_tid_3665 + num_threads_3680 * i_3695;
        // apply map function
        {
            int64_t binop_x_3666 = sext_i32_i64(gtid_3664);
            int64_t index_primexp_3667 = 1 + binop_x_3666;

            // save map-out results
            { }
            // load accumulator
            {
                x_3642 = x_acc_3690;
            }
            // load new values
            {
                x_3643 = index_primexp_3667;
            }
            // apply reduction operator
            {
                int64_t res_3644 = x_3642 + x_3643;

                // store in accumulator
                {
                    x_acc_3690 = res_3644;
                }
            }
        }
    }
    // to reduce current chunk, first store our result in memory
    {
        x_3642 = x_acc_3690;
        ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int32_t offset_3696;
    int32_t skip_waves_3697;
    int64_t x_3692;
    int64_t x_3693;

    offset_3696 = 0;
    // participating threads read initial accumulator
    {
        if (slt32(local_tid_3682, segred_group_sizze_3649)) {
            x_3692 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3696];
        }
    }
    offset_3696 = 1;
    while (slt32(offset_3696, wave_sizze_3684)) {
        if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
            ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) * wave_sizze_3684) &
             (2 * offset_3696 - 1)) == 0) {
            // read array element
            {
                x_3693 = ((volatile __local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3696];
            }
            // apply reduction operation
            {
                int64_t res_3694 = x_3692 + x_3693;

                x_3692 = res_3694;
            }
            // write result of operation
            {
                ((volatile __local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
            }
        }
        offset_3696 *= 2;
    }
    skip_waves_3697 = 1;
    while (slt32(skip_waves_3697,
                 squot32(segred_group_sizze_3649 + wave_sizze_3684 - 1, wave_sizze_3684))) {
        barrier(CLK_LOCAL_MEM_FENCE);
        offset_3696 = skip_waves_3697 * wave_sizze_3684;
        if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
            ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) * wave_sizze_3684) == 0 &&
             (squot32(local_tid_3682, wave_sizze_3684) & (2 * skip_waves_3697 - 1)) == 0)) {
            // read array element
            {
                x_3693 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3696];
            }
            // apply reduction operation
            {
                int64_t res_3694 = x_3692 + x_3693;

                x_3692 = res_3694;
            }
            // write result of operation
            {
                ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
            }
        }
        skip_waves_3697 *= 2;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // first thread saves the result in accumulator
    {
        if (local_tid_3682 == 0) {
            x_acc_3690 = x_3692;
        }
    }

    int32_t old_counter_3698;

    // first thread in group saves group result to global memory
    {
        if (local_tid_3682 == 0) {
            ((__global int64_t *) group_res_arr_mem_3678)[group_tid_3683 * segred_group_sizze_3649] = x_acc_3690;
            mem_fence_global();
            old_counter_3698 = atomic_add(&((volatile __global int *) counter_mem_3676)[0], (int) 1);
            ((__local bool *) sync_arr_mem_3686)[0] = old_counter_3698 == num_groups_3661 - 1;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    bool is_last_group_3699 = ((__local bool *) sync_arr_mem_3686)[0];

    if (is_last_group_3699) {
        if (local_tid_3682 == 0) {
            old_counter_3698 = atomic_add(&((volatile __global int *) counter_mem_3676)[0],
                                          (int) (0 - num_groups_3661));
        }
        // read in the per-group-results
        {
            if (slt32(local_tid_3682, num_groups_3661)) {
                x_3642 = ((__global int64_t *) group_res_arr_mem_3678)[local_tid_3682 * segred_group_sizze_3649];
            } else {
                x_3642 = 0;
            }
            ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        // reduce the per-group results
        {
            int32_t offset_3700;
            int32_t skip_waves_3701;
            int64_t x_3692;
            int64_t x_3693;

            offset_3700 = 0;
            // participating threads read initial accumulator
            {
                if (slt32(local_tid_3682, segred_group_sizze_3649)) {
                    x_3692 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3700];
                }
            }
            offset_3700 = 1;
            while (slt32(offset_3700, wave_sizze_3684)) {
                if (slt32(local_tid_3682 + offset_3700, segred_group_sizze_3649) &&
                    ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) * wave_sizze_3684) &
                     (2 * offset_3700 - 1)) == 0) {
                    // read array element
                    {
                        x_3693 = ((volatile __local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3700];
                    }
                    // apply reduction operation
                    {
                        int64_t res_3694 = x_3692 + x_3693;

                        x_3692 = res_3694;
                    }
                    // write result of operation
                    {
                        ((volatile __local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
                    }
                }
                offset_3700 *= 2;
            }
            skip_waves_3701 = 1;
            while (slt32(skip_waves_3701,
                         squot32(segred_group_sizze_3649 + wave_sizze_3684 - 1, wave_sizze_3684))) {
                barrier(CLK_LOCAL_MEM_FENCE);
                offset_3700 = skip_waves_3701 * wave_sizze_3684;
                if (slt32(local_tid_3682 + offset_3700, segred_group_sizze_3649) &&
                    ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) * wave_sizze_3684) == 0 &&
                     (squot32(local_tid_3682, wave_sizze_3684) & (2 * skip_waves_3701 - 1)) == 0)) {
                    // read array element
                    {
                        x_3693 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 + offset_3700];
                    }
                    // apply reduction operation
                    {
                        int64_t res_3694 = x_3692 + x_3693;

                        x_3692 = res_3694;
                    }
                    // write result of operation
                    {
                        ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
                    }
                }
                skip_waves_3701 *= 2;
            }
            // and back to memory with the final result
            {
                if (local_tid_3682 == 0) {
                    ((__global int64_t *) mem_3672)[0] = x_3692;
                }
            }
        }
    }
}
```
athas commented 5 years ago

None of that looks particularly dubious. The fact that it also fails with --default-num-groups=1 is a big red flag, because with that configuration there is no cross-group communication, which is the only semi-dubious thing that Futhark does.

Could you try with --default-num-groups=1 --default-group-size=1? This will use only a single GPU thread (so pick a small constant). If this also fails, then it must be a pocl bug.

owickstrom commented 5 years ago

That worked!

athas commented 5 years ago

Drat, that makes the smoking gun less obvious. I still suspect it's a pocl bug when it fails with --default-num-groups=1. If it's a subtle memory coherency issue, then it also makes no sense that using a constant work size would matter. Unfortunately, I'm having some trouble getting pocl installed, but I will take a look eventually.

owickstrom commented 5 years ago

OK. If you come up with anything else I should try, or things you need from my environment, let me know.

athas commented 5 years ago

As a workaround, you can probably use avg (opaque 1000) to hide the constant from the compiler.
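
A minimal sketch of this workaround, assuming the avg.fut program from the original report is otherwise unchanged. `opaque` is the identity function from the Futhark prelude that the compiler treats as a black box, so the constant 1000 should no longer be propagated into the generated kernel. Whether this avoids the problem on other pocl versions is untested.

```futhark
-- avg.fut, with the hard-coded argument hidden behind `opaque`.
let avg (n: i64): i64 =
  -- Sum 1..n in parallel, then divide by n (integer division).
  reduce (+) 0 (1...n) // n

-- `opaque` returns its argument unchanged but blocks constant
-- propagation, so the reduction is compiled as if n were only
-- known at run time.
let main: i64 = avg (opaque 1000)
```

Compiled with futhark opencl avg.fut as before, ./avg takes no input and should print 500i64, matching the version that reads n from stdin.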

owickstrom commented 5 years ago

Indeed, that works.

athas commented 3 years ago

Since this has not come up anywhere but pocl, and Futhark-on-CPU is better served by the soon-to-be-finished multicore backend, I am closing this issue.