Closed owickstrom closed 3 years ago
Strange, I cannot reproduce locally and the intermediate code looks fine. Does it also fail when using futhark c
? Does it fail with other types than i64
? Does the program map (+1) (1...1000)
(with i64
s) also produce bogus results? What I am suspecting is that this is actually a driver issue, since nothing that Futhark does that could cause race conditions has anything to do with the element size. However, I don't think parallel processing 64-bit integers are particularly prevalent in most programs, so it is certainly possible a Futhark compiler bug is lurking.
I cannot find any information on what driver "ocl-icp" may be. Do you have more information?
What platform does clinfo
report?
Thanks for the quick feedback!
futhark c
I get correct resultsmap (+1) (1...1000)
with futhark opencl
works as expectedSorry for the typo, should be ocl-icd
. Here's the package manager info:
Here's the output of clinfo
:
Gotta go to work, but I'll get back to you ASAP if you need more info.
Maybe it's something in pocl
?
Oh, pocl. I never did manage to get that to work - last time I tried, it just segfaulted. I'll see if I can build it and try again.
Futhark does actually do a slightly naughty thing with reductions to communicate between different groups, but it's done in a way that I'm surprised wouldn't work (and has worked fine on other CPU devices, too).
Although I cannot possibly imagine why a communications issue would only take effect when the input size is known. That is very mysterious.
Does it also produce the wrong result for other constants?
Does it also produce the wrong result for other constants?
Yes, same thing.
Could you try passing the option --default-num-groups=1
to the program?
Oh, and does it also fail for large constants, say, one billion?
Could you try passing the option --default-num-groups=1 to the program?
No change.
Oh, and does it also fail for large constants, say, one billion?
Yeah, same thing.
I tried dumping the OpenCL code out from the different versions. There are a few extra arguments passed to a kernel prefixed with segred_nonseg_
in the dynamic version, but I couldn't find anything that was immediately suspicious. Maybe you want to take a look. Full files are listed below the diff.
diff avg.cl avg.1000.cl
:
724,729c724,727
< __kernel void segred_nonseg_3694(int32_t num_elems_3668,
< int32_t num_groups_3690, __global
< unsigned char *mem_3701, __global
< unsigned char *counter_mem_3705, __global
< unsigned char *group_res_arr_mem_3707,
< int32_t num_threads_3709)
---
> __kernel void segred_nonseg_3665(__global unsigned char *mem_3672, __global
> unsigned char *counter_mem_3676, __global
> unsigned char *group_res_arr_mem_3678,
> int32_t num_threads_3680)
731c729,735
< const int32_t segred_group_sizze_3678 = mainzisegred_group_sizze_3677;
---
> const int32_t segred_group_sizze_3649 = mainzisegred_group_sizze_3648;
> const int32_t num_groups_3661 = sext_i64_i32(smax64(1,
> smin64(sext_i32_i64(mainzisegred_max_num_groups_3651),
> squot64(1000 +
> (sext_i32_i64(mainzisegred_group_sizze_3648) -
> 1),
> sext_i32_i64(mainzisegred_group_sizze_3648)))));
736,776c740,780
< ALIGNED_LOCAL_MEMORY(sync_arr_mem_3715_backing_0, 1);
< ALIGNED_LOCAL_MEMORY(red_arr_mem_3717_backing_1, 8 *
< mainzisegred_group_sizze_3677);
<
< int32_t global_tid_3710;
< int32_t local_tid_3711;
< int32_t group_sizze_3714;
< int32_t wave_sizze_3713;
< int32_t group_tid_3712;
<
< global_tid_3710 = get_global_id(0);
< local_tid_3711 = get_local_id(0);
< group_sizze_3714 = get_local_size(0);
< wave_sizze_3713 = LOCKSTEP_WIDTH;
< group_tid_3712 = get_group_id(0);
<
< int32_t phys_tid_3694 = global_tid_3710;
< __local char *sync_arr_mem_3715;
<
< sync_arr_mem_3715 = (__local char *) sync_arr_mem_3715_backing_0;
<
< __local char *red_arr_mem_3717;
<
< red_arr_mem_3717 = (__local char *) red_arr_mem_3717_backing_1;
<
< int32_t dummy_3692 = 0;
< int32_t gtid_3693;
<
< gtid_3693 = 0;
<
< int64_t x_acc_3719;
< int32_t chunk_sizze_3720 = smin32(squot32(num_elems_3668 +
< segred_group_sizze_3678 *
< num_groups_3690 - 1,
< segred_group_sizze_3678 *
< num_groups_3690),
< squot32(num_elems_3668 - phys_tid_3694 +
< num_threads_3709 - 1,
< num_threads_3709));
< int64_t x_3671;
< int64_t x_3672;
---
> ALIGNED_LOCAL_MEMORY(sync_arr_mem_3686_backing_0, 1);
> ALIGNED_LOCAL_MEMORY(red_arr_mem_3688_backing_1, 8 *
> mainzisegred_group_sizze_3648);
>
> int32_t global_tid_3681;
> int32_t local_tid_3682;
> int32_t group_sizze_3685;
> int32_t wave_sizze_3684;
> int32_t group_tid_3683;
>
> global_tid_3681 = get_global_id(0);
> local_tid_3682 = get_local_id(0);
> group_sizze_3685 = get_local_size(0);
> wave_sizze_3684 = LOCKSTEP_WIDTH;
> group_tid_3683 = get_group_id(0);
>
> int32_t phys_tid_3665 = global_tid_3681;
> __local char *sync_arr_mem_3686;
>
> sync_arr_mem_3686 = (__local char *) sync_arr_mem_3686_backing_0;
>
> __local char *red_arr_mem_3688;
>
> red_arr_mem_3688 = (__local char *) red_arr_mem_3688_backing_1;
>
> int32_t dummy_3663 = 0;
> int32_t gtid_3664;
>
> gtid_3664 = 0;
>
> int64_t x_acc_3690;
> int32_t chunk_sizze_3691 = smin32(squot32(1000 + segred_group_sizze_3649 *
> num_groups_3661 - 1,
> segred_group_sizze_3649 *
> num_groups_3661), squot32(1000 -
> phys_tid_3665 +
> num_threads_3680 -
> 1,
> num_threads_3680));
> int64_t x_3642;
> int64_t x_3643;
780c784
< x_acc_3719 = 0;
---
> x_acc_3690 = 0;
782,783c786,787
< for (int32_t i_3724 = 0; i_3724 < chunk_sizze_3720; i_3724++) {
< gtid_3693 = phys_tid_3694 + num_threads_3709 * i_3724;
---
> for (int32_t i_3695 = 0; i_3695 < chunk_sizze_3691; i_3695++) {
> gtid_3664 = phys_tid_3665 + num_threads_3680 * i_3695;
786,787c790,791
< int64_t binop_x_3695 = sext_i32_i64(gtid_3693);
< int64_t index_primexp_3696 = 1 + binop_x_3695;
---
> int64_t binop_x_3666 = sext_i32_i64(gtid_3664);
> int64_t index_primexp_3667 = 1 + binop_x_3666;
793c797
< x_3671 = x_acc_3719;
---
> x_3642 = x_acc_3690;
797c801
< x_3672 = index_primexp_3696;
---
> x_3643 = index_primexp_3667;
801c805
< int64_t res_3673 = x_3671 + x_3672;
---
> int64_t res_3644 = x_3642 + x_3643;
805c809
< x_acc_3719 = res_3673;
---
> x_acc_3690 = res_3644;
812,813c816,817
< x_3671 = x_acc_3719;
< ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
---
> x_3642 = x_acc_3690;
> ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
817,820c821,824
< int32_t offset_3725;
< int32_t skip_waves_3726;
< int64_t x_3721;
< int64_t x_3722;
---
> int32_t offset_3696;
> int32_t skip_waves_3697;
> int64_t x_3692;
> int64_t x_3693;
822c826
< offset_3725 = 0;
---
> offset_3696 = 0;
825,827c829,831
< if (slt32(local_tid_3711, segred_group_sizze_3678)) {
< x_3721 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3725];
---
> if (slt32(local_tid_3682, segred_group_sizze_3649)) {
> x_3692 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3696];
830,834c834,838
< offset_3725 = 1;
< while (slt32(offset_3725, wave_sizze_3713)) {
< if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
< ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) *
< wave_sizze_3713) & (2 * offset_3725 - 1)) == 0) {
---
> offset_3696 = 1;
> while (slt32(offset_3696, wave_sizze_3684)) {
> if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
> ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) *
> wave_sizze_3684) & (2 * offset_3696 - 1)) == 0) {
837,839c841,843
< x_3722 = ((volatile __local
< int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3725];
---
> x_3693 = ((volatile __local
> int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3696];
843c847
< int64_t res_3723 = x_3721 + x_3722;
---
> int64_t res_3694 = x_3692 + x_3693;
845c849
< x_3721 = res_3723;
---
> x_3692 = res_3694;
850c854
< int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
> int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
853c857
< offset_3725 *= 2;
---
> offset_3696 *= 2;
855,858c859,862
< skip_waves_3726 = 1;
< while (slt32(skip_waves_3726, squot32(segred_group_sizze_3678 +
< wave_sizze_3713 - 1,
< wave_sizze_3713))) {
---
> skip_waves_3697 = 1;
> while (slt32(skip_waves_3697, squot32(segred_group_sizze_3649 +
> wave_sizze_3684 - 1,
> wave_sizze_3684))) {
860,865c864,869
< offset_3725 = skip_waves_3726 * wave_sizze_3713;
< if (slt32(local_tid_3711 + offset_3725, segred_group_sizze_3678) &&
< ((local_tid_3711 - squot32(local_tid_3711, wave_sizze_3713) *
< wave_sizze_3713) == 0 && (squot32(local_tid_3711,
< wave_sizze_3713) & (2 *
< skip_waves_3726 -
---
> offset_3696 = skip_waves_3697 * wave_sizze_3684;
> if (slt32(local_tid_3682 + offset_3696, segred_group_sizze_3649) &&
> ((local_tid_3682 - squot32(local_tid_3682, wave_sizze_3684) *
> wave_sizze_3684) == 0 && (squot32(local_tid_3682,
> wave_sizze_3684) & (2 *
> skip_waves_3697 -
869,870c873,874
< x_3722 = ((__local int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3725];
---
> x_3693 = ((__local int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3696];
874c878
< int64_t res_3723 = x_3721 + x_3722;
---
> int64_t res_3694 = x_3692 + x_3693;
876c880
< x_3721 = res_3723;
---
> x_3692 = res_3694;
880c884
< ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
> ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
883c887
< skip_waves_3726 *= 2;
---
> skip_waves_3697 *= 2;
888,889c892,893
< if (local_tid_3711 == 0) {
< x_acc_3719 = x_3721;
---
> if (local_tid_3682 == 0) {
> x_acc_3690 = x_3692;
893c897
< int32_t old_counter_3727;
---
> int32_t old_counter_3698;
897,900c901,904
< if (local_tid_3711 == 0) {
< ((__global int64_t *) group_res_arr_mem_3707)[group_tid_3712 *
< segred_group_sizze_3678] =
< x_acc_3719;
---
> if (local_tid_3682 == 0) {
> ((__global int64_t *) group_res_arr_mem_3678)[group_tid_3683 *
> segred_group_sizze_3649] =
> x_acc_3690;
902,903c906,907
< old_counter_3727 = atomic_add(&((volatile __global
< int *) counter_mem_3705)[0],
---
> old_counter_3698 = atomic_add(&((volatile __global
> int *) counter_mem_3676)[0],
905,906c909,910
< ((__local bool *) sync_arr_mem_3715)[0] = old_counter_3727 ==
< num_groups_3690 - 1;
---
> ((__local bool *) sync_arr_mem_3686)[0] = old_counter_3698 ==
> num_groups_3661 - 1;
911c915
< bool is_last_group_3728 = ((__local bool *) sync_arr_mem_3715)[0];
---
> bool is_last_group_3699 = ((__local bool *) sync_arr_mem_3686)[0];
913,917c917,921
< if (is_last_group_3728) {
< if (local_tid_3711 == 0) {
< old_counter_3727 = atomic_add(&((volatile __global
< int *) counter_mem_3705)[0],
< (int) (0 - num_groups_3690));
---
> if (is_last_group_3699) {
> if (local_tid_3682 == 0) {
> old_counter_3698 = atomic_add(&((volatile __global
> int *) counter_mem_3676)[0],
> (int) (0 - num_groups_3661));
921,924c925,928
< if (slt32(local_tid_3711, num_groups_3690)) {
< x_3671 = ((__global
< int64_t *) group_res_arr_mem_3707)[local_tid_3711 *
< segred_group_sizze_3678];
---
> if (slt32(local_tid_3682, num_groups_3661)) {
> x_3642 = ((__global
> int64_t *) group_res_arr_mem_3678)[local_tid_3682 *
> segred_group_sizze_3649];
926c930
< x_3671 = 0;
---
> x_3642 = 0;
928c932
< ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3671;
---
> ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3642;
933,936c937,940
< int32_t offset_3729;
< int32_t skip_waves_3730;
< int64_t x_3721;
< int64_t x_3722;
---
> int32_t offset_3700;
> int32_t skip_waves_3701;
> int64_t x_3692;
> int64_t x_3693;
938c942
< offset_3729 = 0;
---
> offset_3700 = 0;
941,944c945,948
< if (slt32(local_tid_3711, segred_group_sizze_3678)) {
< x_3721 = ((__local
< int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3729];
---
> if (slt32(local_tid_3682, segred_group_sizze_3649)) {
> x_3692 = ((__local
> int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3700];
947,954c951,958
< offset_3729 = 1;
< while (slt32(offset_3729, wave_sizze_3713)) {
< if (slt32(local_tid_3711 + offset_3729,
< segred_group_sizze_3678) && ((local_tid_3711 -
< squot32(local_tid_3711,
< wave_sizze_3713) *
< wave_sizze_3713) & (2 *
< offset_3729 -
---
> offset_3700 = 1;
> while (slt32(offset_3700, wave_sizze_3684)) {
> if (slt32(local_tid_3682 + offset_3700,
> segred_group_sizze_3649) && ((local_tid_3682 -
> squot32(local_tid_3682,
> wave_sizze_3684) *
> wave_sizze_3684) & (2 *
> offset_3700 -
959,961c963,965
< x_3722 = ((volatile __local
< int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3729];
---
> x_3693 = ((volatile __local
> int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3700];
965c969
< int64_t res_3723 = x_3721 + x_3722;
---
> int64_t res_3694 = x_3692 + x_3693;
967c971
< x_3721 = res_3723;
---
> x_3692 = res_3694;
972c976
< int64_t *) red_arr_mem_3717)[local_tid_3711] = x_3721;
---
> int64_t *) red_arr_mem_3688)[local_tid_3682] = x_3692;
975c979
< offset_3729 *= 2;
---
> offset_3700 *= 2;
977,980c981,984
< skip_waves_3730 = 1;
< while (slt32(skip_waves_3730, squot32(segred_group_sizze_3678 +
< wave_sizze_3713 - 1,
< wave_sizze_3713))) {
---
> skip_waves_3701 = 1;
> while (slt32(skip_waves_3701, squot32(segred_group_sizze_3649 +
> wave_sizze_3684 - 1,
> wave_sizze_3684))) {
982,990c986,994
< offset_3729 = skip_waves_3730 * wave_sizze_3713;
< if (slt32(local_tid_3711 + offset_3729,
< segred_group_sizze_3678) && ((local_tid_3711 -
< squot32(local_tid_3711,
< wave_sizze_3713) *
< wave_sizze_3713) == 0 &&
< (squot32(local_tid_3711,
< wave_sizze_3713) &
< (2 * skip_waves_3730 -
---
> offset_3700 = skip_waves_3701 * wave_sizze_3684;
> if (slt32(local_tid_3682 + offset_3700,
> segred_group_sizze_3649) && ((local_tid_3682 -
> squot32(local_tid_3682,
> wave_sizze_3684) *
> wave_sizze_3684) == 0 &&
> (squot32(local_tid_3682,
> wave_sizze_3684) &
> (2 * skip_waves_3701 -
994,996c998,1000
< x_3722 = ((__local
< int64_t *) red_arr_mem_3717)[local_tid_3711 +
< offset_3729];
---
> x_3693 = ((__local
> int64_t *) red_arr_mem_3688)[local_tid_3682 +
> offset_3700];
1000c1004
< int64_t res_3723 = x_3721 + x_3722;
---
> int64_t res_3694 = x_3692 + x_3693;
1002c1006
< x_3721 = res_3723;
---
> x_3692 = res_3694;
1006,1007c1010,1011
< ((__local int64_t *) red_arr_mem_3717)[local_tid_3711] =
< x_3721;
---
> ((__local int64_t *) red_arr_mem_3688)[local_tid_3682] =
> x_3692;
1010c1014
< skip_waves_3730 *= 2;
---
> skip_waves_3701 *= 2;
1014,1015c1018,1019
< if (local_tid_3711 == 0) {
< ((__global int64_t *) mem_3701)[0] = x_3721;
---
> if (local_tid_3682 == 0) {
> ((__global int64_t *) mem_3672)[0] = x_3692;
None of that looks particularly dubious. The fact that it also fails with --default-num-groups=1
is a big red flag, because with that configuration there is no cross-group communication, which is the only semi-dubious thing that Futhark does.
Could you try with --default-num-groups=1 --default-group-size=1
? This will use only a single GPU thread (so pick a small constant). If this also fails, then it must be a pocl bug.
That worked!
Drat, that makes the smoking gun less obvious. I still suspect it's a pocl bug when it fails with --default-num-groups=1
. If it's a subtle memory coherency issue, then it also makes no sense that using a constant work size would matter. Unfortunately, I'm having some trouble getting pocl installed, but I will take a look eventually.
OK. If you come up with any else I should try, or things you need from my environment, let me know.
As a workaround, you can probably use avg (opaque 1000)
to hide the constant from the compiler.
Indeed, that works.
Since this has not come up anywhere but pocl, and Futhark-on-CPU is better served by the soon finished multicore backend, I am closing this issue.
Hey! First, thanks for working on such a nice language. I'd love to use this instead of writing OpenCL C and host code by hand.
Problem
The following
avg.fut
program, usingreduce
,and when provided with numbers on stdin, produces correct results:
However, if I accept no argument in
main
, and hard-coden
to1000
,it returns incorrect and non-deterministic results:
I've tried messing around with
--default-num-groups
, thinking that it could be related to #252, but it didn't help.Setup
Compiled by latest binary release and the following command:
OS: Fedora 30 Packages: system
ocl-icp
andopencl-headers
Device: pthread-Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (as reported byclinfo
)Thankful for any help!