Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Captured variables stored twice into task #44017

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR45047
Status NEW
Importance P enhancement
Reported by Johannes Doerfert (jdoerfert@anl.gov)
Reported on 2020-02-27 09:05:26 -0800
Last modified on 2020-03-02 09:22:59 -0800
Version unspecified
Hardware PC Linux
CC andrey.churbanov@intel.com, jonathan.l.peyton@intel.com, llvm-bugs@lists.llvm.org, michael.klemm@amd.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

In the following simple example we store A twice into the task structure (https://godbolt.org/z/Wq9xx6). The second time is especially problematic since we use the hardcoded 40 byte offset from the task_alloc return instead of loading the shared pointer value (first in the task struct).

void foo(int *A) {

pragma omp task

{
    *A = 0;
}

}

Quuxplusone commented 4 years ago
Sorry if I misread the IR (still newbie on that), but what I see on Godbolt is:
pointer A is written to task structure once - to privates area (at offset 40).

And it is read from that location inside the outlined task routine.

What is unclear to me - why stack address (where pointer A is also written) is
put into the shareds area of the task structure? This address is anyway going
to be stale during task execution... So to me the first store into the task
structure looks suspicious.

BTW, according to OpenMP spec the pointer A should be firstprivate, so keeping
it in privates looks reasonable.
Quuxplusone commented 4 years ago
The problem IMHO is that the offset used (40) is computed statically instead of
simply following the "shareds" pointer in kmp_task_t.  So, the proper way of
referring to the storage location would be to have assignments like this:

task = __kmpc_omp_task_alloc(...)
*(void*)(((char*)task->shared) + 0) = ptr;
*(double*)(((char*)task->shared) + 8) = some_dp_value;
*(int*)(((char*)task->shared) + 16) = some_other_value;

Right now, the compiler seems to do this:

task = __kmpc_omp_task_alloc(...)
*(void*)(((char*)task) + 40) = ptr;
*(double*)(((char*)task) + 48) = some_dp_value;
*(int*)(((char*)task) + 56) = some_other_value;

Which I you think about a runtime that tries to be compatible requires you to
do the same trick as the LLVM OpenMP runtime and pre-prend the kmp_taskdata_t
struct /before/ kmp_task_t, instead of simply allowing to extend the kmp_task_t
structure.
Quuxplusone commented 4 years ago
I'm not 100% sure about my analysis, but I think there's another issue
associated with this bug report.  It also seems that the compiler not only
writes the task memory twice, it also causes some extra memory consumption.
When the generated code invokes __kmpc_omp_task_alloc, it seems to pass wrong
byte counts:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=8

This should be:

__kmpc_omp_task_alloc: sizeof_kmp_task_t=40 sizeof_shareds=8

as sizeof(kmp_task_t) is 40.  With tasks that receive more data, it seems that
clang incorrectly adds the size of the shareds space to the size of the
kmp_task_t structure.  In the runtime, the allocation size is computed using
sizeof_kmp_task_t+sizeof_shareds, so some extra memory is allocated that is not
needed.

In addition, the way the duplicated writes are done store the duplicated values
at task+0x28 and task->shareds:

0x400879  callq  __kmpc_omp_task_alloc
0x40087e  mov    (%rax),%rcx       # load "task->shareds" (address: task+48b)
0x400881  mov    -0x10(%rbp),%rdi  # load 'd' from stack
0x400885  mov    %rdi,(%rcx)       # write to "*(double*)task->shareds"
0x400888  movsd  (%rcx),%xmm0      # load 'd' into xmm0
0x40088c  movsd  %xmm0,0x28(%rax)  # write 'd' at task+40b

This only works correctly, because the OpenMP runtime allocates the unneeded,
extra amount (see kmp_tasking.cc:1229-1245):

1227 // Calculate shared structure offset including padding after kmp_task_t
struct
1228 // to align pointers in shared struct
1229 shareds_offset = sizeof(kmp_taskdata_t) + sizeof_kmp_task_t;
1230 shareds_offset = __kmp_round_up_to_val(shareds_offset, sizeof(void *));
[...]
1240   taskdata = (kmp_taskdata_t *)__kmp_fast_allocate(thread, shareds_offset +
1241
sizeof_shareds);
[...]
1243   taskdata = (kmp_taskdata_t *)__kmp_thread_malloc(thread, shareds_offset +
1244
sizeof_shareds);

Reproducer:

// compile with: clang -fopenmp -o reproducer reproducer.c
#include <stdio.h>
#include <omp.h>

void create_task(double d) {
#pragma omp task firstprivate(d)
    {
        printf("d=%lf\n", d);
    }
}

int main(int argc, char *argv[]) {
    double d = 42.0;
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            create_task(d);
        }
    }
    return 0;
}
Quuxplusone commented 4 years ago
(In reply to Michael Klemm from comment #3)
> When the generated code invokes __kmpc_omp_task_alloc, it seems to pass
> wrong byte counts:
>
> __kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=8
>
> This should be:
>
> __kmpc_omp_task_alloc: sizeof_kmp_task_t=40 sizeof_shareds=8
>
> as sizeof(kmp_task_t) is 40.
I'd say it should better be
 __kmpc_omp_task_alloc: sizeof_kmp_task_t=48 sizeof_shareds=0

Because firstprivate is written to kmp_task_t_with_privates (increasing its
size from 40 to 48 bytes), while sample code does not have any shareds to
propagate to the task.

I indeed misread the IR, and looking at the assembler generated for Michael's
example I see the value of the variable is written twice (not value and address
as I suspected initially).  So Johannes' investigation looks correct.

Though I still think the write to shareds may be treated as "wrong", while the
write to privates is correct.
Quuxplusone commented 4 years ago
There is two problems here:
  1) We capture it twice, which will be gone with the OpenMPIRBuilder code path.
  2) We hard coded the offset into the IR which tightly couples the IR to the specific runtime implementation. The proper way is describe by Michael.