Quuxplusone / LLVMBugzillaTest


Incorrect result with OMP_TARGET_OFFLOAD=DISABLED #41366

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR42393
Status NEW
Importance P normal
Reported by Ye Luo (xw111luoye@gmail.com)
Reported on 2019-06-25 14:16:39 -0700
Last modified on 2020-04-06 06:30:34 -0700
Version unspecified
Hardware PC Linux
CC a.bataev@hotmail.com, bruno.turcksin@gmail.com, jdoerfert@anl.gov, llvm-bugs@lists.llvm.org, terry.l.wilmarth@intel.com
The attached code gives correct results with OMP_TARGET_OFFLOAD=MANDATORY but
wrong results with OMP_TARGET_OFFLOAD=DISABLED.

$ OMP_NUM_THREADS=2 OMP_TARGET_OFFLOAD=MANDATORY ./a.out
tid = 1
  0  1  2  3
tid = 0
  0  1  2  3

$ OMP_NUM_THREADS=2 OMP_TARGET_OFFLOAD=DISABLED ./a.out
tid = 0
  0  1  0  0
tid = 1
  0  0  2  3

#include <iostream>
#include <omp.h>

int main()
{
  const int size = 4;
  #pragma omp parallel
  {
    int A[size];
    for(int i = 0; i < size; i++)
      A[i] = 0;

    #pragma omp target teams distribute map(tofrom: A[:size])
    for(int i = 0; i < size; i++)
    {
      A[i] = i;
    }

    #pragma omp critical
    {
      std::cout << "tid = " << omp_get_thread_num() << std::endl;
      for(int i = 0; i < size; i++)
        std::cout << "  " << A[i];
      std::cout << std::endl;
    }
  }
}
Quuxplusone commented 5 years ago

The problem is caused by the kmpc_push_target_tripcount function. It is not thread-safe and has a data race, which causes incorrect results.

Quuxplusone commented 5 years ago
(In reply to Alexey Bataev from comment #1)
> The problem is caused by the kmpc_push_target_tripcount function. It is not
> thread-safe and has a data race, which causes incorrect results.

Also, it seems to me there is a problem with the runtime; I will investigate this.
Quuxplusone commented 5 years ago
After some investigation, it seems to me the problem is in libomp. When we
schedule the distribute loop, the tid is taken from the outer threads (though
it should be set to 0 in all cases), and the number of threads is taken from
the outer parallel region (though, it seems to me, it should be set to 1).
It would be good to check whether libomp works correctly here.
Quuxplusone commented 4 years ago
I tried compiling this without offload:
$ clang++ -fopenmp debug.cpp
$ OMP_NUM_THREADS=2 ./a.out
tid = 0
  0  1  0  0
tid = 1
  0  0  2  3
Probably something is already wrong in libomp itself.
Quuxplusone commented 4 years ago

On the host we generate an "omp for"-like loop for the distribute, which binds to the outer parallel. Thus, each thread of the outer parallel executes only a single iteration of the A[i] = i loop instead of the entire thing.

Quuxplusone commented 4 years ago

(In reply to Johannes Doerfert from comment #5)
> On the host we generate an "omp for"-like loop for the distribute, which binds to the outer parallel. Thus, each thread of the outer parallel executes only a single iteration of the A[i] = i loop instead of the entire thing.

If an "omp for"-like loop is generated for the distribute, then an "omp parallel"-like region should be generated for the teams construct. Binding to the outer parallel is clearly wrong.

Quuxplusone commented 4 years ago

(In reply to Ye Luo from comment #6)
> (In reply to Johannes Doerfert from comment #5)
> > On the host we generate an "omp for"-like loop for the distribute, which binds to the outer parallel. Thus, each thread of the outer parallel executes only a single iteration of the A[i] = i loop instead of the entire thing.
>
> If an "omp for"-like loop is generated for the distribute, then an "omp parallel"-like region should be generated for the teams construct. Binding to the outer parallel is clearly wrong.

Agreed.