OpenMP Synch_p2p depends on x86

jeffhammond commented 2 years ago

Nondeterministic failures in OpenMP Synch_p2p on Graviton 3 suggest the code depends on x86 memory model behavior and needs fixing.

[jhammond@c7g-dy-c7g16xlarge-1 Synch_p2p]$ ./p2p 8 10 8000 8000
Parallel Research Kernels version 2.17
OpenMP pipeline execution on 2D grid
Number of threads         = 8
Grid sizes                = 8000, 8000
Number of iterations      = 10
Neighbor thread handshake = on
Solution validates
Rate (MFlops/s): 13230.540174 Avg time (s): 0.009672
[jhammond@c7g-dy-c7g16xlarge-1 Synch_p2p]$ ./p2p 8 10 8000 8000
Parallel Research Kernels version 2.17
OpenMP pipeline execution on 2D grid
Number of threads         = 8
Grid sizes                = 8000, 8000
Number of iterations      = 10
Neighbor thread handshake = on
ERROR: checksum 143982.000000 does not match verification value 175978.000000

This pattern is almost certainly wrong in general (i.e. not x86).

    if (TID==0) { /* first thread waits for corner value to be copied            */
      while (flag(0,0) == true) {
        #pragma omp flush
      }
#if SYNCHRONOUS
      flag(0,0)= true;
      #pragma omp flush
#endif
    }

jeffhammond commented 2 years ago

@tgmattso if you are looking for an excuse to write examples of OpenMP atomics, this is a good one.

tgmattso commented 2 years ago

Yes, this code is wrong. It works on x86 since we apply relaxed atomics in the x86 memory model. The current code is based on the very old, original OpenMP. We realized we needed a more flexible mechanism built around atomics which we added in OpenMP 4.0. I haven't tested this yet in syncP2P, but the right patter is the following.

    if (TID==0) { /* first thread waits for corner value to be copied            */
      while (1) {
        #pragma omp atomic read seq_cst
            flg_tmp = flag(0,0);
        if (flg_tmp == true)  break;
      }
#if SYNCHRONOUS
      #pragma omp atomic write seq_cst
         flag(0,0)= true;
#endif
    }

tgmattso commented 2 years ago

sorry, I accidently marked this as closed. It's not closed until I test and verify the code.

rfvander commented 2 years ago

Geez, who wrote that shitty code?

tgmattso commented 2 years ago

I don't know. :)

But I take considerable blame for this since for many years, that is how I told people to handle point to point synchronization in OpenMP. The real problem is the OpenMP specification prior to OpenMP 4.0. Those of us creating it didn't know better and created an API with insufficient atomics to write such code. Shame on them.

rfvander commented 2 years ago

What's funny is that as I was implementing the synch_p2p code I looked for examples of the type of synchronization needed and landed on the LU NAS Parallel Benchmark. I soon discovered it was wrong and had been wrong for years. I was very proud of my discovery and made sure not to repeat the mistake with synch_p2p. Or so I thought ...

jeffhammond commented 1 year ago

i finally got around to fixing this. hopefully at least one of you has a chance to review it.

ParRes / Kernels

OpenMP Synch_p2p depends on x86 #611