Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Wrong transformation due to semantic gap between C11 and LLVM semantics #22513

Closed Quuxplusone closed 9 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR22514
Status RESOLVED FIXED
Importance P normal
Reported by soham (sohachak@mpi-sws.org)
Reported on 2015-02-09 04:29:34 -0800
Last modified on 2015-02-10 17:12:11 -0800
Version trunk
Hardware PC Windows NT
CC david.majnemer@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk, sohachak@mpi-sws.org
Fixed by commit(s)
Attachments loadrace.zip (32790 bytes, application/x-zip-compressed)
Blocks
Blocked by
See also
Created attachment 13828
source and IR files

Hi,

The following C++11 source code where readA() and writeA() are running
concurrently is compiled with opt -O3.

Source
----------
atomic<int> x = 0;
int a = 0;
int readA(bool flag) {
 int r=0, r1=0;

 if(flag==true) {
    r = a;
 }
 if(x==1){
   r1 = a;
 }
 else {
  r1 = 42;
}
return (r+r1);
}

void writeA(){
  a = 42;
  x = 1;
}
:

Compilation command
---------------------
clang++ -std=c++11 -emit-llvm -pthread <filename>.cpp -S;opt -O3 <filename>.ll -
o <filename>.opt.bc -S

Target
--------
define i32 @_Z5readAb(i1 zeroext %flag) #3 {
entry:
  %0 = load i32* @a, align 4
  %. = select i1 %flag, i32 %0, i32 0
  %1 = load atomic i32* getelementptr inbounds (%"struct.std::atomic"* @x, i64 0, i32 0, i32 0) seq_cst, align 4
  %cmp1 = icmp eq i32 %1, 1
  %r1.0 = select i1 %cmp1, i32 %0, i32 42
  %add = add nsw i32 %r1.0, %.
  ret i32 %add
}

define void @_Z6writeAv() #3 {
entry:
  store i32 42, i32* @a, align 4
  store atomic i32 1, i32* getelementptr inbounds (%"struct.std::atomic"* @x, i64 0, i32 0, i32 0) seq_cst, align 4
  ret void
}
:

Suppose now we run readA(false) in parallel with writeA().
The source program is data race free and can return only 42.
The target program, however, is racy and could return any value (practically,
it returns 0).

Analysis of the transformation steps
-------------------------------------
(1) The "Simplify CFG" pass introduces a speculative load of 'a' (introducing a
data race).

IR
---
define i32 @_Z5readAb(i1 zeroext %flag) #3 {
entry:
  %0 = load i32* @a, align 4
  %. = select i1 %flag, i32 %0, i32 0
  %call = call i32 @_ZNKSt13__atomic_baseIiEcviEv(%"struct.std::__atomic_base"* getelementptr inbounds (%"struct.std::atomic"* @x, i64 0, i32 0)) #2
  %cmp1 = icmp eq i32 %call, 1
  %1 = load i32* @a, align 4
  %r1.0 = select i1 %cmp1, i32 %1, i32 42
  %add = add nsw i32 %., %r1.0
  ret i32 %add
}

The discussion in https://groups.google.com/forum/#!topic/llvm-dev/5OH6B-nIRyo
and http://llvm.org/docs/Atomics.html#optimization-outside-atomic suggest that
"speculative loads are allowed; a load which is part of a race returns undef,
but does not have undefined behavior".
This is different from standard C11 semantics where a racy program has
undefined behavior.

This "benign" race as %0 will have "undef" value, but this value is not used
when flag=false.

(2) The Early CSE pass removes the second load of 'a' considering it as
redundant.

IR
----
define i32 @_Z5readAb(i1 zeroext %flag) #3 {
entry:
  %0 = load i32* @a, align 4
  %. = select i1 %flag, i32 %0, i32 0
  %1 = load atomic i32* getelementptr inbounds (%"struct.std::atomic"* @x, i64 0, i32 0, i32 0) seq_cst, align 4
  %cmp1 = icmp eq i32 %1, 1
  %r1.0 = select i1 %cmp1, i32 %0, i32 42
  %add = add nsw i32 %., %r1.0
  ret i32 %add
}

As a result of this transformation, the race introduced in step (1) is no
longer benign.

The shared memory access sequences in the source program is R_sc(x); R_na(a)
when flag=false and x=1.
In the target program the shared memory access sequence is R_na(a); R_sc(x)
when flag=false.

The transformation r=R_na(a); R_sc(x); r1=R_na(a) ~> r=R_na(a); R_sc(x); r1=r
is correct according to C11 because any program that can observe the difference
is racy and therefore has undefined semantics.
However, under the LLVM model where races do not have totally undefined
semantics, the transformation is incorrect.

Summary
--------
One of the two transformations has to be disabled.

The testcase and the IR files are attached.

Best Regards,
soham
Quuxplusone commented 9 years ago

Attached loadrace.zip (32790 bytes, application/x-zip-compressed): source and IR files

Quuxplusone commented 9 years ago

Both functions have unsynchronized access to a and are thus racey.

Quuxplusone commented 9 years ago
In the source program

x = 1 => x.store(1,std::memory_order_seq_cst) and
x == 1 => x.load(std::memory_order_seq_cst) == 1

The R_sc(X,1) in readA() reads-from W_sc(X,1) in writeA() which results in
synchronization and W_na(a,42) happens-before R_na(a,42) and thus synchronized.
If (x==1) test fails then W(x) and R(x) is not synchronized and in that case
R_na(a) does not take place.

Thus the source program is race free.
Quuxplusone commented 9 years ago
(In reply to comment #2)
> The R_sc(X,1) in readA() reads-from W_sc(X,1) in writeA() which results in
> synchronization

That is not correct; no synchronization happens here.

> and W_na(a,42) happens-before R_na(a,42)

No, there is no happens-before relation here.

If you reopen this again, please provide an explanation of why you think there
is a happens-before relation here, referencing the rules in [intro.multithread]
in the C++ standard to justify the steps in your explanation.
Quuxplusone commented 9 years ago
Reference
----------
ISO/IEC 14882:2011 Programming Language C++
Link: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf

> The R_sc(X,1) in readA() reads-from W_sc(X,1) in writeA() which results in
> synchronization

 >> That is not correct; no synchronization happens here.

 29.3 page - 1116
From 1.
"memory_order_release, memory_order_acq_rel, and memory_order_seq_cst: a store
operation performs
a release operation on the affected memory location."

From 2.
"An atomic operation A that performs a release operation on an atomic object M
synchronizes with an atomic
operation B that performs an acquire operation on M and takes its value from
any side effect in the release
sequence headed by A."

Considering these the R_sc(x,1) in readA() reads-from W_sc(x,1) in writeA()
which results in synchronization.

> and W_na(a,42) happens-before R_na(a,42)

>> No, there is no happens-before relation here.

1.10 [intro.multithreaded] page 13-14

"
11 An evaluation A inter-thread happens before an evaluation B if
— A synchronizes with B, or
— A is dependency-ordered before B, or
— for some evaluation X
— A synchronizes with X and X is sequenced before B, or
— A is sequenced before X and X inter-thread happens before B, or
— A inter-thread happens before X and X inter-thread happens before B.
"
"
12 An evaluation A happens before an evaluation B if:
— A is sequenced before B, or
— A inter-thread happens before B
"

In our example x=1 and x==1 synchronizes(sw); hence x=1 happens-before(hb) x==1.

Also a=42; is sequenced before(sb) x=1 in writeA() and in readA() x==1 is
sequenced before r1=a.

Hence a=42 ->(sb) x=1 ->(sw) x==1 ->(sb) r1=a which means a=42 ->(hb) r1=a and
the program is not racy.

If you reopen this again, please provide an explanation of why you think there
is a happens-before relation here, referencing the rules in [intro.multithread]
in the C++ standard to justify the steps in your explanation.
Quuxplusone commented 9 years ago

Apologies for misreading / misunderstanding, and thanks for your persistence. Yes, assuming writeA is only called once and readA is called with flag == false, this code is race-free, and LLVM's transformation is incorrect.

I agree with comment#0 that it is the load-combining within Early CSE that broke the code: load-combining should notionally reorder the loads so they are adjacent before removing them, and doing so would move the second load of 'a' to before the load-acquire of x. Reordering a load before an acquire is not a permissible transformation.

(I don't think there is an important semantic gap between LLVM and the C++11 memory model here, there's just a bug in this pass.)

Quuxplusone commented 9 years ago

Should be fixed with r228760, tests added with r228761.