Bad perf with memory refinement query

regehr commented 7 months ago

here's an interesting test case that @tanmaytirpankar found:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 16
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

it exposes two issues. first, if the insertelement index is OOB, the result is a vector full of poison. however, on the ARM side, we get something like:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr nocapture writeonly %0, i8 %1, i32 %2) local_unnamed_addr #0 {
arm_tv_entry:
  %a2_2 = and i32 %2, 16
  %a2_3 = zext nneg i32 %a2_2 to i64
  %3 = getelementptr i8, ptr %0, i64 %a2_3
  store i8 %1, ptr %3, align 1
  ret void
}

of course at this level, an OOB store is UB. so we don't have a refinement, and moreover, I don't know that we can easily fix this in the general case, as we can for other target-side UB like divide by zero and shift-past-bitwidth.

but that's not all! if we force the index to be in bounds like this:

define void @insert_store_nonconst_index_not_known_valid_by_and(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 15
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

then of course we don't have any problems with UB, but we still can't prove refinement. here's a src-tgt pair that doesn't seem too difficult, but it runs for 99 minutes and then I get "ERROR: SMT Error: smt tactic failed to show goal to be sat/unsat memout":

define void @src(ptr %0, i8 zeroext %1, i32 %2) {
 %x4 = load <16 x i8>, ptr %0, align 1
 %z = and i32 %2, 15
 %x6 = insertelement <16 x i8> %x4, i8 %1, i32 %z
 store <16 x i8> %x6, ptr %0, align 1
 ret void
}

define void @tgt(ptr %0, i8 %1, i32 %2) {
arm_tv_entry:
   %z = and i32 %2, 15
   %3 = getelementptr i8, ptr %0, i32 %z
  store i8 %1, ptr %3, align 1
  ret void
}

regehr@ohm:~/alive2-regehr/build$ time ./alive-tv foo2.ll --disable-poison-input --disable-undef-input --smt-to=10000000

----------------------------------------
define void @src(ptr %#0, i8 zeroext %#1, i32 %#2) {
#3:
  %x4 = load <16 x i8>, ptr %#0, align 1
  %z = and i32 %#2, 15
  %x6 = insertelement <16 x i8> %x4, i8 zeroext %#1, i32 %z
  store <16 x i8> %x6, ptr %#0, align 1
  ret void
}
=>
define void @tgt(ptr %#0, i8 %#1, i32 %#2) {
arm_tv_entry:
  %z = and i32 %#2, 15
  %#3 = gep ptr %#0, 1 x i32 %z
  store i8 %#1, ptr %#3, align 1
  ret void
}

ERROR: SMT Error: smt tactic failed to show goal to be sat/unsat memout

real    99m30.728s
user    98m27.443s
sys 0m42.087s
regehr@ohm:~/alive2-regehr/build$

nunoplopes commented 7 months ago

LLVM is buggy. The generated assembly is:

src:
        and     w8, w2, #0x10
        strb    w1, [x0, w8, uxtw]
        ret

It may store a byte OOB where the original code did not.

regehr commented 7 months ago

ah.... thanks Nuno, we missed that. they're not going to be happy about this since fixing it will add instructions, unless the fix is to make OOB InsertElement instructions UB.

I'll leave this issue open for the performance problem, but of course close if you want

nunoplopes commented 7 months ago

we need to improve perf for this:

define void @src(ptr %0, i8 zeroext %1, i32 %2) {
 %4 = load <16 x i8>, ptr %0, align 16
 %5 = and i32 %2, 15
 %6 = insertelement <16 x i8> %4, i8 %1, i32 %5
 store <16 x i8> %6, ptr %0, align 16
 ret void
}

define void @tgt(ptr nocapture writeonly %0, i8 %1, i32 %2) {
  %a2_2 = and i32 %2, 15
  %a2_3 = zext nneg i32 %a2_2 to i64
  %x3 = getelementptr i8, ptr %0, i64 %a2_3
  store i8 %1, ptr %x3, align 1
  ret void
}

Essentially the refinement query uses a forall-quantified idx to go through the memories. It should be unrolled in small (all?) cases.

AliveToolkit / alive2

Bad perf with memory refinement query #983