GaloisInc / llvm-pretty-bc-parser

Parser for the llvm bitcode format

Parsing failures #158

Open kquick opened 3 years ago

kquick commented 3 years ago

I have some candidate code (PX4_Autopilot) that is having parser decode failures with multiple symptoms:

1. Failure to decode an fcmp operation value. The defined operation values are 0-15, but the value that llvm-pretty-bc-parser is attempting to resolve is 24.
2. After working around the above by defaulting to ftrue, I'm seeing invalid operand values for the instruction.

Here's the llvm-dis output:

    %147 = load float, float* %146, align 4, !dbg !217760
    %148 = fmul nsz arcp float %147, 0x3F847AE140000000, !dbg !217761
    %149 = fcmp nsz arcp ogt float %145, %148, !dbg !217769

With llvm-pretty-bc-parser, llvm-disasm has problems on %149. Even with the predicate defaulted to ftrue, it shows:

    %149 = fcmp true float %145, %147, !dbg ...

Note the second operand is decoded as %147 whereas llvm decodes to %148.

It's unknown whether there are other parsing problems. This one was detectable only because the lookup in (1) above happened to fail the parser.
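
The out-of-range value can be checked against the predicate table with a small sketch. The predicate names follow the order of LLVM's fcmp predicates; the two flag bit positions are an assumption taken from my reading of LLVM's bitcode fast-math encoding (nsz as bit 3, arcp as bit 4) and should be verified against the LLVM headers. The function name is hypothetical and is not part of llvm-pretty-bc-parser.

```python
# fcmp predicates in code order 0..15 (LLVM's FCMP_FALSE .. FCMP_TRUE).
FCMP_PREDICATES = [
    "false", "oeq", "ogt", "oge", "olt", "ole", "one", "ord",
    "uno", "ueq", "ugt", "uge", "ult", "ule", "une", "true",
]

# Assumed bit positions of two fast-math flags in the bitcode flags word.
NSZ = 1 << 3
ARCP = 1 << 4


def decode_fcmp_predicate(code: int) -> str:
    """Map a predicate code to its name, rejecting anything outside 0..15."""
    if not 0 <= code < len(FCMP_PREDICATES):
        raise ValueError(f"fcmp predicate code out of range: {code}")
    return FCMP_PREDICATES[code]


print(decode_fcmp_predicate(2))  # the predicate llvm-dis reports for %149
print(NSZ | ARCP)                # flags word for "nsz arcp" under the assumption above
```

Under that assumption, `nsz | arcp` is exactly 24, which would be consistent with the flags word being read in the slot where the predicate was expected.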

kquick commented 3 years ago

n.b., the llvm-disasm helpfully shows the 0x3F847AE140000000 constant in register %148 to be 1.0e-2
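
For reference, the hex form can be converted by hand. This is a minimal sketch, assuming the literal is the 16-hex-digit bit pattern of an IEEE-754 double (0x3F847AE140000000), which is how LLVM's textual IR prints float constants that have no exact short decimal form. The helper names are hypothetical.

```python
import struct


def llvm_hex_to_double(hex_literal: str) -> float:
    """Interpret an LLVM-style 0x... literal as the bit pattern of a double."""
    bits = int(hex_literal, 16)
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


def fits_in_float32(value: float) -> bool:
    """True if the value survives a round trip through single precision."""
    return struct.unpack(">f", struct.pack(">f", value))[0] == value


value = llvm_hex_to_double("0x3F847AE140000000")
print(value)                   # close to 1.0e-2, but not exactly 0.01
print(fits_in_float32(value))  # the constant is the float nearest to 0.01
```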

RyanGlScott commented 3 years ago

I think I'm encountering the same issue, or at least a very similar one. To reproduce, compile the following file:

double a = 0;
void b() {
  for (int c = 0; c < 2; c++)
    a++;
  if (a)
    a = 7;
}

int main() {
  b();
  return 0;
}

With:

$ clang -O1 test.c -S -emit-llvm -fno-discard-value-names
$ clang -O1 test.c -c -emit-llvm -fno-discard-value-names

This will produce the following bitcode with Clang 10:

```
; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@a = dso_local local_unnamed_addr global double 0.000000e+00, align 8

; Function Attrs: nofree norecurse nounwind uwtable
define dso_local void @b() local_unnamed_addr #0 {
entry:
  %0 = load double, double* @a, align 8, !tbaa !2
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.body
  %tobool = fcmp une double %inc, 0.000000e+00
  %storemerge = select i1 %tobool, double 7.000000e+00, double %inc
  store double %storemerge, double* @a, align 8, !tbaa !2
  ret void

for.body:                                         ; preds = %entry, %for.body
  %1 = phi double [ %0, %entry ], [ %inc, %for.body ]
  %c.04 = phi i32 [ 0, %entry ], [ %inc1, %for.body ]
  %inc = fadd double %1, 1.000000e+00
  %inc1 = add nuw nsw i32 %c.04, 1
  %cmp = icmp eq i32 %c.04, 0
  br i1 %cmp, label %for.body, label %for.cond.cleanup
}

; Function Attrs: nofree norecurse nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  call void @b()
  ret i32 0
}

attributes #0 = { nofree norecurse nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0-4ubuntu1 "}
!2 = !{!3, !3, i64 0}
!3 = !{!"double", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}
```

Attempting to parse this with llvm-pretty-bc-parser fails on the fcmp une double %inc, 0.000000e+00 instruction:

λ> parseBitCodeFromFile "test.bc"
Left (Error {errContext = ["FUNC_CODE_INST_CMP2","@b","FUNCTION_BLOCK","FUNCTION_BLOCK_ID","value symbol table","MODULE_BLOCK","Bitstream"], errMessage = "parseField: unable to parse record field 4 of record Record {recordCode = 28, recordFields = [FieldLiteral (BitString {bsLength = 35, bsData = 4294967292}),FieldLiteral (BitString {bsLength = 5, bsData = 0}),FieldLiteral (BitString {bsLength = 5, bsData = 8}),FieldLiteral (BitString {bsLength = 5, bsData = 14})]}"})

The code which parses FUNC_CODE_INST_CMP2 is located here:

https://github.com/GaloisInc/llvm-pretty-bc-parser/blob/d1a605738498e32e366d2d0e7020d442f5d757fb/src/Data/LLVM/BitCode/IR/Function.hs#L986-L1026

Comparing that with the corresponding code in LLVM, I'm skeptical that the code in llvm-pretty-bc-parser is implemented correctly. In particular, the LLVM code parses the fast-math flags, whereas the llvm-pretty-bc-parser code tries to skip over the fast-math flags somewhere in the middle.

RyanGlScott commented 3 years ago

Indeed, llvm-pretty-bc-parser can successfully parse fcmp instructions that lack fast-math flags, but it will fail if the instruction has fast-math flags. This can easily be seen by compiling the following example with and without the -ffast-math flag:

double a = 0;
int main() {
  if (a) {
    a++;
  }
  return 0;
}

If you compile without -ffast-math:

$ clang -O3 test.c -emit-llvm -S -fno-discard-value-names
$ clang -O3 test.c -emit-llvm -c -fno-discard-value-names

You will get this test.ll file:

```
; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@a = dso_local local_unnamed_addr global double 0.000000e+00, align 8

; Function Attrs: nofree norecurse nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %0 = load double, double* @a, align 8, !tbaa !2
  %tobool = fcmp une double %0, 0.000000e+00
  br i1 %tobool, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %inc = fadd double %0, 1.000000e+00
  store double %inc, double* @a, align 8, !tbaa !2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret i32 0
}

attributes #0 = { nofree norecurse nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0-4ubuntu1 "}
!2 = !{!3, !3, i64 0}
!3 = !{!"double", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}
```

And llvm-pretty-bc-parser can parse test.bc without issue, since the fcmp instructions lack fast-math flags. If you repeat the same steps with the -ffast-math flag, however:

$ clang -O3 test.c -emit-llvm -S -fno-discard-value-names -ffast-math
$ clang -O3 test.c -emit-llvm -c -fno-discard-value-names -ffast-math

You will get this test.ll file:

```
; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@a = dso_local local_unnamed_addr global double 0.000000e+00, align 8

; Function Attrs: nofree norecurse nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %0 = load double, double* @a, align 8, !tbaa !2
  %tobool = fcmp fast une double %0, 0.000000e+00
  br i1 %tobool, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %inc = fadd fast double %0, 1.000000e+00
  store double %inc, double* @a, align 8, !tbaa !2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret i32 0
}

attributes #0 = { nofree norecurse nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="true" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0-4ubuntu1 "}
!2 = !{!3, !3, i64 0}
!3 = !{!"double", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}
```

Note that we now have an fcmp instruction with a fast flag. As a consequence, llvm-pretty-bc-parser fails to parse it:

λ> parseBitCodeFromFile "test.bc"
Left (Error {errContext = ["FUNC_CODE_INST_CMP2","@main","FUNCTION_BLOCK","FUNCTION_BLOCK_ID","value symbol table","MODULE_BLOCK","Bitstream"], errMessage = "parseField: unable to parse record field 3 of record Record {recordCode = 28, recordFields = [FieldLiteral (BitString {bsLength = 5, bsData = 1}),FieldLiteral (BitString {bsLength = 5, bsData = 7}),FieldLiteral (BitString {bsLength = 5, bsData = 14}),FieldLiteral (BitString {bsLength = 10, bsData = 254})]}"})
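
The last field of that record (254) can be unpacked with a short sketch. The bit assignments below are an assumption based on my reading of LLVM's bitcode fast-math encoding (bit 0 is a legacy "unsafe algebra" bit, bits 1 to 7 are the individual flags); the function name is hypothetical and this is not llvm-pretty-bc-parser's code.

```python
# Assumed bit layout of the fast-math flags word in LLVM bitcode.
LEGACY_UNSAFE_ALGEBRA = 1 << 0
FLAG_BITS = [
    (1 << 1, "nnan"),
    (1 << 2, "ninf"),
    (1 << 3, "nsz"),
    (1 << 4, "arcp"),
    (1 << 5, "contract"),
    (1 << 6, "afn"),
    (1 << 7, "reassoc"),
]
ALL_FLAGS = sum(bit for bit, _ in FLAG_BITS)


def decode_fast_math_flags(word: int) -> list:
    """Return flag names; collapse to ['fast'] when every flag is set."""
    if word & LEGACY_UNSAFE_ALGEBRA or word & ALL_FLAGS == ALL_FLAGS:
        return ["fast"]
    return [name for bit, name in FLAG_BITS if word & bit]


print(decode_fast_math_flags(254))
```

Under these assumptions 254 has all seven individual flags set, which matches the `fast` keyword on the fcmp in test.ll.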

RyanGlScott commented 3 years ago

To add a wrinkle to all of this, the example in https://github.com/GaloisInc/llvm-pretty-bc-parser/issues/158#issuecomment-910958621 demonstrates a situation where an fcmp instruction in an .ll file seemingly lacks fast-math flags, but the corresponding .bc file does have fast-math flags. This can be verified by running llvm-bcanalyzer-10 test.bc -dump, which shows an INST_CMP2 with four operands (one for the comparison type, two for the argument values, and one for the fast-math flags) instead of the usual three:

<INST_CMP2 op0=4294967292 op1=0 op2=8 op3=14/>

I'm not quite sure why this happens, but either way, the culprit is the presence of fast-math flags in the .bc file.
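
One way to probe which slot is which is to walk the record the way LLVM's own reader appears to. The sketch below is based on my reading of LLVM's bitcode reader and should be treated as an assumption to check against the LLVM source: operands are relative value IDs stored as unsigned 32-bit numbers, a left-hand operand that is a forward reference (relative ID of zero or below once reinterpreted as signed) is followed by an explicit type ID, and a flags word is optional after the predicate. The function names are hypothetical.

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit field as a signed relative ID."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def walk_cmp2(fields: list) -> dict:
    """Assign roles to CMP2 record slots under the assumed layout."""
    slot = 0
    roles = {}
    roles["lhs_relative_id"] = to_signed32(fields[slot])
    slot += 1
    if roles["lhs_relative_id"] <= 0:
        # Forward reference: the value is defined later, so a type ID follows.
        roles["lhs_type_id"] = fields[slot]
        slot += 1
    roles["rhs_relative_id"] = to_signed32(fields[slot])
    slot += 1
    roles["predicate"] = fields[slot]
    slot += 1
    if slot < len(fields):
        roles["fast_math_flags"] = fields[slot]
        slot += 1
    if slot != len(fields):
        raise ValueError("unexpected trailing fields in CMP2 record")
    return roles


print(walk_cmp2([4294967292, 0, 8, 14]))  # the record dumped above
print(walk_cmp2([1, 7, 14, 254]))         # the -ffast-math record
```

Under this reading, op0 of the dumped record is -4, a forward reference to %inc (which is defined in a later block), so op1 would be a type ID and 14 would be the `une` predicate, with no flags word present. That possibility seems worth ruling out before attributing the fourth operand to fast-math flags.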

Ptival commented 3 years ago

I'm seeing the following with the current master:

  %149 = fcmp ogt float %145, %148, !dbg !DILocation(line: 62, column: 12, scope: !256754, inlinedAt: !256755)

Does this mean the problem is fixed?

RyanGlScott commented 3 years ago

I believe it is fixed, but I never figured out how to test against PX4_Autopilot, so I left this issue open as a reminder to do so; see https://github.com/GaloisInc/llvm-pretty-bc-parser/pull/161#issuecomment-920886753. (Is the code you're testing from PX4_Autopilot?)

kquick commented 3 years ago

I was encountering multiple issues with the PX4_Autopilot code base (using BC files extracted with build-bom). It was definitely hitting the fcmp issue that is fixed above, but when I patched around that issue, it ran into other problems. Since bitcode is a stream of bits, it's entirely possible that my workaround was insufficient to get decoding back on track and that this was the only issue. However, I was unable to successfully parse that code base with anything later than LLVM 9, so I would recommend repeating the experiment to see how things stand now with the above fix in place.

RyanGlScott commented 3 years ago

I've been unable to even compile PX4_Autopilot with Clang to obtain .bc files, so I don't know how to test this myself. Can you describe how you did so?

Ptival commented 3 years ago

I can confirm that, after fixing some bugs in the handling of fast math flags, I also read the correct operand in a bitcode file obtained via LLVM 10.

I need to fix some other bugs to be able to read the bitcode file obtained from LLVM 12.

@RyanGlScott I am using a nix flake provided by @kquick to get bitcode files from PX4_Autopilot. To quote Kevin:

The way it works is that the build-bom generate-bitcode BLDCMD runs BLDCMD but it watches for clang compilations and when it sees one, it re-issues the clang compilation with options to generate LLVM bitcode. It then places the generated LLVM bitcode in a tar file and then adds that tarfile to the ELF file (.o or the exe) in a special section ".llvm_bitcode". The build-bom extract-bitcode is looking for that section to retrieve the tarfile, from which it can get the LLVM bitcode file. The reason it uses a tarfile is that tarfiles can be concatenated together and the result is also a valid tarfile. So linking multiple .o files will concatenate their sections, which gives you a tarfile with all the bitcodes in it.
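
The concatenation property can be illustrated with a minimal sketch using Python's standard `tarfile` module. This only models the idea; build-bom's actual implementation is not shown in this thread, and the member names and payloads here are hypothetical placeholders. Each archive ends with zero-filled end-of-archive blocks, so a reader has to be told to skip them to see the members of the later archives.

```python
import io
import tarfile


def make_tar(name: str, payload: bytes) -> bytes:
    """Build a one-member, uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(blob: bytes) -> list:
    """List member names, skipping end-of-archive blocks between archives."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:", ignore_zeros=True) as tar:
        return tar.getnames()


# Simulate the linker concatenating two .llvm_bitcode sections.
combined = make_tar("a.bc", b"placeholder-1") + make_tar("b.bc", b"placeholder-2")
print(list_members(combined))
```

GNU tar has a corresponding `--ignore-zeros` (`-i`) option for reading archives concatenated this way.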