Ptival opened this issue 2 years ago
I'm going to list things I have attempted that have not quite succeeded.

**Monomorphizing the monad**: this actually seems to make things slower, going from 130s to 170s on my running example.
It seems we spend more time in `>>=.\`, which I think means the lambda within `>>=`.
The version on master boils down to:

```
$c>>_rgvi
  = \ @ a_a9mB @ b_a9mC eta_XfM eta1_Xvx ->
      \ r1_ab2X ->
        let { m1_sbv8 = (eta_XfM `cast` <Co:10>) r1_ab2X } in
        (\ s1_ab44 ->
           case ((m1_sbv8 `cast` <Co:6>) s1_ab44) `cast` <Co:13> of {
             Left e1_ab3d -> (Left e1_ab3d) `cast` <Co:7>;
             Right x_ab3f ->
               case x_ab3f of { (a1_ab47, s'_ab48) ->
                 ((((eta1_Xvx `cast` <Co:10>) r1_ab2X) `cast` <Co:6>) s'_ab48)
                 `cast` <Co:6>
               }
           })
        `cast` <Co:17>
```
While the monomorphized version looks like:

```
$c>>_rgFk
  = \ @ a_a9dj @ b_a9dk eta_XfL eta1_Xvv ->
      \ env_a7ds st_a7dt ->
        case (eta_XfL `cast` <Co:2>) env_a7ds st_a7dt of {
          Left e_a7du -> Left e_a7du;
          Right ds_daq9 ->
            case ds_daq9 of { (st'_a7dv, a1_a7dw) ->
              (eta1_Xvv `cast` <Co:2>) env_a7ds st'_a7dv
            }
        }
```
I don't get why it's slower honestly.
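For reference, the two Core dumps above correspond roughly to the following two shapes. This is a hedged sketch with hypothetical names (`Parse1`, `Parse2`, `Env`, `St`, `Err`), not the project's actual monad:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad (ap)
import Control.Monad.Trans.Reader (ReaderT)
import Control.Monad.Trans.State (StateT)
import Control.Monad.Trans.Except (Except)

type Env = String  -- hypothetical environment
type St  = Int     -- hypothetical parser state
type Err = String  -- hypothetical error type

-- Shape 1 (master): a transformer stack. Its (>>=) is assembled from the
-- binds of each layer, which is where the `cast` noise in the Core comes from.
newtype Parse1 a = Parse1 (ReaderT Env (StateT St (Except Err)) a)
  deriving (Functor, Applicative, Monad)

-- Shape 2 (monomorphized): one flat function; (>>=) compiles to a single case.
newtype Parse2 a = Parse2 { runParse2 :: Env -> St -> Either Err (St, a) }

instance Functor Parse2 where
  fmap f m = Parse2 $ \env st -> fmap (fmap f) (runParse2 m env st)

instance Applicative Parse2 where
  pure a = Parse2 $ \_ st -> Right (st, a)
  (<*>)  = ap

instance Monad Parse2 where
  m >>= k = Parse2 $ \env st ->
    case runParse2 m env st of
      Left e         -> Left e
      Right (st', a) -> runParse2 (k a) env st'
```

In principle the second shape gives GHC less to unwrap, which makes the slowdown observed here all the more surprising.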
We spend about 4% of the time in `getBitStringPartial`. I tried to see if we could cut down on that by avoiding a redundant check and doing some manual inlining of `splitWord` in the process, based on information we get. This barely makes any difference in my tests, unfortunately.
**Keying `PValMd` by their hash rather than their values** (c8435b45e56411103a4fb319121c9d06a83eb57c): the `dedupMetadata` function performs many lookups in a map keyed by values of type `PValMd`. Instead, I tried to implement this as a `HashMap` keyed by the hashes of those same values, with bins as values to handle collisions correctly. This is mildly faster, but allocates slightly more.
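As a sketch of that attempt, with a type variable `k` standing in for `PValMd` (this assumes `unordered-containers` and `hashable`; the names `BinnedMap`, `insertBinned`, and `lookupBinned` are mine, not the repo's):

```haskell
import qualified Data.HashMap.Strict as HashMap
import Data.Hashable (Hashable, hash)

-- Keyed by the precomputed hash; each bin holds the (key, value) pairs
-- whose keys collide, so full structural equality is only checked
-- within a (usually tiny) bin.
type BinnedMap k v = HashMap.HashMap Int [(k, v)]

insertBinned :: (Eq k, Hashable k) => k -> v -> BinnedMap k v -> BinnedMap k v
insertBinned k v =
  HashMap.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                     (hash k) [(k, v)]

lookupBinned :: (Eq k, Hashable k) => k -> BinnedMap k v -> Maybe v
lookupBinned k m = HashMap.lookup (hash k) m >>= lookup k
```

The extra allocation observed could plausibly come from the bin lists themselves, which exist even when there are no collisions.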
I'm still not quite sure how to dissect the main cost centers further. One of them is the pretty-printing and the output IO, but the other two are just reported as "the continuation of `>>=`" (one for the `Parse` monad, one for the `GetBits` monad), and I'm not sure how to get more details on this.
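One way to get more detail here (an assumption on my part that it applies cleanly, since the real monads are more involved) is to add manual cost-centre annotations inside `(>>=)`, so a profiled build splits time between the first action and its continuation instead of lumping both under `>>=`. Sketched on a hypothetical flat parser monad:

```haskell
newtype ParseM a = ParseM { runParseM :: Int -> Either String (Int, a) }

instance Functor ParseM where
  fmap f m = ParseM $ \st -> fmap (fmap f) (runParseM m st)

instance Applicative ParseM where
  pure a    = ParseM $ \st -> Right (st, a)
  mf <*> mx = mf >>= \f -> fmap f mx

instance Monad ParseM where
  m >>= k = ParseM $ \st ->
    -- Manual SCCs: with -prof, the .prof report now shows "bind_head"
    -- and "bind_cont" as separate cost centres.
    case {-# SCC "bind_head" #-} runParseM m st of
      Left e         -> Left e
      Right (st', a) -> {-# SCC "bind_cont" #-} runParseM (k a) st'
```

The pragmas are inert without profiling enabled, so they can be left in place.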
Thanks for digging into this! It sounds like there might not be much low-hanging fruit, and that progress might require us to just rethink how we parse (i.e., avoid eagerly parsing everything in the bitcode file).
Agreed: knowing what didn't work is helpful!
One item to think about is that the bitcode documentation (https://www.llvm.org/docs/BitCodeFormat.html#blocks) makes the assertion that
> Block definitions allow the reader to efficiently skip blocks in constant time if the reader wants a summary of blocks, or if it wants to efficiently skip data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.
I think our current nested-monad scheme doesn't really use (and perhaps cannot use) this technique. I don't know if it would represent savings in the event that full details were known, but if parsing of some blocks could be lazily deferred that might be an area for a reasonable win.
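Mechanically, that skip is cheap because `ENTER_SUBBLOCK` records the block's length in 32-bit words. A minimal sketch, assuming a plain strict `ByteString` cursor (which is not how the parser is actually structured):

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word32)

-- Skip a block body in O(1): per the bitcode format, blockLen counts
-- 32-bit words, so the body occupies blockLen * 4 bytes. BS.drop on a
-- strict ByteString is constant time.
skipBlock :: Word32 -> BS.ByteString -> BS.ByteString
skipBlock blockLen = BS.drop (fromIntegral blockLen * 4)
```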
Yeah, I think we can try to implement those skips, though for function bodies, I believe we are reading them all anyway, since we output them?
I found the `profiteur` tool output to actually be much more interesting! Here is an output, expanded a lot, and annotated:

The regions correspond to:

- `dedupMetadata`
- `commas` as part of `ppCall`
- `ppCall`

Region 8 should be possible to drastically reduce by:

- replacing `fsep` in favor of `hsep`, as I proposed here: https://github.com/elliottt/llvm-pretty/pull/89
- optimizing `fsep` for cases where the document width is unbounded, which I have suggested here: https://github.com/quchen/prettyprinter/issues/215 (even though we currently use `pretty` and not `prettyprinter`, I think the fix would likely be the same)

This list is interesting. @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around `GetBits` are called a lot, and allocating anything is going to be really bad.

(Of course, all of this would be complemented well by doing less work by disassembling on-demand, but optimizing the individual bits will still help that.)
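For the earlier `fsep`-vs-`hsep` point: with the `pretty` package, both produce identical output when everything fits on one line, but `fsep` pays for line-fitting bookkeeping that `hsep` skips entirely. A small example:

```haskell
import Text.PrettyPrint (Doc, render, text, fsep, hsep, punctuate, comma)

-- A stand-in argument list, as in a rendered call instruction.
args :: [Doc]
args = map text ["%0", "%1", "%2"]

-- fsep: paragraph-fill layout, breaks lines when the page width is exceeded.
-- hsep: unconditionally joins with spaces; no fitting computation.
withFsep, withHsep :: String
withFsep = render (fsep (punctuate comma args))
withHsep = render (hsep (punctuate comma args))
```

Since our output is effectively unbounded-width, the two should agree on every line, making `hsep` a behavior-preserving swap in those spots.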
> @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around `GetBits` are called a lot, and allocating anything is going to be really bad.
At a glance, I don't know how much `GetBits` would benefit from `UnliftedDatatypes`. The definition of `GetBits` is:

That's pretty svelte to begin with, and the `SubWord` data type is itself pretty straightforward:
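For readers without the repo handy, the rough shapes under discussion are something like the following. This is a hedged reconstruction, not the verbatim definitions: `GetBits` as a small state function over `binary`'s `Get`, threading the partially consumed word:

```haskell
import Data.Binary.Get (Get)
import Data.Word (Word32)

-- Sketch: thread the leftover bits of the current 32-bit word through Get.
newtype GetBits a = GetBits { unGetBits :: SubWord -> Get (a, SubWord) }

-- Sketch: either we are word-aligned, or we hold some leftover bits
-- of a partially consumed word.
data SubWord
  = NoSubWord
  | SubWord !Int !Word32
```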
I'd be inclined to try optimizing `Get` before the surrounding parts of `GetBits`. (Of course, this is all speculation on my part.)
One important element here is avoiding GHC 9.0.2:

```shell
$ time -f "%e real secs, %M KB $(ghc --version | sed -e 's/.*version //')" cabal run llvm-disasm -- input-700KB-file.bc | wc
```

gave me:

| real secs | Mem (KB) | GHC ver |
|---|---|---|
| 1.81 | 160464 | 8.8.4 |
| 1.80 | 159120 | 8.10.7 |
| 45.81 | 2958812 | 9.0.2 |
| 1.66 | 195076 | 9.2.4 |

where `wc` verified that all the outputs were 31588 lines long.
Wow, that's incredible. I always had an odd feeling about GHC 9.0.2's performance, and this is a strong piece of evidence to justify that feeling.
`llvm-disasm` is noticeably slower than its LLVM counterpart. This issue will attempt to keep track of data regarding this performance discrepancy and attempts at improving the situation.

This is the current flame graph of running `llvm-disasm` on a ~30MB PX4 bitcode file generated by Clang 12:

There is an almost even divide between the time spent parsing and the time spent pretty-printing.
### Parsing

It seems like one third of the time is spent gobbling bytes from the input stream. Perhaps one could check whether there is something inherently inefficient in the way we consume input.

The other two thirds seem dominated by `parseModuleBlock`:

From left to right, the columns correspond to `Data.LLVM.BitCode.IR.Function.finalizeStmt`, `Data.Generics.Uniplate.Internal.Data.descendBiData`, and `Data.LLVM.BitCode.IR.Function.parseFunctionBlockEntry`.

### Pretty-printing
(NOTE: this graph has been obtained with a branch where I replaced the `pretty` package with `prettyprinter`, to see if it made a difference. It does not do much on its own.)

On the pretty-printing side, we only see details of two cost centres: `Prettyprinter.Render.Text.renderIO` and `Text.LLVM.PP.ppModule`. The former seems unavoidable; the latter is decomposed as such:

I haven't looked into this yet, but those `ppDebugLoc'` and `ppDebugInfo'` sure make things slow. I wonder if there's something to look into there.

But, going back up a step, the pretty-printing flame graph also had this huge, unlabeled portion. My guess is that it likely corresponds to `Prettyprinter.layoutPretty`, since this is definitely called, doesn't appear elsewhere, and is likely heavy in computation. I'm not sure why it is not reported, though.

As mentioned, I tried to replace `pretty` with `prettyprinter`, and it's not much faster, even with an unbounded-width layout. It definitely goes faster when using `renderCompact`, but that one has very ugly output. I wonder if there's something we can do in between, given there's barely any interesting pretty-printing going on in our output.