Ptival opened this issue 2 years ago
I'm going to list things I have attempted that have not quite succeeded.

**Monomorphizing the monad**: this actually seems to make things slower, going from 130s to 170s on my running example.
It seems we spend more time in `>>=.\`, which I think means the lambda within `>>=`.
The version on master boils down to:

```
$c>>_rgvi
  = \ @ a_a9mB @ b_a9mC eta_XfM eta1_Xvx ->
      \ r1_ab2X ->
        let { m1_sbv8 = (eta_XfM `cast` <Co:10>) r1_ab2X } in
        (\ s1_ab44 ->
           case ((m1_sbv8 `cast` <Co:6>) s1_ab44) `cast` <Co:13> of {
             Left e1_ab3d -> (Left e1_ab3d) `cast` <Co:7>;
             Right x_ab3f ->
               case x_ab3f of { (a1_ab47, s'_ab48) ->
                 ((((eta1_Xvx `cast` <Co:10>) r1_ab2X) `cast` <Co:6>) s'_ab48)
                 `cast` <Co:6>
               }
           })
        `cast` <Co:17>
```
While the monomorphized version looks like:

```
$c>>_rgFk
  = \ @ a_a9dj @ b_a9dk eta_XfL eta1_Xvv ->
      \ env_a7ds st_a7dt ->
        case (eta_XfL `cast` <Co:2>) env_a7ds st_a7dt of {
          Left e_a7du -> Left e_a7du;
          Right ds_daq9 ->
            case ds_daq9 of { (st'_a7dv, a1_a7dw) ->
              (eta1_Xvv `cast` <Co:2>) env_a7ds st'_a7dv
            }
        }
```
I don't get why it's slower honestly.
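For reference, the two Core dumps above correspond roughly to the following two shapes. This is a hedged sketch with hypothetical names (`Parse1`, `Parse2`, `Env`, `St`, `Err`), not the project's actual monad:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad (ap)
import Control.Monad.Trans.Reader (ReaderT)
import Control.Monad.Trans.State (StateT)
import Control.Monad.Trans.Except (Except)

type Env = String  -- hypothetical environment
type St  = Int     -- hypothetical parser state
type Err = String  -- hypothetical error type

-- Shape 1 (master): a transformer stack. Its (>>=) is assembled from the
-- binds of each layer, which is where the `cast` noise in the Core comes from.
newtype Parse1 a = Parse1 (ReaderT Env (StateT St (Except Err)) a)
  deriving (Functor, Applicative, Monad)

-- Shape 2 (monomorphized): one flat function; (>>=) compiles to a single case.
newtype Parse2 a = Parse2 { runParse2 :: Env -> St -> Either Err (St, a) }

instance Functor Parse2 where
  fmap f m = Parse2 $ \env st -> fmap (fmap f) (runParse2 m env st)

instance Applicative Parse2 where
  pure a = Parse2 $ \_ st -> Right (st, a)
  (<*>)  = ap

instance Monad Parse2 where
  m >>= k = Parse2 $ \env st ->
    case runParse2 m env st of
      Left e         -> Left e
      Right (st', a) -> runParse2 (k a) env st'
```

In principle the second shape gives GHC less to unwrap, which makes the slowdown observed here all the more surprising.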
We spend about 4% of the time in `getBitStringPartial`. I tried to see if we could cut down on that by avoiding a redundant check and doing some manual inlining of `splitWord` in the process, based on information we get. This barely makes any difference in my tests, unfortunately.
**Keying `PValMd` by their hash rather than their values** (c8435b45e56411103a4fb319121c9d06a83eb57c): the `dedupMetadata` function performs many lookups in a map keyed by values of type `PValMd`. Instead, I tried to implement this as a `HashMap` keyed by the hashes of those same values, with bins as values to handle collisions correctly. This is mildly faster, but allocates slightly more.
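As a sketch of that attempt, with a type variable `k` standing in for `PValMd` (this assumes `unordered-containers` and `hashable`; the names `BinnedMap`, `insertBinned`, and `lookupBinned` are mine, not the repo's):

```haskell
import qualified Data.HashMap.Strict as HashMap
import Data.Hashable (Hashable, hash)

-- Keyed by the precomputed hash; each bin holds the (key, value) pairs
-- whose keys collide, so full structural equality is only checked
-- within a (usually tiny) bin.
type BinnedMap k v = HashMap.HashMap Int [(k, v)]

insertBinned :: (Eq k, Hashable k) => k -> v -> BinnedMap k v -> BinnedMap k v
insertBinned k v =
  HashMap.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                     (hash k) [(k, v)]

lookupBinned :: (Eq k, Hashable k) => k -> BinnedMap k v -> Maybe v
lookupBinned k m = HashMap.lookup (hash k) m >>= lookup k
```

The extra allocation observed could plausibly come from the bin lists themselves, which exist even when there are no collisions.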
I'm still not quite sure how to dissect the main cost centers further. One of them is the pretty-printing and the output IO, but the other two are just reported as "the continuation of `>>=`" (one for the `Parse` monad, one for the `GetBits` monad), and I'm not sure how to get more details on this.
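One way to get more detail here (an assumption on my part that it applies cleanly, since the real monads are more involved) is to add manual cost-centre annotations inside `(>>=)`, so a profiled build splits time between the first action and its continuation instead of lumping both under `>>=`. Sketched on a hypothetical flat parser monad:

```haskell
newtype ParseM a = ParseM { runParseM :: Int -> Either String (Int, a) }

instance Functor ParseM where
  fmap f m = ParseM $ \st -> fmap (fmap f) (runParseM m st)

instance Applicative ParseM where
  pure a    = ParseM $ \st -> Right (st, a)
  mf <*> mx = mf >>= \f -> fmap f mx

instance Monad ParseM where
  m >>= k = ParseM $ \st ->
    -- Manual SCCs: with -prof, the .prof report now shows "bind_head"
    -- and "bind_cont" as separate cost centres.
    case {-# SCC "bind_head" #-} runParseM m st of
      Left e         -> Left e
      Right (st', a) -> {-# SCC "bind_cont" #-} runParseM (k a) st'
```

The pragmas are inert without profiling enabled, so they can be left in place.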
Thanks for digging into this! It sounds like there might not be much low-hanging fruit, and that progress might require us to just rethink how we parse (i.e., avoid eagerly parsing everything in the bitcode file).
Agreed: knowing what didn't work is helpful!
One item to think about is that the bitcode documentation (https://www.llvm.org/docs/BitCodeFormat.html#blocks) makes the assertion that
> Block definitions allow the reader to efficiently skip blocks in constant time if the reader wants a summary of blocks, or if it wants to efficiently skip data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.
I think our current nested-monad scheme doesn't really use (and perhaps cannot use) this technique. I don't know if it would represent savings in the event that full details were known, but if parsing of some blocks could be lazily deferred that might be an area for a reasonable win.
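Mechanically, that skip is cheap because `ENTER_SUBBLOCK` records the block's length in 32-bit words. A minimal sketch, assuming a plain strict `ByteString` cursor (which is not how the parser is actually structured):

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word32)

-- Skip a block body in O(1): per the bitcode format, blockLen counts
-- 32-bit words, so the body occupies blockLen * 4 bytes. BS.drop on a
-- strict ByteString is constant time.
skipBlock :: Word32 -> BS.ByteString -> BS.ByteString
skipBlock blockLen = BS.drop (fromIntegral blockLen * 4)
```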
Yeah, I think we can try to implement those skips, though for function bodies, I believe we are reading them all anyway, since we output them?
I found the `profiteur` tool output to actually be much more interesting! Here is an output, expanded a lot, and annotated:

The regions correspond to:

- `dedupMetadata`
- `commas` as part of `ppCall`
- `ppCall`

Region 8 should be possible to drastically reduce by:

- replacing `fsep` in favor of `hsep`, as I proposed here: https://github.com/elliottt/llvm-pretty/pull/89
- optimizing `fsep` for cases where the document width is unbounded, which I have suggested here: https://github.com/quchen/prettyprinter/issues/215 (even though we currently use `pretty` and not `prettyprinter`, I think the fix would likely be the same)

This list is interesting. @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around `GetBits` are called a lot, and allocating anything is going to be really bad.

(Of course, all of this would be complemented well by doing less work by disassembling on-demand, but optimizing the individual bits will still help that.)
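For the earlier `fsep`-vs-`hsep` point: with the `pretty` package, both produce identical output when everything fits on one line, but `fsep` pays for line-fitting bookkeeping that `hsep` skips entirely. A small example:

```haskell
import Text.PrettyPrint (Doc, render, text, fsep, hsep, punctuate, comma)

-- A stand-in argument list, as in a rendered call instruction.
args :: [Doc]
args = map text ["%0", "%1", "%2"]

-- fsep: paragraph-fill layout, breaks lines when the page width is exceeded.
-- hsep: unconditionally joins with spaces; no fitting computation.
withFsep, withHsep :: String
withFsep = render (fsep (punctuate comma args))
withHsep = render (hsep (punctuate comma args))
```

Since our output is effectively unbounded-width, the two should agree on every line, making `hsep` a behavior-preserving swap in those spots.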
> @RyanGlScott do you think any of these might be interesting candidates for unlifted data tricks? Particularly those parts around `GetBits` are called a lot, and allocating anything is going to be really bad.
At a glance, I don't know how much `GetBits` would benefit from `UnliftedDatatypes`. The definition of `GetBits` is:

That's pretty svelte to begin with, and the `SubWord` data type is itself pretty straightforward:
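For readers without the repo handy, the rough shapes under discussion are something like the following. This is a hedged reconstruction, not the verbatim definitions: `GetBits` as a small state function over `binary`'s `Get`, threading the partially consumed word:

```haskell
import Data.Binary.Get (Get)
import Data.Word (Word32)

-- Sketch: thread the leftover bits of the current 32-bit word through Get.
newtype GetBits a = GetBits { unGetBits :: SubWord -> Get (a, SubWord) }

-- Sketch: either we are word-aligned, or we hold some leftover bits
-- of a partially consumed word.
data SubWord
  = NoSubWord
  | SubWord !Int !Word32
```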
I'd be inclined to try optimizing `Get` before the surrounding parts of `GetBits`. (Of course, this is all speculation on my part.)
One important element here is avoiding GHC 9.0.2:

```shell
$ time -f "%e real secs, %M KB $(ghc --version | sed -e 's/.*version //')" cabal run llvm-disasm -- input-700KB-file.bc | wc
```

gave me:

| real secs | Mem (KB) | GHC ver |
|---|---|---|
| 1.81 | 160464 | 8.8.4 |
| 1.80 | 159120 | 8.10.7 |
| 45.81 | 2958812 | 9.0.2 |
| 1.66 | 195076 | 9.2.4 |

where `wc` verified that all the outputs were 31588 lines long.
Wow, that's incredible. I always had an odd feeling about GHC 9.0.2's performance, and this is a strong piece of evidence to justify that feeling.
`llvm-disasm` is noticeably slower than its LLVM counterpart. This issue will attempt to keep track of data regarding this performance discrepancy and attempts at improving the situation.

This is the current flame graph of running `llvm-disasm` on a ~30MB PX4 bitcode file generated by Clang 12:

There is an almost even divide between the time spent parsing and the time spent pretty-printing.
### Parsing

It seems like one third of the time is spent gobbling bytes from the input stream. Perhaps one could check whether there is something inherently inefficient in the way we consume input.

The other two thirds seem dominated by `parseModuleBlock`:

From left to right, the columns correspond to `Data.LLVM.BitCode.IR.Function.finalizeStmt`, `Data.Generics.Uniplate.Internal.Data.descendBiData`, and `Data.LLVM.BitCode.IR.Function.parseFunctionBlockEntry`.

### Pretty-printing
(NOTE: this graph has been obtained with a branch where I replaced the `pretty` package with `prettyprinter`, to see if it made a difference. It does not do much on its own.)

On the pretty-printing side, we only see details of two cost centres: `Prettyprinter.Render.Text.renderIO` and `Text.LLVM.PP.ppModule`. The former seems unavoidable; the latter is decomposed as such:

I haven't looked into this yet, but those `ppDebugLoc'` and `ppDebugInfo'` sure make things slow. I wonder if there's something to look into there.

But, going back up a step, the pretty-printing flame graph also had this huge, unlabeled portion. My guess is that it likely corresponds to `Prettyprinter.layoutPretty`, since this is definitely called, doesn't appear elsewhere, and is likely heavy in computation. I'm not sure why it is not reported, though.

As mentioned, I tried to replace `pretty` with `prettyprinter`, and it's not much faster, even with an unbounded-width layout. It definitely goes faster when using `renderCompact`, but that one has very ugly output. I wonder if there's something we can do in between, given there's barely any interesting pretty-printing going on in our output.