animetosho / ParPar

High performance PAR2 create client for NodeJS
193 stars 19 forks source link

XOR_DEPENDS' JIT via pre-comupted code segment pasting #1

Closed animetosho closed 8 years ago

animetosho commented 8 years ago

Attempt to build JIT code via lookup tables of pre-assembled code bits, rather than assembling it up bit-by-bit. This should speed up JIT performance of the XOR_DEPENDS algorithm, one of its major weaknesses.

This is done firstly by interleaving dependency bits, pairwise. From each interleaved pair, mask off a portion of bits (up to 6, i.e. 3 inputs from the pair) and use this to lookup into a pre-computed code table, as well as a corresponding mask for correcting encoded registers / memory offsets.
With this method, the 32 bits from a pair can be JIT'd in 6 lots of operations, instead of the previous 3x16 = 48 instruction writes.

Initial impressions are that it gives a very modest gain (~10-20% on Haswell) in small block size (4KB) performance. Still needs further tuning, but I don't expect gains to be much.

It doesn't handle the case of eliminating the redundant XOR when the common queue isn't needed. It may never get handled as doing so likely reduces performance (since only 4 bits can be processed at a time, rather than 6), and I expect gains from better JIT'd code to be very minor (at best, removes one MOVDQA instruction from the loop).

animetosho commented 8 years ago

After all the optimizations are in, speed gains at 4KB block sizes:

Not significant by any measure, but a gain is a gain I suppose.