XOR_DEPENDS' JIT via pre-comupted code segment pasting

Attempt to build JIT code via lookup tables of pre-assembled code bits, rather than assembling it up bit-by-bit. This should speed up JIT performance of the XOR_DEPENDS algorithm, one of its major weaknesses.

This is done firstly by interleaving dependency bits, pairwise. From each interleaved pair, mask off a portion of bits (up to 6, i.e. 3 inputs from the pair) and use this to lookup into a pre-computed code table, as well as a corresponding mask for correcting encoded registers / memory offsets.
With this method, the 32 bits from a pair can be JIT'd in 6 lots of operations, instead of the previous 3x16 = 48 instruction writes.

Initial impressions are that it gives a very modest gain (~10-20% on Haswell) in small block size (4KB) performance. Still needs further tuning, but I don't expect gains to be much.

It doesn't handle the case of eliminating the redundant XOR when the common queue isn't needed. It may never get handled as doing so likely reduces performance (since only 4 bits can be processed at a time, rather than 6), and I expect gains from better JIT'd code to be very minor (at best, removes one MOVDQA instruction from the loop).

animetosho / ParPar

XOR_DEPENDS' JIT via pre-comupted code segment pasting #1