Open Boarders opened 4 years ago
I'll just note that in the dynamic character rework that should be merged into master
soon, the type of DynamicCharacter
will be changed to:
data DynamicCharacter
= Missing {-# UNPACK #-} !Word
| DC {-# UNPACK #-} !(Vector (BitVector, BitVector, BitVector))
I think it would be great to follow your idea and make either of these:
data DynamicCharacterUnboxedByte
= Missing {-# UNPACK #-} !Word
| DC {-# UNPACK #-} !(Vector (Word8, Word8, Word8))
data DynamicCharacterUnboxedByte
= Missing {-# UNPACK #-} !Word
| DC {-# UNPACK #-} !(ByteString, ByteString, ByteString)
Whichever of the above empirically appears to be more performant.
With the "small" unboxed byte variant of the dynamic character, we can replace using the memoized TCM with a reference to looking up costs and medians in a completely pre-computed TCM.
Word
) for an alphabet of size 8.I think the performance gain we can get in the general case (where we don't special case the "discrete metric" or "L1-norm metric") will be an additional boon from this approach.
Additionally, there might be a use case for unboxed dynamic characters for alphabets that fit into a Word
:
data DynamicCharacterUnboxedWord
= Missing {-# UNPACK #-} !Word
| DC {-# UNPACK #-} !(Vector (Word, Word, Word))
I think that the unboxed versions of the string alignment will be more performant than using the boxed BitVector
s. The Word
-sized dynamic character would cover protein sequences and some small language alphabets.
We would still need to use the memoized TCM for storing the costs and medians, but there might be some benefit of the types being Unboxed
. Perhaps we could write a quick fork of the hashtables
or concurrent-hashtables
library to use an unboxed mutable vector "under the hood" rather the boxed one in the published version of the libraries. There might be a noticeable performance gain to be had in doing this.
@Boarders , do you think these ideas are reasonable and in line with what the new character type is capable of?
@recursion-ninja : Those ideas sound reasonable to me and fit perfectly into this proposal. I hadn't considered the impact on the TCM but that makes perfect sense. I'm unsure what underlying representation the various hashtables
libraries use but it sounds definitely worth a look. I think it might, in any case, be that hashing is quite a bit faster for these smaller types so that might be an added benefit for hashtable look up.
I forgot about the time it takes to generate the hash!
I imagine that for (Word, Word)
it would be as simple as XOR, or XOR with a salt. Probably faster in practice than folding XOR over the byte array behind the Natural
of a BitVector
.
Since I had the code already available I ran ByteString
vs Unboxed.Vector
on the n = 100000
input and it looks like ByteString
has some pretty poor performance for this sort of thing (I think this is because some of the ByteString fusion framework isn't happy with this example, when I get some time I will make an issue on bytestring to see if this can be fixed or is inherent).
The relevant bits of code are:
-- Note these do not keep the costs as that was awkward to do and I wanted a quick benchmark
fitch' :: (Bits b) => b -> b -> b
fitch' l r
| popCount (l .&. r) > 0 = l .&. r
| otherwise = l .|. r
fitchByteString' :: ByteString -> ByteString -> ByteString
fitchByteString' l r =
ByteString.pack (ByteString.zipWith (fitch') l r)
fitchUnbox'
:: Unboxed.Vector Word8 -> Unboxed.Vector Word8 -> Unboxed.Vector Word8
fitchUnbox' = Unboxed.zipWith fitch'
The results are as follows:
benchmarking characters/fitch'-unbox
time 423.5 μs (420.9 μs .. 427.2 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 423.0 μs (421.8 μs .. 424.7 μs)
std dev 4.827 μs (3.398 μs .. 6.618 μs)
benchmarking characters/fitch'-bytestring
time 4.133 ms (4.097 ms .. 4.183 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 4.107 ms (4.075 ms .. 4.140 ms)
std dev 99.11 μs (72.28 μs .. 142.7 μs)
Bytestring also presents a challenge with keeping the cost so in some sense it is fortunate the performance is 10x worse.
Yikes! That's way worse than I expected.
Do you think ByteArray
would behave better? I wrote ByteString
in my example above, but what I actually had in mind was byte array from Haskell's primitives.
It looks like if we use the supplied primitives from GHC.Prim
then we will have to write our own stateful fold by hand using unlifted types. It might be worth it, it might be comparable to Vector Word8
which has a nicer interface to use.
The problem
Currently our character data is represented as follows:
Our bit vectors and bit matrices are based around @recursion-ninja's bv-little library and so these are represented as follows:
Here the arbitrary precision natural number is represented in Haskell in accordance with the gmp library. This works extremely well for arbitrary alphabet sizes encoded as arbitrary length bit vectors but is space and time inefficient for some typical inputs where we know the precise alphabet size and it is small. For example, if we have DNA characters then these can be stored in a single
Word8
. This then has the added benefit that a block of such characters can be represented by an unboxed vector.In order to measure this I took an input of the form:
and wrote a fitch style operation (identical to the one we use on the Haskell side of our code):
which I then zipped over blocks of bitvectors encoded with the two different inputs (the benchmarks are here: https://github.com/Boarders/Char-Benchmarks/blob/master/Char_Bench.hs). I used a boxed vector of natural numbers as a stand-in for our current arbitrary length bit vectors and unboxed vectors of both Word8 and Word64 for possible new encodings.
If I take
n = 100
(that is to say, a character block of length 100) then the results are as follows:Taking
n = 100000
gives the following:Note here that, from the perspective of these operations, boxed vectors of arbitrary precision nats do not scale linearly whereas it looks like that is the case (in this range) for unboxed vectors (this is probably due to whether the input can be stored in L1/L2 cache and various other mysteries of modern hardware).
This suggests there is a lot to be gained by encoding shorter alphabets in fixed width types.
Proposal
I think we should take the following steps:
data DynamicCharacter8 = Missing {-# UNPACK #-} !Word | DC8 {-# UNPACK #-} !(Unboxed.Vector Word8) -- etc.
This is elaborate as a result of the intel intrinsics, but if you squint it is just the usual fitch-style operation. When I have the time I am going test, via ffi, what the performance is like for this when called from Haskell. I should note that such SIMD instructions will eventually be added to the native code generator of Haskell and as such if we were to use any ffi code it would be (hopefully) short-lived. We will need to experiment with using this via ffi in Haskell to see whether it is actually worthwhile (I haven't done this yet as this function expects 16 aligned inputs and so some finessing needs to take place on the Haskell side beyond the ffi finessing).
Downsides
This leads to multiple different character representations which will lead to both more code with similar (but subtlety different) operations. We will probably have to write boxed and unboxed implementations for several operations (for example a fitch-style operation). This therefore will lead to more code and probably a large-ish re-factor (though i don't envision it being too painful). There are probably others I haven't thought about.
(Other) Upsides
Beyond those mentioned already in terms of performance, I think this type of project will lead to packed characters being much easier to add to PCG. On top of this, if the SIMD 128 bit operations have good performance and we can figure out the right sort of bit masking (also using intel intrinsics) then this can lead to potentially 16x single core parallelism on characters that fit in a single byte (though that doesn't take into account the masking cost). I think it is well worth exploring this possibility.