Efficient Character Representations

Boarders commented 4 years ago

The problem

Currently our character data is represented as follows:

newtype ContinuousCharacter = CC (ExtendedReal, ExtendedReal)
-- where for the sake of completeness ExtendedReal is defined as:
newtype ExtendedReal = Cost Double

newtype StaticCharacter = SC BitVector

data  DynamicCharacter
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !BitMatrix

Our bit vectors and bit matrices are based around @recursion-ninja's bv-little library and so these are represented as follows:

data  BitMatrix
    = BitMatrix {-# UNPACK #-} !Int {-# UNPACK #-} !BitVector

data  BitVector
    = BV
    { dim :: {-# UNPACK #-} !Word
    , nat :: !Natural
    }

Here the arbitrary precision natural number is represented in Haskell in accordance with the gmp library. This works extremely well for arbitrary alphabet sizes encoded as arbitrary length bit vectors but is space and time inefficient for some typical inputs where we know the precise alphabet size and it is small. For example, if we have DNA characters then these can be stored in a single Word8. This then has the added benefit that a block of such characters can be represented by an unboxed vector.

In order to measure this I took an input of the form:

input = fromList $ take n (cycle [1,2,3,4])

and wrote a fitch style operation (identical to the one we use on the Haskell side of our code):

fitch :: (Bits b) => b -> b -> (b, Word)
fitch l r
  | popCount (l .&. r) > 0 = (l .&. r, 0)
  | otherwise              = (l .|. r, 1)

which I then zipped over blocks of bitvectors encoded with the two different inputs (the benchmarks are here: https://github.com/Boarders/Char-Benchmarks/blob/master/Char_Bench.hs). I used a boxed vector of natural numbers as a stand-in for our current arbitrary length bit vectors and unboxed vectors of both Word8 and Word64 for possible new encodings.

If I take n = 100 (that is to say, a character block of length 100) then the results are as follows:

benchmarking characters/fitch-unboxed-Word8
time                 599.2 ns   (593.6 ns .. 608.8 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 616.6 ns   (607.9 ns .. 627.8 ns)
std dev              31.77 ns   (27.83 ns .. 36.50 ns)
variance introduced by outliers: 69% (severely inflated)

benchmarking characters/fitch-unboxed-Word64
time                 777.8 ns   (770.5 ns .. 786.1 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 784.3 ns   (776.1 ns .. 793.6 ns)
std dev              29.44 ns   (22.43 ns .. 36.34 ns)
variance introduced by outliers: 53% (severely inflated)

benchmarking characters/fitch-boxed-natural
time                 2.191 μs   (2.141 μs .. 2.231 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 2.111 μs   (2.080 μs .. 2.147 μs)
std dev              112.9 ns   (87.44 ns .. 146.2 ns)
variance introduced by outliers: 68% (severely inflated)

Taking n = 100000 gives the following:

benchmarking characters/fitch-unboxed-Word8
time                 571.2 μs   (567.8 μs .. 575.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 577.5 μs   (573.3 μs .. 585.3 μs)
std dev              18.52 μs   (12.56 μs .. 27.81 μs)
variance introduced by outliers: 23% (moderately inflated)

benchmarking characters/fitch-unboxed-Word64
time                 727.1 μs   (721.5 μs .. 732.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 721.3 μs   (719.1 μs .. 723.9 μs)
std dev              7.784 μs   (6.223 μs .. 10.63 μs)

benchmarking characters/fitch-boxed-natural
time                 11.33 ms   (11.23 ms .. 11.42 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 11.52 ms   (11.46 ms .. 11.58 ms)
std dev              152.0 μs   (123.4 μs .. 194.4 μs)

Note here that, from the perspective of these operations, boxed vectors of arbitrary precision nats do not scale linearly whereas it looks like that is the case (in this range) for unboxed vectors (this is probably due to whether the input can be stored in L1/L2 cache and various other mysteries of modern hardware).

This suggests there is a lot to be gained by encoding shorter alphabets in fixed width types.

Proposal

I think we should take the following steps:

Introduce new character types of the form:


newtype StaticCharacter8  = StaticCharacter8 Word8
newtype StaticCharacter64 = StaticCharacter64 Word64
newtype StaticCharacter128 = StaticCharacter128 Word128   -- we should see if these
newtype StaticCharacter256 = StaticCharacter256 Word2256 -- pay their way in performance

data DynamicCharacter8 = Missing {-# UNPACK #-} !Word | DC8 {-# UNPACK #-} !(Unboxed.Vector Word8) -- etc.


This will require the character blocks can easily be extensible with new types which the design in the `new-graph-representation` branch allows for. We can also make life easier by having lenses onto the various components of static and dynamic characters from a character block. That should make it seem as though not much has changed from the perspective of a high level user of the character sequence type.

  - For the continuous character I don't think there are many gains to be had and these benchmarks don't measure anything about those types but it is perhaps worth changing the tuple to a record of `ExtendedReal`s which is strict in each field as this is usually an easy performance gain.

  - Using fixed width types opens up the possibility that a character block can use an unboxed vector. We can do that for all of the aforementioned characters and also we can write an `Unbox` instance for the continuous character type (which will, probably in turn, require such an instance to be written for an ExtendedReal`). This should offer good performance gains.

  - Above I have listed types `Word128` and `Word256`. Most modern CPUs have single instructions for performing operations on 128 or 256 wide registers. For example I wrote (and tested) the following C code:

```C
uint64_t* mmfitch (int len, const uint64_t* bv1, const uint64_t* bv2){
  uint64_t* res  __attribute__((aligned (16))) = malloc(len*8);
  uint64_t* start = res;
  for (int i = 0; i < (len + 1) / 2; i++){
    __m128i lvals = _mm_loadu_si128((__m128i*) bv1);
    __m128i rvals = _mm_loadu_si128((__m128i*) bv2);
    __m128i __mres;
    if (!_mm_testz_si128(lvals, rvals)){
        // if there is an intersection then return it.
      __mres = _mm_and_si128(lvals, rvals);
      }
    else {
      // otherwise take the union.
      __mres = _mm_or_si128 (lvals, rvals);
    }
    _mm_store_si128((__m128i*) res, __mres);
    bv1 +=2;
    bv2 +=2;
    res +=2;
  }
  return start;
}

This is elaborate as a result of the intel intrinsics, but if you squint it is just the usual fitch-style operation. When I have the time I am going test, via ffi, what the performance is like for this when called from Haskell. I should note that such SIMD instructions will eventually be added to the native code generator of Haskell and as such if we were to use any ffi code it would be (hopefully) short-lived. We will need to experiment with using this via ffi in Haskell to see whether it is actually worthwhile (I haven't done this yet as this function expects 16 aligned inputs and so some finessing needs to take place on the Haskell side beyond the ffi finessing).

Downsides

This leads to multiple different character representations which will lead to both more code with similar (but subtlety different) operations. We will probably have to write boxed and unboxed implementations for several operations (for example a fitch-style operation). This therefore will lead to more code and probably a large-ish re-factor (though i don't envision it being too painful). There are probably others I haven't thought about.

(Other) Upsides

Beyond those mentioned already in terms of performance, I think this type of project will lead to packed characters being much easier to add to PCG. On top of this, if the SIMD 128 bit operations have good performance and we can figure out the right sort of bit masking (also using intel intrinsics) then this can lead to potentially 16x single core parallelism on characters that fit in a single byte (though that doesn't take into account the masking cost). I think it is well worth exploring this possibility.

recursion-ninja commented 4 years ago

I'll just note that in the dynamic character rework that should be merged into master soon, the type of DynamicCharacter will be changed to:

data  DynamicCharacter
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !(Vector (BitVector, BitVector, BitVector))

I think it would be great to follow your idea and make either of these:

data  DynamicCharacterUnboxedByte
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !(Vector (Word8, Word8, Word8))

data  DynamicCharacterUnboxedByte
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !(ByteString, ByteString, ByteString)

Whichever of the above empirically appears to be more performant.

With the "small" unboxed byte variant of the dynamic character, we can replace using the memoized TCM with a reference to looking up costs and medians in a completely pre-computed TCM.

This will take 64KB to store the medians for an alphabet of size 8.
This will take 512KB to store the costs (as a Word) for an alphabet of size 8.
This will take 1KB + 8KB for an alphabet of size 5 (like DNA).

I think the performance gain we can get in the general case (where we don't special case the "discrete metric" or "L1-norm metric") will be an additional boon from this approach.

Additionally, there might be a use case for unboxed dynamic characters for alphabets that fit into a Word:

data  DynamicCharacterUnboxedWord
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !(Vector (Word, Word, Word))

I think that the unboxed versions of the string alignment will be more performant than using the boxed BitVectors. The Word-sized dynamic character would cover protein sequences and some small language alphabets.

We would still need to use the memoized TCM for storing the costs and medians, but there might be some benefit of the types being Unboxed. Perhaps we could write a quick fork of the hashtables or concurrent-hashtables library to use an unboxed mutable vector "under the hood" rather the boxed one in the published version of the libraries. There might be a noticeable performance gain to be had in doing this.

@Boarders , do you think these ideas are reasonable and in line with what the new character type is capable of?

Boarders commented 4 years ago

@recursion-ninja : Those ideas sound reasonable to me and fit perfectly into this proposal. I hadn't considered the impact on the TCM but that makes perfect sense. I'm unsure what underlying representation the various hashtables libraries use but it sounds definitely worth a look. I think it might, in any case, be that hashing is quite a bit faster for these smaller types so that might be an added benefit for hashtable look up.

recursion-ninja commented 4 years ago

I forgot about the time it takes to generate the hash!

I imagine that for (Word, Word) it would be as simple as XOR, or XOR with a salt. Probably faster in practice than folding XOR over the byte array behind the Natural of a BitVector.

Boarders commented 4 years ago

Since I had the code already available I ran ByteString vs Unboxed.Vector on the n = 100000 input and it looks like ByteString has some pretty poor performance for this sort of thing (I think this is because some of the ByteString fusion framework isn't happy with this example, when I get some time I will make an issue on bytestring to see if this can be fixed or is inherent).

The relevant bits of code are:

-- Note these do not keep the costs as that was awkward to do and I wanted a quick benchmark
fitch' :: (Bits b) => b -> b -> b
fitch' l r
  | popCount (l .&. r) > 0 = l .&. r
  | otherwise              = l .|. r

fitchByteString' :: ByteString -> ByteString -> ByteString
fitchByteString' l r =
  ByteString.pack (ByteString.zipWith (fitch') l r)

fitchUnbox'
  :: Unboxed.Vector Word8 -> Unboxed.Vector Word8 -> Unboxed.Vector Word8
fitchUnbox' = Unboxed.zipWith fitch'

The results are as follows:

benchmarking characters/fitch'-unbox
time                 423.5 μs   (420.9 μs .. 427.2 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 423.0 μs   (421.8 μs .. 424.7 μs)
std dev              4.827 μs   (3.398 μs .. 6.618 μs)

benchmarking characters/fitch'-bytestring
time                 4.133 ms   (4.097 ms .. 4.183 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 4.107 ms   (4.075 ms .. 4.140 ms)
std dev              99.11 μs   (72.28 μs .. 142.7 μs)

Bytestring also presents a challenge with keeping the cost so in some sense it is fortunate the performance is 10x worse.

recursion-ninja commented 4 years ago

Yikes! That's way worse than I expected.

Do you think ByteArray would behave better? I wrote ByteString in my example above, but what I actually had in mind was byte array from Haskell's primitives.

It looks like if we use the supplied primitives from GHC.Prim then we will have to write our own stateful fold by hand using unlifted types. It might be worth it, it might be comparable to Vector Word8 which has a nicer interface to use.

amnh / PCG

Efficient Character Representations #164