GaloisInc / cryptol

Cryptol: The Language of Cryptography
https://galoisinc.github.io/cryptol/master/RefMan.html
BSD 3-Clause "New" or "Revised" License
1.14k stars 126 forks source link

Using result of `:readByteArray` on large file is slow #1428

Open qsctr opened 2 years ago

qsctr commented 2 years ago

In #346 the implementation of :readByteArray was changed to turn the input file into a huge number and bind it to split applied to the huge number, to reduce memory usage compared to building up a huge sequence literal.

https://github.com/GaloisInc/cryptol/blob/39dea16096539cb7e90b969c571f42044af29da4/src/Cryptol/REPL/Command.hs#L1046-L1061

However, using the resulting sequence turns out to be very slow when the file is big, because the result of split is not a whole new sequence in memory but rather a kind of lazy sequence that goes back to the original sequence and looks up the appropriate value whenever you index into it. And since the original sequence in this case is a huge bitvector represented as an Integer, the lookup of a byte within it results in doing a shift as pointed out by @yav, which is linear in the size of the integer.

https://github.com/GaloisInc/cryptol/blob/39dea16096539cb7e90b969c571f42044af29da4/src/Cryptol/Eval/Generic.hs#L1072-L1084

Therefore processing the whole result of :readByteArray takes quadratic time. Note that fully forcing the result of split does not help, because the result of the lookup is not cached, so it will be slow every time you use the sequence.

It seems to me that the best way of solving this would be to just construct the sequence value directly in memory in a form where lookup is constant time and add a way to bind it to values instead of just expressions. Alternatively we could optimize the lookup of a byte within the huge bitvector to be constant time, perhaps by using a series of testBit calls as suggested by @RyanGlScott. This could improve performance more generally any time we are looking up a small number of bits in a very big bitvector, although I'm not sure how often that occurs outside of :readByteArray.

RyanGlScott commented 2 years ago

The hgmp library provides Haskell bindings to some low-level Integer manipulation primitives from gmp, which may offer an alternative way to search for a particular byte in an Integer. In particular, mpz_import and/or mpz_export might be useful.