GaloisInc / cryptol

Cryptol: The Language of Cryptography
https://galoisinc.github.io/cryptol/master/RefMan.html
BSD 3-Clause "New" or "Revised" License

Multithreaded support #787

Closed · weaversa closed this issue 4 years ago

weaversa commented 4 years ago

Is there anything holding back a multithreaded Cryptol interpreter? I think about this every time I write a map.

robdockins commented 4 years ago

I guess it wouldn't actually be that hard to add some multithreaded support to the interpreter. I'm slightly worried that we might discover some race conditions in the interpreter, but given the pure nature of Cryptol itself, we might actually be in good shape.

I think the bigger question is: will there be enough inherent parallelism in the algorithms we care about to make it worthwhile? Research in automatic parallelism for functional languages has been fairly disappointing, as I recall. We might be able to add a parallel map primitive to let the user be more explicit about where to fork parallel tasks... for purely data-parallel applications this might be worthwhile.
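For reference, GHC (the language the Cryptol interpreter is written in) already makes this kind of explicit, user-directed parallel map cheap to express. A sketch of the idea using the `parallel` package's `Control.Parallel.Strategies` (illustrative only, not the interpreter's actual code; `expensive` is a made-up stand-in for a costly pure function):

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Stand-in for an expensive pure function (e.g. a block cipher).
expensive :: Int -> Integer
expensive n = sum [1 .. toInteger n]

-- `parMap rdeepseq f` sparks one task per list element and evaluates
-- each result to normal form; the RTS distributes the sparks over the
-- capabilities requested with +RTS -N<k>.
parSum :: [Int] -> Integer
parSum xs = sum (parMap rdeepseq expensive xs)
```

Because Cryptol values are pure, a `parmap` primitive in the interpreter could in principle delegate to exactly this kind of strategy.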

weaversa commented 4 years ago

I do have something specific in mind: building the Merkle tree for PQ Merkle signature schemes is intense, but trivially parallelizable with the parallel map primitive you suggest.

robdockins commented 4 years ago

I was curious what it would take to do this; turns out it isn't super hard (at least for the concrete evaluator). Some very initial, but encouraging results:

$ time $CRYPTOL examples/AES.cry -c 'sum (map (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])'
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m14.903s
user    0m14.754s
sys     0m0.136s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N1
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m15.926s
user    0m15.774s
sys     0m0.139s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N4
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.537s
user    0m16.920s
sys     0m0.278s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N8
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.064s
user    0m28.482s
sys     0m0.823s

Speedup is clearly less than linear for this function, but is pretty substantial nonetheless.

weaversa commented 4 years ago

I wonder if some of the time is spent computing the key schedule. That only needs to happen once here; it can then be reused for each of the 101 runs.

Your parmap is great!

Now, for the second part: can you do, say, a million encrypts? Or does Cryptol run out of memory?

The final part would be, obviously (@atomb), for the AES here to be a native function.

robdockins commented 4 years ago

I tried something a bit more modest first: 10000 encrypts. Memory usage has been slowly growing and the machine is getting sluggish, so that's not necessarily great. I guess we'd probably be better off using a more explicit thread pool rather than just spawning all the threads and letting GHC's runtime deal with it.
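For what it's worth, in plain GHC the usual way to avoid creating one spark per element is to chunk the input, e.g. with `parListChunk` from the `parallel` package. A sketch of that idiom (not the interpreter's code; the chunk size 100 is arbitrary):

```haskell
import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)

-- One spark per 100-element chunk instead of one per element,
-- bounding scheduler overhead and the amount of live data.
chunkedParMap :: (Int -> Integer) -> [Int] -> [Integer]
chunkedParMap f xs = withStrategy (parListChunk 100 rdeepseq) (map f xs)
```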

robdockins commented 4 years ago

A little more playing around and I discovered that moving to a more map-reduce style significantly improved the memory usage.

import AES

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

This version seems to run in basically constant space and fills up as many cores as I give it. Hypothesis: reducing partial values early allows the memory consumed by the parallel tasks to be collected, whereas a big sequence doesn't. We might be able to improve the internals somehow to release memory once the value is computed, but it isn't immediately clear how.

weaversa commented 4 years ago

A good challenge would be to build a Merkle tree. It could be that parmap (with your trick) is perfect, but that Cryptol objects just use too much memory to represent such a thing for reasonable parameters. So that may be another ticket.

robdockins commented 4 years ago

For posterity, here are the simple examples I've been experimenting with:

import AES

mapReduce :
  {chunkSize, a, b, n}
  (fin n, fin chunkSize, chunkSize >= 1) =>
  (a -> b) -> (b -> b -> b) -> b -> [n]a -> b
mapReduce f g u xs = foldl g z zs
  where
  // Reduce a segment with the combining operation, starting from the unit.
  red : {m} (fin m) => [m]b -> b
  red = foldl g u

  y   : [ n%chunkSize ] a
  ys  : [ n - n%chunkSize ] a
  yss : [ n / chunkSize ][chunkSize] a

  // Split off the leftover prefix, then group the rest into
  // chunkSize-sized segments.
  (y,ys) = splitAt`{ n % chunkSize } xs
  yss = groupBy`{chunkSize} ys

  // Map in parallel within each segment, reducing each segment
  // eagerly so its results can be collected early.
  z  = red (parmap f y)
  zs = map (\x -> red (parmap f x)) yss

sumEnc xs = sum (parmap (\x -> (aesEncrypt (x, zero))) xs)

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

sumEnc3 = mapReduce`{100} (\x -> aesEncrypt (x,zero)) (+) 0

sumEnc has rather poor memory usage as the number of elements scales: after some time, processor usage would periodically stall and the machine would swap for a while before saturating the cores again, which I interpret as major garbage-collection events. I eventually had to cancel large-ish jobs using this method before they finished.

This seems to be fixed in both sumEnc2 and sumEnc3, which appear to run in constant space. Both ran successfully with an input argument of [1 .. 100000], which takes several hours; the job fully saturated the 4 cores I allocated the entire time, GC pauses appeared to be minimal, and the machine had no swapping issues.

robdockins commented 4 years ago

The parmap primitive is now included in master as of #792, which should allow some additional experimentation.

robdockins commented 4 years ago

In #868, I've updated parmap to reduce its arguments to normal form (NF) rather than just to weak head normal form (WHNF) and used a strictly-accumulating version of sum. With this done, the performance difference between the naive sumEnc above and the trickier mapReduce version basically disappears (mapReduce still has a minor advantage, but it is quite small).
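The WHNF/NF distinction here is the classic GHC space-leak story. In plain Haskell terms, the change corresponds roughly to the following sketch (illustrative only, not the actual patch):

```haskell
import Control.DeepSeq (NFData, force)
import Data.List (foldl')

-- Reducing each parallel task's result only to WHNF leaves thunks
-- alive until the final fold demands them; forcing each result to
-- normal form and accumulating strictly (foldl') lets each result
-- be consumed and garbage-collected promptly.
sumNF :: (NFData b, Num b) => (a -> b) -> [a] -> b
sumNF f = foldl' (\acc x -> acc + force (f x)) 0
```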

robdockins commented 4 years ago

With #867 and #868 merged, I think we can consider this closed.