Closed: weaversa closed this 4 years ago

Is there anything holding back a multithreaded Cryptol interpreter? I think about this every time I write a map.
I guess it wouldn't actually be that hard to add some multithreaded support to the interpreter. I'm slightly worried that we might discover some race conditions in the interpreter, but given the pure nature of Cryptol itself, we might actually be in good shape.
I think the bigger question is: will there be enough inherent parallelism in the algorithms we care about to make it worthwhile? Research in automatic parallelism for functional languages has been fairly disappointing, as I recall. We might be able to add a parallel map primitive to let the user be more explicit about where to fork parallel tasks... for purely data-parallel applications this might be worthwhile.
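Concretely, such a primitive could have an interface that simply mirrors map. The sketch below is only illustrative: the type is an assumption about what the primitive might look like, and the body is plain sequential map so that it type-checks; a real primitive would evaluate the elements on separate cores.

// Hypothetical interface for a parallel map primitive; the type mirrors map.
// The body here is just sequential map so the sketch is valid Cryptol; the
// real primitive would farm the elements out to multiple cores.
parmap : {n, a, b} (fin n) => (a -> b) -> [n]a -> [n]b
parmap f xs = map f xs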
I do have something specific in mind: building the Merkle tree for PQ Merkle signature schemes is intense, but trivially parallelizable with the parallel map primitive you suggest.
I was curious what it would take to do this; turns out it isn't super hard (at least for the concrete evaluator). Some very initial, but encouraging results:
$ time $CRYPTOL examples/AES.cry -c 'sum (map (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])'
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9
real 0m14.903s
user 0m14.754s
sys 0m0.136s
$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N1
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9
real 0m15.926s
user 0m15.774s
sys 0m0.139s
$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N4
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9
real 0m4.537s
user 0m16.920s
sys 0m0.278s
$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N8
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9
real 0m4.064s
user 0m28.482s
sys 0m0.823s
Speedup is clearly less than linear for this function, but is pretty substantial nonetheless.
I wonder if some of the time is spent computing the key schedule. That only needs to happen once here, since it can then be reused for each of the 101 runs.
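As a sketch of what that could look like, the expansion can be hoisted into a where clause so it is evaluated once and shared by every parallel task. The names expandKey and encryptWithSchedule are hypothetical; they assume the AES module exposes (or were changed to expose) the key schedule separately from the per-block encryption.

// `expandKey` and `encryptWithSchedule` are hypothetical names standing in
// for an AES module that separates key expansion from block encryption.
sumEncShared xs = sum (parmap (\x -> encryptWithSchedule (ks, x)) xs)
  where
    ks = expandKey zero   // evaluated once; shared by every parallel task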
Your parmap is great!
Now, for the second part, can you do, say, a million encrypts? Or, does Cryptol run out of memory?
The final part would be, obviously (@atomb), for the AES here to be a native function.
I tried something a bit more modest first: 10000 encrypts. Memory usage has been growing slowly and the machine is getting sluggish, so that's not necessarily great. I guess we'd probably be better off using a more explicit thread pool rather than just spawning all the threads and letting GHC's runtime deal with it.
A little more playing around and I discovered that moving to a more map-reduce style significantly improved the memory usage.
import AES

sumEnc2 xs = sum (map sumSegment ys)
  where
    ys = groupBy`{100} xs
    sumSegment y = sum (parmap (\x -> aesEncrypt (x, zero)) y)
This version seems to run in basically constant space and fills up as many cores as I give it. Hypothesis: reducing the partial values early allows the memory consumed by the parallel tasks to be collected, whereas holding one big sequence of results keeps it all live. We might be able to improve the internals somehow to release memory once a value is computed, but it isn't immediately clear how.
A good challenge would be to build a Merkle tree. It could be that parmap (with your trick) is a perfect fit, but that Cryptol values simply use too much memory to represent such a tree for reasonable parameters. So that may be another ticket.
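As a rough sketch of the shape of that computation, one level of the tree could be hashed with a single parmap, since every parent node depends only on its own pair of children. hash2 below is only a placeholder so the example type-checks; a real construction would use SHA-256 or similar over each sibling pair.

// Placeholder compression function so the sketch type-checks.
// NOT a real hash; a real Merkle tree would use SHA-256 or similar here.
hash2 : ([256], [256]) -> [256]
hash2 (l, r) = l ^ r

// Hash one level of the tree: each parent depends only on its own pair of
// children, so the whole level is a single parallel map.
hashLevel : {n} (fin n) => [n][2][256] -> [n][256]
hashLevel pairs = parmap (\p -> hash2 (p @ 0, p @ 1)) pairs

// e.g. parents = hashLevel (groupBy`{2} leaves)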
For posterity, here are the simple examples I've been experimenting with:
import AES

mapReduce :
  {chunkSize, a, b, n}
  (fin n, fin chunkSize, chunkSize >= 1) =>
  (a -> b) -> (b -> b -> b) -> b -> [n]a -> b
mapReduce f g u xs = foldl g z zs
  where
    // Reduce one chunk of mapped results sequentially.
    red : {m} (fin m) => [m]b -> b
    red = foldl g u

    // Split off the ragged prefix of length n % chunkSize, then group the
    // remainder into chunkSize-sized pieces.
    y   : [n % chunkSize] a
    ys  : [n - n % chunkSize] a
    yss : [n / chunkSize][chunkSize] a
    (y, ys) = splitAt`{n % chunkSize} xs
    yss = groupBy`{chunkSize} ys

    // Map each chunk in parallel and reduce it to a single value right away,
    // so the intermediate results can be collected.
    z  = red (parmap f y)
    zs = map (\x -> red (parmap f x)) yss

sumEnc xs = sum (parmap (\x -> aesEncrypt (x, zero)) xs)

sumEnc2 xs = sum (map sumSegment ys)
  where
    ys = groupBy`{100} xs
    sumSegment y = sum (parmap (\x -> aesEncrypt (x, zero)) y)

sumEnc3 = mapReduce`{100} (\x -> aesEncrypt (x, zero)) (+) 0
sumEnc has rather poor memory usage as the number of elements scales. After some time, processor usage would periodically stall and the machine would swap for a while before saturating the cores again; I interpret this as major garbage collection events. I eventually had to cancel largish jobs using this method before they finished.

This seems to be fixed in both sumEnc2 and sumEnc3, which appear to have constant memory usage. Both were able to run successfully with an input argument of [1 .. 100000], which takes several hours. That job kept the 4 cores I allocated saturated the entire time, GC pauses appeared to be minimal, and the machine had no swapping issues.
The parmap primitive is now included in master as of #792, which should allow some additional experimentation.
In #868, I've updated parmap to reduce its arguments to normal form (NF) rather than just to weak head normal form (WHNF) and used a strictly-accumulating version of sum. With this done, the performance difference between the naive sumEnc above and the trickier mapReduce version basically disappears (mapReduce still has a minor advantage, but it is quite small).
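For reference, a strictly-accumulating sum can also be written at the Cryptol level, assuming a Cryptol recent enough to provide the strict fold foldl'. This only illustrates the idea, specialized to the 128-bit words from the AES example; it is not the #868 change itself.

// Illustrative strictly-accumulating sum over 128-bit words, assuming foldl'
// is available; each partial sum is forced as it is produced instead of
// building up a chain of thunks.
sum128 : {n} (fin n) => [n][128] -> [128]
sum128 = foldl' (+) zero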
With #867 and #868 merged, I think we can consider this closed.