cloudflare / circl

CIRCL: Cloudflare Interoperable Reusable Cryptographic Library
http://blog.cloudflare.com/introducing-circl

k12: improve API and support multithreaded computation #443

Open bwesterb opened 1 year ago

bwesterb commented 1 year ago

Use an options-style API, so that the common case stays very simple:

h := k12.NewDraft10()

but we can provide options elegantly:

h := k12.NewDraft10(
    k12.WithContext([]byte("some context")),
    k12.WithWorkers(runtime.NumCPU()),
)
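
For readers who haven't met the pattern, here is a minimal sketch of how an options-style constructor is usually wired up in Go. The `config` struct, `newConfig`, and the single-worker default are hypothetical and only for illustration; they are not taken from this PR's implementation.

```go
package main

import (
	"fmt"
	"runtime"
)

// config is a hypothetical stand-in for whatever state the real
// constructor keeps; it is not circl's internal type.
type config struct {
	context []byte
	workers int
}

// Option mutates a config; each With* helper returns one.
type Option func(*config)

// WithContext sets the customization string.
func WithContext(ctx []byte) Option {
	return func(c *config) { c.context = ctx }
}

// WithWorkers sets how many goroutines may be used.
func WithWorkers(n int) Option {
	return func(c *config) { c.workers = n }
}

// newConfig starts from defaults and applies the options in order.
func newConfig(opts ...Option) config {
	c := config{workers: 1} // assumed default: single-threaded
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	plain := newConfig()
	tuned := newConfig(
		WithContext([]byte("some context")),
		WithWorkers(runtime.NumCPU()),
	)
	fmt.Println(plain.workers, len(plain.context))
	fmt.Println(tuned.workers, string(tuned.context))
}
```

The variadic parameter is what keeps the zero-option call short while leaving room for new options later without changing the constructor's signature.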

Allows multithreaded computation via the WithWorkers() option. On an M2 Pro it scales well with a few workers, but isn't able to utilize all cores effectively:

BenchmarkK12_100B-12             5223334           228.8 ns/op   437.07 MB/s
BenchmarkK12_10K-12               105744         11183 ns/op     894.21 MB/s
BenchmarkK12_100K-12               27141         44364 ns/op    2254.06 MB/s
BenchmarkK12_3M-12                  1172       1010401 ns/op    3243.07 MB/s
BenchmarkK12_32M-12                  100      10033730 ns/op    3265.78 MB/s
BenchmarkK12_327M-12                  12      98888208 ns/op    3313.64 MB/s
BenchmarkK12_3276M-12                  2     991111938 ns/op    3306.19 MB/s
BenchmarkK12x2_32M-12                183       6510423 ns/op    5033.16 MB/s
BenchmarkK12x2_327M-12                18      63622058 ns/op    5150.41 MB/s
BenchmarkK12x2_3276M-12                2     632823584 ns/op    5178.06 MB/s
BenchmarkK12x4_32M-12                364       3300120 ns/op    9929.34 MB/s
BenchmarkK12x4_327M-12                39      29477854 ns/op    11116.14 MB/s
BenchmarkK12x4_3276M-12                2     581819167 ns/op    11263.98 MB/s
BenchmarkK12x8_32M-12                520       2301923 ns/op    14235.05 MB/s
BenchmarkK12x8_327M-12                76      15590312 ns/op    21018.18 MB/s
BenchmarkK12x8_3276M-12                4     296590469 ns/op    22096.46 MB/s
BenchmarkK12xCPUs_32M-12             472       2526827 ns/op    12968.04 MB/s
BenchmarkK12xCPUs_327M-12             78      15139957 ns/op    21643.39 MB/s
BenchmarkK12xCPUs_3276M-12             4     280958114 ns/op    23325.90 MB/s

We only reach 23GB/s (with 12 workers) instead of the lower bound of 33GB/s expected with 10 performance cores (10 × the single-threaded 3.3GB/s).
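
To make the gap concrete, the sketch below computes the speedup and the scaling efficiency from two rows of the table above (BenchmarkK12_3276M and BenchmarkK12xCPUs_3276M). The `efficiency` helper is hypothetical, and "ideal" here simply means the single-threaded rate multiplied by the core count.

```go
package main

import "fmt"

// efficiency returns measured throughput as a fraction of ideal linear
// scaling, i.e. cores times the single-threaded rate.
func efficiency(measured, single float64, cores int) float64 {
	return measured / (single * float64(cores))
}

func main() {
	const single = 3306.19    // MB/s, BenchmarkK12_3276M
	const measured = 23325.90 // MB/s, BenchmarkK12xCPUs_3276M

	fmt.Printf("speedup over one worker: %.2fx\n", measured/single)
	fmt.Printf("efficiency against 10 cores: %.0f%%\n",
		100*efficiency(measured, single, 10))
}
```

With these figures the speedup comes out at roughly 7x, so about 70% of the 10-core lower bound.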

Adds {Max,Next}WriteSize to suggest to the caller how large to make their Write() calls.
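
A rough sketch of how a caller might use such a hint when streaming from an io.Reader. The `sizedWriter` interface and the `MaxWriteSize() int` signature are assumptions made for this example and may not match what the PR adds; `recorder` is a fake hasher that only records write sizes, and NextWriteSize is left out.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// sizedWriter is hypothetical: the MaxWriteSize signature is assumed.
type sizedWriter interface {
	io.Writer
	MaxWriteSize() int
}

// feed streams r into h using buffers of the size h suggests.
func feed(h sizedWriter, r io.Reader) (int64, error) {
	size := h.MaxWriteSize()
	if size <= 0 {
		return 0, fmt.Errorf("invalid write size %d", size)
	}
	buf := make([]byte, size)
	var total int64
	for {
		// ReadFull fills buf unless the input ends first.
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if _, werr := h.Write(buf[:n]); werr != nil {
				return total, werr
			}
			total += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// recorder is a fake hasher that only records the size of each Write.
type recorder struct {
	max   int
	sizes []int
}

func (r *recorder) MaxWriteSize() int { return r.max }

func (r *recorder) Write(p []byte) (int, error) {
	r.sizes = append(r.sizes, len(p))
	return len(p), nil
}

// demo feeds ten bytes through a recorder that asks for 4-byte writes.
func demo() (int64, []int) {
	rec := &recorder{max: 4}
	n, _ := feed(rec, strings.NewReader("0123456789"))
	return n, rec.sizes
}

func main() {
	n, sizes := demo()
	fmt.Println(n, sizes) // prints: 10 [4 4 2]
}
```

The loop is written by hand rather than with io.CopyBuffer because io.CopyBuffer bypasses the supplied buffer when the reader implements io.WriterTo, which would defeat the size hint.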

bwesterb commented 1 year ago

> requires message of many GB in memory. Not sure if this is gonna be realistic since usually large reads are done in batches of smaller sizes.

Those tests don't use a buffer of GBs. Instead they use the size suggested by MaxWriteSize(), which atm is lanes × CPUs × 8kB, i.e. 2 × 12 × 8kB = 192kB on M2 Pro.

> but timings make me think if something must be done differently (or I missed something)

There clearly is some concurrency bottleneck here which I haven't figured out yet.
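
For context on the kind of structure under discussion, here is a generic sketch of chunk-parallel hashing with a fixed pool of goroutines. It is not the code in this PR and not a diagnosis of the bottleneck: SHA-256 stands in for the real permutation, `hashChunks` and `sameDigests` are hypothetical names, and only the 8kB chunk size is taken from the thread.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

const chunkSize = 8 * 1024 // 8kB chunks, as mentioned in the thread

// hashChunks hashes each chunk of msg on one of `workers` goroutines and
// returns the digests in chunk order. SHA-256 is a stand-in, not K12.
func hashChunks(msg []byte, workers int) [][sha256.Size]byte {
	if workers < 1 {
		workers = 1
	}
	n := (len(msg) + chunkSize - 1) / chunkSize
	out := make([][sha256.Size]byte, n)
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				end := (i + 1) * chunkSize
				if end > len(msg) {
					end = len(msg)
				}
				// Each index is handled once, so slots never collide.
				out[i] = sha256.Sum256(msg[i*chunkSize : end])
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

// sameDigests reports whether two digest lists are identical.
func sameDigests(a, b [][sha256.Size]byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	msg := make([]byte, 100*1024+5)
	for i := range msg {
		msg[i] = byte(i)
	}
	serial := hashChunks(msg, 1)
	parallel := hashChunks(msg, 4)
	fmt.Println(len(serial), sameDigests(serial, parallel))
}
```

In a layout like this, handing out one chunk per channel send adds synchronization cost per 8kB of input, which is one reason per-worker batch size matters when measuring scaling.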