k12: improve API and support multithreaded computation

Use an options-style API, so that the common case stays very simple:

h := k12.NewDraft10()

but we can provide options elegantly:

h := k12.NewDraft10(
    k12.WithContext([]byte("some context")),
    k12.WithWorkers(runtime.NumCPU()),
)
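For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of how an options-style constructor can be built. The `config` type and `newConfig` function are hypothetical and only illustrate the idea. The option names mirror the ones above, but this is not the code from the k12 package.

```go
package main

import "fmt"

// config holds the settings an option can change. It is a hypothetical
// stand-in for the hasher state, not the type used by the k12 package.
type config struct {
	context []byte
	workers int
}

// Option mutates a config; a constructor accepts any number of them.
type Option func(*config)

// WithContext sets the customization string.
func WithContext(ctx []byte) Option {
	return func(c *config) { c.context = ctx }
}

// WithWorkers sets how many goroutines may work in parallel.
func WithWorkers(n int) Option {
	return func(c *config) { c.workers = n }
}

// newConfig starts from single-worker defaults and applies each option
// in the order it was passed.
func newConfig(opts ...Option) config {
	c := config{workers: 1}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	// Common case: no options at all.
	fmt.Printf("%+v\n", newConfig())
	// Customized case: any subset of options, in any order.
	fmt.Printf("%+v\n", newConfig(WithContext([]byte("some context")), WithWorkers(4)))
}
```

Because the constructor is variadic, calling it with no arguments keeps the defaults, and new options can be added later without breaking existing callers.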

Allow multithreaded computation via the WithWorkers() option. On an M2 Pro, throughput scales well with a few workers, but we are not able to utilize all cores effectively:

BenchmarkK12_100B-12             5223334           228.8 ns/op   437.07 MB/s
BenchmarkK12_10K-12               105744         11183 ns/op     894.21 MB/s
BenchmarkK12_100K-12               27141         44364 ns/op    2254.06 MB/s
BenchmarkK12_3M-12                  1172       1010401 ns/op    3243.07 MB/s
BenchmarkK12_32M-12                  100      10033730 ns/op    3265.78 MB/s
BenchmarkK12_327M-12                  12      98888208 ns/op    3313.64 MB/s
BenchmarkK12_3276M-12                  2     991111938 ns/op    3306.19 MB/s
BenchmarkK12x2_32M-12                183       6510423 ns/op    5033.16 MB/s
BenchmarkK12x2_327M-12                18      63622058 ns/op    5150.41 MB/s
BenchmarkK12x2_3276M-12                2     632823584 ns/op    5178.06 MB/s
BenchmarkK12x4_32M-12                364       3300120 ns/op    9929.34 MB/s
BenchmarkK12x4_327M-12                39      29477854 ns/op    11116.14 MB/s
BenchmarkK12x4_3276M-12                2     581819167 ns/op    11263.98 MB/s
BenchmarkK12x8_32M-12                520       2301923 ns/op    14235.05 MB/s
BenchmarkK12x8_327M-12                76      15590312 ns/op    21018.18 MB/s
BenchmarkK12x8_3276M-12                4     296590469 ns/op    22096.46 MB/s
BenchmarkK12xCPUs_32M-12             472       2526827 ns/op    12968.04 MB/s
BenchmarkK12xCPUs_327M-12             78      15139957 ns/op    21643.39 MB/s
BenchmarkK12xCPUs_3276M-12             4     280958114 ns/op    23325.90 MB/s

We only reach 23 GB/s (at 12x, one worker per CPU) instead of the lower bound of 33 GB/s expected with 10 performance cores.
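The gap can be made concrete with a little arithmetic on the 3276M rows of the benchmark. The sketch below computes the speedup of each multi-worker run over the single-worker run, plus the ideal figure under the text's assumption of linear scaling over 10 cores. The `speedup` helper is a hypothetical name introduced only for this illustration.

```go
package main

import "fmt"

// speedup returns how many times faster a parallel run is than the
// single-worker baseline, given both throughputs in MB/s.
func speedup(baseline, parallel float64) float64 {
	return parallel / baseline
}

func main() {
	// Throughput of BenchmarkK12_3276M (one worker), in MB/s.
	const baseline = 3306.19

	// Throughputs of the multi-worker 3276M runs, in MB/s.
	runs := []struct {
		name string
		mbps float64
	}{
		{"x2", 5178.06},
		{"x4", 11263.98},
		{"x8", 22096.46},
		{"xCPUs", 23325.90},
	}
	for _, r := range runs {
		fmt.Printf("%-6s %.2fx\n", r.name, speedup(baseline, r.mbps))
	}

	// Ideal linear scaling over 10 cores, as assumed in the text.
	fmt.Printf("ideal  %.0f MB/s\n", 10*baseline)
}
```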

Add {Max,Next}WriteSize to suggest to the caller how large their Write() calls should be.
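A caller might use such a hint roughly as follows. This is only a sketch: the `sizedWriter` interface, the `writeAll` helper and the `recorder` type are hypothetical, and the `NextWriteSize() int` signature is an assumption rather than the confirmed k12 API.

```go
package main

import (
	"fmt"
	"io"
)

// sizedWriter is a hypothetical interface: a writer that can suggest how
// many bytes the next Write call should carry. The method name mirrors
// the text above; its signature is an assumption.
type sizedWriter interface {
	io.Writer
	NextWriteSize() int
}

// writeAll feeds data to w in pieces of the suggested size, falling back
// to the remaining data when the hint is unusable or too large.
func writeAll(w sizedWriter, data []byte) error {
	for len(data) > 0 {
		n := w.NextWriteSize()
		if n <= 0 || n > len(data) {
			n = len(data)
		}
		if _, err := w.Write(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

// recorder is a stand-in for a hasher: it always suggests a fixed size
// and records the length of every write it receives.
type recorder struct {
	hint  int
	sizes []int
}

func (r *recorder) NextWriteSize() int { return r.hint }

func (r *recorder) Write(p []byte) (int, error) {
	r.sizes = append(r.sizes, len(p))
	return len(p), nil
}

func main() {
	r := &recorder{hint: 8192}
	if err := writeAll(r, make([]byte, 20000)); err != nil {
		panic(err)
	}
	fmt.Println(r.sizes) // lengths of the individual Write calls
}
```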
