Nugine / simd

SIMD-accelerated operations
MIT License
194 stars 9 forks source link

Parallel API #24

Open Nugine opened 1 year ago

Nugine commented 1 year ago

Improve throughput by using multi-threading (rayon).

In most cases, the simd functions are fast enough that we may not benefit from multi-threading. And some functions are not divisible.

Crates with parallel api:

Nugine commented 1 year ago

Parallel threshold

$$ \frac{n}{pv}+c<\frac{n}{v} $$

$$ n > \frac{p}{p-1}cv $$

$n$: input size (B) $p$: num threads $v$: throughput (B/ns) $c$: threading overhead (ns)

Parallel time rate (smaller is better)

$$ \frac{n/(pv)+c}{n/v} < 1 $$

$$ \frac{1}{p}+\frac{cv}{n} < 1 $$

When handling large input, the throughput may be limited by cache miss and page faults. In other words, the throughput hits memory bound. So the theoretical parallel time rate is smaller than the practial one.

Nugine commented 1 year ago
just bench static-experimental --bench base64 --plotting-backend disabled -- 'base64-encode/base64-simd'

base64-encode (GiB/s)

16 32 64 256 1024 4096 65536 262144 524288 1048576
base64-simd/auto 1.827 1.977 3.503 8.008 11.826 12.980 13.494 12.852 12.751 12.723
base64-simd/parallel 1.174 0.004 0.008 0.029 0.114 0.433 5.623 16.196 24.001 31.267
pickfire commented 2 weeks ago

Would it be useful to have option to parse multiple items, which the cache and instruction level parallelism can probably help here.

Use case will be something like being able to use it in polars to read a column of uuid.

Nugine commented 2 weeks ago

Would it be useful to have option to parse multiple items, which the cache and instruction level parallelism can probably help here.

Use case will be something like being able to use it in polars to read a column of uuid.

Could you explain this use case in more detail? How can we accelerate it by multithreading?