ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics – for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Integrate use of `perfplot` package with benchmark #201

Closed: jimthompson5802 closed this 1 month ago

jimthompson5802 commented 1 month ago

Addresses Issue #193.

@ashvardanian Here is the start of the PR. I'm marking it in DRAFT mode because more work is needed on it.

I created a new module bench_perfplot.py as a proof-of-concept of using perfplot. Right now, much of the perfplot usage is hard-coded. For this proof-of-concept, the benchmark compares np.dot against simd.dot with a single dtype. The plot shows timings for different values of ndim.

I'm keeping the same structure as the original benchmark.py, as I understand it. The timing measures how long it takes to process a batch of row vectors, i.e., if the batch has n rows and each row has ndim elements, then the reported time is the total time taken to process all n rows of the batch.
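Roughly, the wiring looks like this (a simplified sketch rather than the exact code in the PR; the batch size, the row pairing, and the kernel names here are placeholders):

```python
import numpy as np
import perfplot
import simsimd as simd

BATCH_SIZE = 1_000  # placeholder batch size


def setup(ndim: int):
    # Two batches of row vectors; each kernel processes all BATCH_SIZE pairs.
    a = np.random.rand(BATCH_SIZE, ndim).astype(np.float32)
    b = np.random.rand(BATCH_SIZE, ndim).astype(np.float32)
    return a, b


def numpy_dot(batches):
    a, b = batches
    return [np.dot(x, y) for x, y in zip(a, b)]


def simsimd_dot(batches):
    a, b = batches
    return [simd.dot(x, y) for x, y in zip(a, b)]


perfplot.show(
    setup=setup,
    kernels=[numpy_dot, simsimd_dot],
    labels=["np.dot", "simd.dot"],
    n_range=[2**k for k in range(0, 16)],  # ndim from 1 to 32768
    xlabel="ndim",
    equality_check=None,  # result comparison disabled, see the question below
)
```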

Here is a perfplot chart generated for this proof-of-concept:

[plot: np.dot vs. simd.dot timings across ndim]

Let me know what you think.

jimthompson5802 commented 1 month ago

The perfplot-specific code in the proof-of-concept is located here: https://github.com/jimthompson5802/SimSIMD/blob/1d46ea782f0ae52041b6cbfcacd3cc1d59b63a6d/python/bench_perfplot.py#L312-L345

One open question: perfplot has the ability to compare results for "equality"; right now that check is turned off. I'm thinking we would want to make sure we are getting the same numerical results. If so, what is the basis for comparing equality in the context of processing a batch of row vectors?

ashvardanian commented 1 month ago

I think it may be a good idea to show relative speedups instead of row duration.

Alternatively, it may be a good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.

What do you think?

ashvardanian commented 1 month ago

@jimthompson5802 equality comparison logic in such kernels is very tricky and already handled by the tests using fixtures. Let's avoid overcomplicating the benchmark with equality checks.

jimthompson5802 commented 1 month ago

By speed-up, I assume you mean the speed-up compared to the "baseline" function. In this proof-of-concept, the baseline function is np.dot, and we should show the speed-up of simd.dot for the different values of ndim.

Is this the correct understanding?

ashvardanian commented 1 month ago

> By speed-up, I assume you mean the speed-up compared to the "baseline" function. In this proof-of-concept, the baseline function is np.dot, and we should show the speed-up of simd.dot for the different values of ndim.
>
> Is this the correct understanding?

Updated my comment above for clarity 🤗

jimthompson5802 commented 1 month ago

@ashvardanian I can get perfplot to produce a "speed-up" plot. However, this requires two parameters, one of which seems like an arbitrary setting. With these settings, perfplot generates a plot like this:

[plot: speed-up of simd.dot and spd.cosine relative to np.dot across ndim]

For this plot, np.dot is designated as the "baseline" function, so its speed-up value is 1.0. All the others are relative to np.dot. In this situation, simd.dot is "faster" because its speed-up factor is > 1 for most values of ndim. OTOH, spd.cosine is "slower" because its speed-up factor is < 1.

Is this closer to what you are expecting?

To get this plot, I had to make use of a parameter called flops=. The documentation for this parameter is sparse to non-existent AFAICT. In the perfplot sample code, this parameter is a Python callable that returns (I am guessing here) the number of floating-point operations for the given vector size (ndim). The presence of this value, together with a value for the relative_to parameter, causes perfplot to compute the "speed-up" from the time durations as follows: https://github.com/nschloe/perfplot/blob/f224a8959f16757f327bfc8f4aba59218f1d0a46/src/perfplot/_main.py#L116-L141

            else:
                flops = self.timings_s[relative_to] / self.timings_s
                plt.title(f"FLOPS relative to {self.labels[relative_to]}")

Smaller values of self.timings_s relative to self.timings_s[relative_to] lead to values > 1.0.
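Putting the two parameters together, the call looks roughly like this (a simplified sketch, not the exact PR code; the flops callable and the baseline index are illustrative choices):

```python
import numpy as np
import perfplot
import simsimd as simd

BATCH_SIZE = 1_000  # placeholder batch size


def setup(ndim: int):
    a = np.random.rand(BATCH_SIZE, ndim).astype(np.float32)
    b = np.random.rand(BATCH_SIZE, ndim).astype(np.float32)
    return a, b


out = perfplot.bench(
    setup=setup,
    kernels=[
        lambda batches: [np.dot(x, y) for x, y in zip(*batches)],
        lambda batches: [simd.dot(x, y) for x, y in zip(*batches)],
    ],
    labels=["np.dot", "simd.dot"],
    n_range=[2**k for k in range(0, 16)],
    xlabel="ndim",
    equality_check=None,
    # Rough operation count for one batch: a multiply and an add per
    # dimension, for each of the BATCH_SIZE row pairs.
    flops=lambda ndim: 2 * ndim * BATCH_SIZE,
)

# Index 0 (np.dot) becomes the baseline via the relative_to parameter
# discussed above: its curve is pinned at 1.0 and the other curves become
# speed-up factors relative to it.
out.save("simsimd_speed_up.png", relative_to=0)
```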

I've looked, but it appears we have little control over the appearance of the perfplot-generated plot; I can't find a perfplot-provided means to customize it. For example, if we wanted to change the plot title or add annotations to the plot, it is not clear how we could do that.

I'm pointing this out because meeting this requirement,

> Alternatively, it may be a good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.

may require some control over how the plot is created.

Assuming the above sample plot is heading in the right direction, I'm now going to focus on how to meet the requirement stated above.

ashvardanian commented 1 month ago

> For this plot, np.dot is designated as the "baseline" function, so its speed-up value is 1.0. All the others are relative to np.dot. In this situation, simd.dot is "faster" because its speed-up factor is > 1 for most values of ndim. OTOH, spd.cosine is "slower" because its speed-up factor is < 1.

We shouldn't put spd.cosine and simd.dot on the same chart. There should be a separate chart for every function name. In this case, if dot products are being reviewed, np.dot<float64> should be the baseline, but the chart can also feature np.dot<float16>, np.dot<float32>, np.dot<int8>, simd.dot<float64>, simd.dot<float32>, simd.dot<float16>, simd.dot<bfloat16>:

> good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.

We may also want to use the vector size in bytes on the X axis, as opposed to dimensions. Can make a boolean option for that in the CLI 🤗
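Something along these lines, just to illustrate the layout (a sketch, not working PR code; the dtype subset and label format are illustrative):

```python
import numpy as np
import perfplot
import simsimd as simd

BATCH_SIZE = 1_000  # placeholder batch size
dtypes = [np.float64, np.float32, np.float16]  # illustrative subset


def setup(ndim: int):
    a = np.random.rand(BATCH_SIZE, ndim)
    b = np.random.rand(BATCH_SIZE, ndim)
    # Cast once per ndim so the casts are excluded from the timings.
    return {dt: (a.astype(dt), b.astype(dt)) for dt in dtypes}


def np_dot(dtype):
    def kernel(batches):
        a, b = batches[dtype]
        return [np.dot(x, y) for x, y in zip(a, b)]
    return kernel


def simd_dot(dtype):
    def kernel(batches):
        a, b = batches[dtype]
        return [simd.dot(x, y) for x, y in zip(a, b)]
    return kernel


out = perfplot.bench(
    setup=setup,
    kernels=[np_dot(dt) for dt in dtypes] + [simd_dot(dt) for dt in dtypes],
    labels=[f"np.dot<{np.dtype(dt).name}>" for dt in dtypes]
    + [f"simd.dot<{np.dtype(dt).name}>" for dt in dtypes],
    n_range=[2**k for k in range(0, 16)],
    xlabel="ndim",
    equality_check=None,
)
# np.dot<float64> is kernel 0, so it serves as the baseline for this chart.
out.show(relative_to=0)
```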

jimthompson5802 commented 1 month ago

@ashvardanian Here is an updated sample plot. For testing purposes, I'm only using 2 data types per function call. I believe this is more in line with what you are looking for.

[plot: per-kernel speed-up chart with two data types per function]

I still have to do more due diligence to confirm the speed-up computations are what you are looking for.

Let me know what you think.

jimthompson5802 commented 1 month ago

@ashvardanian I've incorporated what we discussed and am marking this ready for review. Here is a summary of the key changes:

Supported command-line args:

usage: bench_perfplot.py [-h] [--ndim NDIM] [--torch] [--tf] [--jax]

Benchmark SimSIMD

options:
  -h, --help   show this help message and exit
  --ndim NDIM  Size of vectors to benchmark, either 'default' powers of 2 (from 1 to 32K) or comma-separated list of integers
  --torch      Profile PyTorch
  --tf         Profile TensorFlow
  --jax        Profile JAX
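The corresponding argparse setup is roughly as follows (a simplified sketch of what bench_perfplot.py defines, not a verbatim copy):

```python
import argparse

parser = argparse.ArgumentParser(description="Benchmark SimSIMD")
parser.add_argument(
    "--ndim",
    default="default",
    help="Size of vectors to benchmark, either 'default' powers of 2 "
    "(from 1 to 32K) or comma-separated list of integers",
)
parser.add_argument("--torch", action="store_true", help="Profile PyTorch")
parser.add_argument("--tf", action="store_true", help="Profile TensorFlow")
parser.add_argument("--jax", action="store_true", help="Profile JAX")
args = parser.parse_args()
```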

Here is the output for the default execution:

# Benchmarking SimSIMD

- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296

[plot: speed-up chart for the default execution]

Here is the output with the command-line options --torch --tf --jax:

# Benchmarking SimSIMD

- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
- PyTorch version: 2.4.1+cu121
- TensorFlow version: 2.17.0
- JAX version: 0.4.34
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296

[plot: speed-up chart including the PyTorch, TensorFlow, and JAX backends]

Some items for consideration:

jimthompson5802 commented 1 month ago

@ashvardanian I just realized I still have some test/debug code that affects the timing of the functions being tested. I've changed the PR back to DRAFT. I should have that code cleaned up by tonight.

jimthompson5802 commented 1 month ago

@ashvardanian Finished the code clean-up. Ready for your review.

With this clean-up, the overhead for collecting timings is reduced. Here is the updated default execution:

# Benchmarking SimSIMD

- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Plot file path: simsimd_speed_up.png
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296

[plot: updated speed-up chart for the default execution]

You should notice a significant change in the reported speed-up factors for the lower range of ndim values.

I added two additional CLI options: --plot_fp and --debug.

Here are all the CLI options:

usage: bench_perfplot.py [-h] [--ndim NDIM] [--torch] [--tf] [--jax] [--plot_fp PLOT_FP] [--debug]

Benchmark SimSIMD

options:
  -h, --help         show this help message and exit
  --ndim NDIM        Size of vectors to benchmark, either 'default' powers of 2 (from 1 to 32K) or comma-separated list of
                     integers
  --torch            Profile PyTorch
  --tf               Profile TensorFlow
  --jax              Profile JAX
  --plot_fp PLOT_FP  File to save the plot to, default: 'simsimd_speed_up.png'
  --debug            Provide additional debug information
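For reference, the --ndim value is interpreted roughly like this (a simplified sketch, not the exact parsing code in bench_perfplot.py):

```python
def parse_ndim(value: str) -> list[int]:
    # 'default' expands to powers of 2 from 1 to 32768 (32K);
    # otherwise the value is treated as a comma-separated list of integers.
    if value == "default":
        return [2**k for k in range(0, 16)]
    return [int(v) for v in value.split(",")]


assert parse_ndim("default")[:4] == [1, 2, 4, 8]
assert parse_ndim("128,1024,4096") == [128, 1024, 4096]
```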