The `perfplot`-specific code in the proof-of-concept is located here:
https://github.com/jimthompson5802/SimSIMD/blob/1d46ea782f0ae52041b6cbfcacd3cc1d59b63a6d/python/bench_perfplot.py#L312-L345

One open question: `perfplot` has the ability to compare results for "equality"; right now that check is turned off. I'm thinking we would want to make sure we are getting the same numerical results. If so, what is the basis for comparing for equality in the context of processing a batch of row vectors?
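For context, here is a minimal sketch of the kind of `perfplot` setup that lives in that file; the setup and kernel callables below are illustrative stand-ins (not the code at the link), and the equality check is disabled via `equality_check=None`, as mentioned above.

```python
import numpy as np
import perfplot
import simsimd as simd

def setup(ndim):
    # A batch of row vectors with the given dimensionality (illustrative shapes).
    a = np.random.rand(100, ndim).astype(np.float32)
    b = np.random.rand(100, ndim).astype(np.float32)
    return a, b

perfplot.show(
    setup=setup,
    kernels=[
        lambda ab: np.sum(ab[0] * ab[1], axis=1),  # row-wise dot products, stand-in for the np.dot baseline
        lambda ab: simd.dot(ab[0], ab[1]),         # SimSIMD equivalent (assumed call)
    ],
    labels=["np.dot", "simd.dot"],
    n_range=[2**k for k in range(1, 12)],
    xlabel="ndim",
    equality_check=None,  # equality comparison turned off, as noted above
)
```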
I think it may be a good idea to show relative speedups instead of raw durations.
Alternatively, it may be a good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.
What do you think?
@jimthompson5802 equality comparison logic in such kernels is very tricky and already handled by the tests using fixtures. Let's avoid overcomplicating the benchmark with equality checks.
By speed-up, I assume you mean the speed-up compared to a "baseline" function. In the proof-of-concept situation, the baseline function is `np.dot`, and we should show the speed-up for `simd.dot` for the different values of `ndim`. Is this the correct understanding?
Updated my comment above for clarity.
@ashvardanian I can get `perfplot` to produce a "speed-up" plot. However, this requires two parameters, one of which seems like an arbitrary setting. With these settings, `perfplot` generates a plot like this.

For this plot, `np.dot` is designated as the "baseline" function, so its speed-up value is 1.0. All the others are relative to `np.dot`. In this situation, `simd.dot` is "faster" because its speed-up factor is > 1 for most values of `ndim`. OTOH, `spd.cosine` is "slower" because its speed-up factor is < 1.

Is this closer to what you are expecting?
To get this plot, I had to make use of a parameter called `flops=`. The documentation for this parameter is sparse, non-existent AFAICT. In the `perfplot` sample code, this parameter is a Python callable that returns (I am guessing here) a value indicating the number of FLOPs for the size of the vector (`ndim`). The presence of this value, together with a value for the `relative_to` parameter, causes `perfplot` to compute the "speed-up" from the time durations as follows: https://github.com/nschloe/perfplot/blob/f224a8959f16757f327bfc8f4aba59218f1d0a46/src/perfplot/_main.py#L116-L141
```python
else:
    flops = self.timings_s[relative_to] / self.timings_s
    plt.title(f"FLOPS relative to {self.labels[relative_to]}")
```
Smaller values of `self.timings_s` relative to `self.timings_s[relative_to]` will lead to values > 1.0.
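In other words, the reported "speed-up" is just the baseline's timings divided element-wise by each kernel's timings. A small sketch of that arithmetic, with made-up timing numbers:

```python
import numpy as np

# Hypothetical per-ndim timings in seconds; rows = kernels, columns = ndim values.
timings_s = np.array([
    [1.0e-6, 2.0e-6, 4.0e-6],   # np.dot (baseline)
    [0.5e-6, 1.0e-6, 3.0e-6],   # simd.dot
    [2.0e-6, 4.0e-6, 8.0e-6],   # spd.cosine
])
relative_to = 0  # index of the baseline kernel, np.dot

# Same computation as the perfplot excerpt above: baseline time / kernel time.
speed_up = timings_s[relative_to] / timings_s
print(speed_up)
# The baseline row is all 1.0; kernels faster than the baseline get factors > 1.0, slower ones < 1.0.
```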
I've looked, but it appears we have little control over the appearance of the `perfplot`-generated plot. I can't find a `perfplot`-provided means to customize the plot. For example, if we wanted to change the plot title or add annotations to the plot, it is not clear how we could do that.

I'm pointing this out because meeting this requirement:

> Alternatively, it may be a good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.

may require some control over how the plot is created.

Assuming the above sample plot is heading in the right direction, I'm now going to focus on how to meet the requirement stated above.
> For this plot, `np.dot` is designated as the "baseline" function, so its speed-up value is 1.0. All the others are relative to `np.dot`. In this situation, `simd.dot` is "faster" because its speed-up factor is > 1 for most values of `ndim`. OTOH, `spd.cosine` is "slower" because its speed-up factor is < 1.
We shouldn't put `spd.cosine` and `simd.dot` on the same chart. There should be a separate chart for every function name. In this case, if the dot products are reviewed, `np.dot<float64>` should be the baseline, but the chart can also feature `np.dot<float16>`, `np.dot<float32>`, `np.dot<int8>`, `simd.dot<float64>`, `simd.dot<float32>`, `simd.dot<float16>`, `simd.dot<bfloat16>`:

> good idea to show performance (speedup) for different numeric types using NumPy/SciPy performance for double-precision as a baseline. So we have one chart for every kernel (be it dot products or cosine distances) with lines for every backend and every numeric type.
We may also want to use the vector size in bytes on the X axis, as opposed to dimensions. Can make a boolean option for that in the CLI.
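To make the intended chart layout concrete, here is a rough sketch of how the per-line labels and kernels for a single `dot` chart could be enumerated; the backend callables, dtype lists, and the bytes-on-the-X-axis helper are assumptions for illustration, not the PR's implementation:

```python
import numpy as np

# Illustrative dtype sets per backend; the baseline line is np.dot on float64.
NUMPY_DTYPES = [np.float64, np.float32, np.float16, np.int8]
SIMSIMD_DTYPES = [np.float64, np.float32, np.float16]  # bfloat16 would need special handling in NumPy

def make_labels_and_kernels(np_dot, simd_dot):
    """Build the per-line (label, kernel) pairs for a single 'dot' chart.

    np_dot and simd_dot are assumed callables taking (batch_a, batch_b, dtype).
    The first entry, np.dot<float64>, is the baseline (relative_to=0 in perfplot terms).
    """
    pairs = []
    for dtype in NUMPY_DTYPES:
        pairs.append((f"np.dot<{np.dtype(dtype).name}>", lambda ab, d=dtype: np_dot(*ab, d)))
    for dtype in SIMSIMD_DTYPES:
        pairs.append((f"simd.dot<{np.dtype(dtype).name}>", lambda ab, d=dtype: simd_dot(*ab, d)))
    return pairs

def x_values(ndims, itemsize, bytes_on_x_axis=False):
    # Optional CLI switch: plot vector size in bytes instead of dimensions.
    return [n * itemsize for n in ndims] if bytes_on_x_axis else list(ndims)
```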
@ashvardanian Here is an updated sample plot. For testing purposes I'm only using 2 data types per function call. I believe this is more in line with what you are looking for.

I still have to do more due diligence to confirm the speed-up computations are what you are looking for.

Let me know what you think.
@ashvardanian I've incorporated what we discussed. I'm marking this ready for review. Here is a summary of key changes:

- Benchmarks the `dot` operation for numpy, simsimd, tensorflow, torch and jax.
- Sample plot: `simsimd_speed_up_numpy.dot(f64).png`

Supported command-line args:
```
usage: bench_perfplot.py [-h] [--ndim NDIM] [--torch] [--tf] [--jax]

Benchmark SimSIMD

options:
  -h, --help   show this help message and exit
  --ndim NDIM  Size of vectors to benchmark, either 'default' powers of 2 (from 1 to 32K) or comma-separated list of integers
  --torch      Profile PyTorch
  --tf         Profile TensorFlow
  --jax        Profile JAX
```
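As a reference for the `--ndim` behavior described in the help text, here is a minimal parsing sketch; the helper name is hypothetical, not the actual code in `bench_perfplot.py`:

```python
def parse_ndim(value: str) -> list[int]:
    """Turn the --ndim argument into a list of vector dimensions.

    'default' expands to powers of 2 from 1 to 32768 (32K); otherwise the
    value is treated as a comma-separated list of integers.
    """
    if value == "default":
        return [2**k for k in range(0, 16)]  # 1, 2, 4, ..., 32768
    return [int(x) for x in value.split(",")]

# Example usage (hypothetical):
#   parse_ndim("default")   -> [1, 2, 4, ..., 32768]
#   parse_ndim("128,1024")  -> [128, 1024]
```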
Here is the output for the default execution:

```
# Benchmarking SimSIMD
- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296
```
Here is the output with the command-line options `--torch --tf --jax`:

```
# Benchmarking SimSIMD
- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
- PyTorch version: 2.4.1+cu121
- TensorFlow version: 2.17.0
- JAX version: 0.4.34
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296
```
Some items for consideration:

`perfplot` does not seem to allow the user to control plot appearance, i.e., we are unable to change the line type, color, or title of the plot. If this level of control is needed, we have to work outside of `perfplot` to display the results. We can still use `perfplot` to derive the timings. I've looked at `perfplot`'s internal data structures and can easily extract the timings; from the extracted timings, we can generate the plots in our own code, along the lines of the sketch below.
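A rough sketch of that approach, assuming the object returned by `perfplot.bench()` exposes the `timings_s`, `labels`, and `n_range` attributes referenced in the excerpt above (the stand-in kernels and attribute details are assumptions, not the PR's code):

```python
import matplotlib.pyplot as plt
import numpy as np
import perfplot

# Minimal stand-in kernels; the real benchmark would use np.dot / simd.dot wrappers.
labels = ["np.dot", "alt"]
kernels = [
    lambda ab: np.sum(ab[0] * ab[1], axis=1),
    lambda ab: (ab[0] * ab[1]).sum(axis=1),
]

data = perfplot.bench(
    setup=lambda n: (np.random.rand(100, n), np.random.rand(100, n)),
    kernels=kernels,
    labels=labels,
    n_range=[2**k for k in range(0, 12)],
    equality_check=None,
)

# Speed-up relative to the first kernel (the baseline), mirroring perfplot's own formula.
speed_up = data.timings_s[0] / data.timings_s

# Plot with matplotlib so we control the title, line styles, annotations, etc.
fig, ax = plt.subplots()
for label, line in zip(labels, speed_up):
    ax.plot(data.n_range, line, marker="o", label=label)
ax.set_xscale("log", base=2)
ax.set_xlabel("ndim")
ax.set_ylabel("speed-up vs np.dot")
ax.set_title("SimSIMD speed-up relative to NumPy")  # a custom title, which perfplot itself doesn't expose
ax.legend()
fig.savefig("simsimd_speed_up.png")
```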
@ashvardanian I just realized I still have some test/debug code that affects the timing of the function being tested. I've changed the PR back to DRAFT. I should have that code cleaned up by tonight.
@ashvardanian Finished the code clean-up. Ready for your review.

With this cleanup, there is reduced overhead for collecting timings. Here is the updated default execution:
```
# Benchmarking SimSIMD
- Vector dimensions: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
- Plot file path: simsimd_speed_up.png
- Hardware capabilities: serial, haswell
- SimSIMD version: 5.5.0
- NumPy version: 1.26.4
-- NumPy BLAS dependency: openblas64
-- NumPy LAPACK dependency: dep140213194937296
```
You should notice a significant change in the reported speed-up factors for the lower range of `ndim` values.

I added two additional CLI options:

- `plot_fp`: specifies the output plot file name. Default is `simsimd_speed_up.png`.
- `debug`: provides useful information when testing new kernels. This will affect the speed-up factors, especially on the smaller range of `ndim`, as noted earlier.

Here are all the CLI options:
```
usage: bench_perfplot.py [-h] [--ndim NDIM] [--torch] [--tf] [--jax] [--plot_fp PLOT_FP] [--debug]

Benchmark SimSIMD

options:
  -h, --help         show this help message and exit
  --ndim NDIM        Size of vectors to benchmark, either 'default' powers of 2 (from 1 to 32K) or comma-separated list of integers
  --torch            Profile PyTorch
  --tf               Profile TensorFlow
  --jax              Profile JAX
  --plot_fp PLOT_FP  File to save the plot to, default: 'simsimd_speed_up.png'
  --debug            Provide additional debug information
```
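To illustrate why `--debug` affects the speed-up factors (especially for small `ndim`): any extra checks or printing that end up inside the timed kernel callables are counted in the measurement. A hypothetical wrapper, not the actual code in `bench_perfplot.py`, showing the pattern:

```python
import numpy as np

def make_kernel(fn, debug=False):
    """Wrap a benchmark kernel; with debug=True, add sanity checks inside the timed call."""
    if not debug:
        return fn

    def checked(ab):
        a, b = ab
        # These checks and the print run inside the timed region, inflating timings for small ndim.
        assert a.shape == b.shape, f"shape mismatch: {a.shape} vs {b.shape}"
        assert a.dtype == b.dtype, f"dtype mismatch: {a.dtype} vs {b.dtype}"
        result = fn(ab)
        print(f"ndim={a.shape[1]}, dtype={a.dtype}, result dtype={np.asarray(result).dtype}")
        return result

    return checked
```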
Addresses Issue #193.
@ashvardanian Here is the start of the PR. I'm marking it in DRAFT mode because more work is needed on it.

I created a new module `benchmark_perfplot.py` as a proof-of-concept of using `perfplot`. Right now, much of the use of `perfplot` is hard-coded. For this proof-of-concept, the benchmark compares `np.dot` against `simd.dot` with one dtype. The plot shows timings for different values of `ndim`.

As I understand the structure of the original `benchmark.py`, I'm keeping the same structure. The timing is on how long it takes to process a batch of row vectors, i.e., if the batch is of size `n` and each row is of size `ndim`, then the time reported is the total time it took to process all of the `n` rows of the batch.

Here is a `perfplot`-generated plot for this proof-of-concept.

Let me know what you think.
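For clarity on that timing structure, here is a small sketch of what a timed kernel over a batch of row vectors could look like; the looped `np.dot` baseline and batch size are illustrative assumptions, not the module's actual code:

```python
import numpy as np

BATCH_SIZE = 1000  # n: number of row vectors processed per timed call

def setup(ndim):
    # One batch of n row vectors, each of length ndim.
    a = np.random.rand(BATCH_SIZE, ndim)
    b = np.random.rand(BATCH_SIZE, ndim)
    return a, b

def numpy_dot_batch(ab):
    # The timed unit of work: process every row in the batch, so the reported
    # duration is the total time for all n rows, not a single dot product.
    a, b = ab
    return [np.dot(a[i], b[i]) for i in range(len(a))]
```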