ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

AVX2 and SSE4 implementations #132

Open MBkkt opened 3 weeks ago

MBkkt commented 3 weeks ago

Most of your x86_64 code sits under the Skylake ifdef.

Not all servers have very modern CPUs. For example, simdjson and simdutf provide Haswell and Westmere overloads.

Examples under Apache 2.0:
https://github.com/google-research/google-research/tree/master/scann/scann/distance_measures/one_to_one
https://github.com/ydb-platform/ydb/tree/main/library/cpp/dot_product
https://github.com/ydb-platform/ydb/tree/main/library/cpp/l1_distance
https://github.com/ydb-platform/ydb/tree/main/library/cpp/l2_distance

Also, a lot of code is compiled with -mpreferred-vector-width=256 because of downclocking issues.

This can be detected at runtime to choose a smaller vector size on CPUs that predate Ice Lake and Rocket Lake: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
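
As a rough sketch of such a runtime check, assuming GCC or Clang on x86: the CPUID leaf 1 signature can be decoded into a family and model, and AVX-512 parts older than Ice Lake can be flagged as preferring 256-bit vectors. The function names and the short model list (0x55 for Skylake-X, Cascade Lake, and Cooper Lake; 0x66 for Cannon Lake) are illustrative assumptions, not SimSIMD code. The list is not exhaustive, and the vendor check is omitted for brevity.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Decode the display family from the EAX value returned by CPUID leaf 1. */
unsigned cpu_display_family(uint32_t signature) {
    unsigned family = (signature >> 8) & 0xFu;
    unsigned extended_family = (signature >> 20) & 0xFFu;
    /* The extended family bits are only added for base family 15. */
    if (family == 15u) family += extended_family;
    return family;
}

/* Decode the display model from the EAX value returned by CPUID leaf 1. */
unsigned cpu_display_model(uint32_t signature) {
    unsigned family = (signature >> 8) & 0xFu;
    unsigned model = (signature >> 4) & 0xFu;
    unsigned extended_model = (signature >> 16) & 0xFu;
    /* The extended model bits only apply to base families 6 and 15. */
    if (family == 6u || family == 15u) model |= extended_model << 4;
    return model;
}

/* Hypothetical policy: returns 1 for family 6 models that are assumed
 * to downclock noticeably under heavy 512-bit use, 0 otherwise. */
int signature_prefers_256bit(uint32_t signature) {
    if (cpu_display_family(signature) != 6u) return 0;
    switch (cpu_display_model(signature)) {
    case 0x55u: return 1; /* Skylake-X / Cascade Lake / Cooper Lake */
    case 0x66u: return 1; /* Cannon Lake */
    default: return 0;
    }
}

/* Query the host CPU; reports 0 on non-x86 targets or if leaf 1 is unavailable. */
int host_prefers_256bit(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return signature_prefers_256bit(eax);
#else
    return 0;
#endif
}
```

Keeping the decoding separate from the CPUID call makes the policy testable with known signatures, independent of the machine running the test.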

ashvardanian commented 3 weeks ago

Hi @MBkkt, thanks for the links, very useful! We have a lot of kernels for SIMSIMD_TARGET_HASWELL, potentially avoiding f32 overloads in some cases. Smaller registers and single/double-precision types don't often yield substantial boosts. Which kernels and CPU generations are you most interested in?

> Also, a lot of code is compiled with -mpreferred-vector-width=256 because of downclocking issues.
> This can be detected at runtime to choose a smaller vector size on CPUs that predate Ice Lake and Rocket Lake: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

This might be an interesting addition. We can probably add a helper function that detects potential down-clocking, whose result users would pass as the `allowed` argument of the following dynamic dispatch function:

/**
 *  @brief  Determines the best suited metric implementation based on the given datatype,
 *          supported and allowed by hardware capabilities.
 *
 *  @param kind The kind of metric to be evaluated.
 *  @param datatype The data type for which the metric needs to be evaluated.
 *  @param supported The hardware capabilities supported by the CPU.
 *  @param allowed The hardware capabilities allowed for use.
 *  @param metric_output Output variable for the selected similarity function.
 *  @param capability_output Output variable for the utilized hardware capabilities.
 */
SIMSIMD_PUBLIC void simsimd_find_metric_punned( //
    simsimd_metric_kind_t kind,                 //
    simsimd_datatype_t datatype,                //
    simsimd_capability_t supported,             //
    simsimd_capability_t allowed,               //
    simsimd_metric_punned_t* metric_output,     //
    simsimd_capability_t* capability_output);

Would that make sense? Can you contribute it?
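
A minimal sketch of how such a helper could feed the `allowed` argument, assuming capabilities are bit flags that can be masked. The `cap_*` constants, `restrict_allowed`, and `pick_capability` are hypothetical stand-ins defined locally for illustration; they are not SimSIMD's actual enumerators or API.

```c
/* Hypothetical stand-ins for capability bit flags; not SimSIMD's real values. */
typedef enum {
    cap_serial = 1 << 0, /* portable scalar kernels */
    cap_avx2 = 1 << 1,   /* 256-bit kernels */
    cap_avx512 = 1 << 2  /* 512-bit kernels */
} capability_t;

/* Drop the 512-bit capability from the allowed set when down-clocking is expected. */
unsigned restrict_allowed(unsigned allowed, int downclocking_expected) {
    if (downclocking_expected) allowed &= ~(unsigned)cap_avx512;
    return allowed;
}

/* Pick the widest capability that is both supported and allowed,
 * falling back to the serial kernels when nothing wider is usable. */
unsigned pick_capability(unsigned supported, unsigned allowed) {
    unsigned usable = supported & allowed;
    if (usable & cap_avx512) return cap_avx512;
    if (usable & cap_avx2) return cap_avx2;
    return cap_serial;
}
```

With this shape, the hardware-reported set stays untouched and the policy lives entirely in the `allowed` mask, so callers who do not care about down-clocking can keep passing every capability.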

MBkkt commented 3 weeks ago

> Would that make sense?

Sounds like a good solution to me.

> Can you contribute it?

Probably not now :(