Right now only the float specialization is vectorized; the other specializations
will be added in a subsequent commit.
To benchmark the query, I've modified flann_example_cpp to run 1000
query loops instead of just one.
Before this change, computing the norm accounts for 36.4% of execution time.
Even though the loop is unrolled by a factor of 4, the loads, additions and
multiplications are still scalar. After the change, the loop is
vectorized. When a maximum distance is given, we still have to reduce
at every iteration to compare against it. Otherwise, a single reduction at
the end is enough.
In the former case (worst_dist >= 0), computing the norm drops to 29.8% of execution time, and total execution time drops from 35.1 s to 31.8 s (roughly a 10% improvement).
In the latter case (worst_dist < 0), computing the norm drops to 24.2% of execution time, and total execution time drops from 35.1 s to 28.2 s (roughly a 20% improvement).
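
To illustrate the two reduction strategies, here is a minimal portable sketch, not the code from this change. It assumes a 4-wide array of partial sums as a stand-in for one SIMD register; the names `kLanes`, `reduce` and `squared_l2` are hypothetical and chosen only for this example. With a bound (worst_dist >= 0) the lanes are summed after every block so the loop can exit early; without one (worst_dist < 0) they are summed once at the end.

```cpp
#include <cstddef>

// Hypothetical lane count: four partial sums stand in for one SIMD register.
constexpr std::size_t kLanes = 4;

// Horizontal reduction: collapse the per-lane partial sums into one scalar.
inline float reduce(const float (&acc)[kLanes]) {
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

// Squared L2 distance between a and b (n elements each).
// worst_dist < 0 means "no bound": reduce only once, after the loop.
// worst_dist >= 0 means "bounded": reduce after every block and exit early
// as soon as the partial distance exceeds the bound.
float squared_l2(const float* a, const float* b, std::size_t n,
                 float worst_dist = -1.0f) {
    float acc[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;

    // Main loop: each lane accumulates independently, which is the shape
    // a compiler (or hand-written intrinsics) can map onto vector instructions.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            const float diff = a[i + lane] - b[i + lane];
            acc[lane] += diff * diff;
        }
        if (worst_dist >= 0.0f) {
            const float partial = reduce(acc);  // per-iteration reduction
            if (partial > worst_dist) {
                return partial;                 // early exit
            }
        }
    }

    // Single reduction at the end, then a scalar tail for leftover elements.
    float result = reduce(acc);
    for (; i < n; ++i) {
        const float diff = a[i] - b[i];
        result += diff * diff;
    }
    return result;
}
```

Calling `squared_l2(a, b, n)` takes the single-reduction path, while `squared_l2(a, b, n, bound)` pays for a horizontal sum in every block, which is consistent with the smaller gain measured in the bounded case.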
Before:
After (worst_dist >= 0):
After (worst_dist < 0):