erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License
4.88k stars · 735 forks

Throughput (exposed as Queries per second (1/s)) needs to be measured and not derived from latency (1.0 / attrs["best_search_time"]) #423

Open filipecosta90 opened 1 year ago

filipecosta90 commented 1 year ago

In theory, if a system has a latency of 1 millisecond per request, one might think the system can handle 1000 (1/0.001) requests per second, implying a direct correlation between latency and throughput (treating throughput as the inverse of latency). However, in real-world scenarios, this correlation does not hold, for several reasons:

For these reasons, to get a comprehensive picture of a system's performance, we need to measure both latency and throughput. The current approach is too optimistic and not representative of a system's performance over time -- we're doing long-running benchmarks and only focusing on the best tiny portion to deduce throughput. Only by considering both metrics can we understand how well a system responds to individual requests and how effectively it processes a large volume of requests over time.

With the above in mind I would like to propose a PR for this tool that keeps track of throughput from start to end and uses the median/common case value as the reported "Queries per second" value. Agree?
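A minimal sketch of the difference between the two metrics (the function names here are illustrative, not part of ann-benchmarks; the only thing taken from the repo is that QPS is currently derived as `1.0 / attrs["best_search_time"]`):

```python
import statistics
import time


def run_queries(search_fn, queries):
    """Time each query individually and return per-query latencies in seconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - start)
    return latencies


def qps_metrics(latencies):
    """Contrast the current best-case metric with run-wide alternatives."""
    best = 1.0 / min(latencies)                  # current: 1 / best_search_time
    median = 1.0 / statistics.median(latencies)  # proposed: median/common case
    overall = len(latencies) / sum(latencies)    # total queries / total run time
    return best, median, overall
```

For a latency distribution with a long tail, `best` can be far above what the system actually sustains, while `median` and `overall` track the common case.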

erikbern commented 1 year ago

Given that the benchmarks are single threaded and serving exactly one request at any time, I think throughput is exactly the inverse of latency?

I agree that this isn't exactly what people care about in a real-world setting, but I think it would be quite complex to extend the benchmarks. You would have to make assumptions about the arrival process – both the overall rate and the distribution. So for instance, wrt the arrival process: do you assume it's exponential, or do you make some other assumption (bursty, like you mention)?

You would have to introduce a whole new axis in the benchmark and plot latency vs arrival rate. Plotting the tradeoff between arrival rate and latency would show the classic behavior where latency spikes to infinity as the arrival rate goes asymptotically towards the upper capacity.
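The behavior described above can be illustrated with a toy queueing simulation (this is a hypothetical sketch, not part of ann-benchmarks; it assumes Poisson arrivals and a fixed per-query service time, i.e. an M/D/1 queue):

```python
import random


def simulate_mean_latency(arrival_rate, service_time, n=50_000, seed=0):
    """Single-threaded server fed by a Poisson arrival process.

    Returns the mean time-in-system (queueing wait + service) per query,
    in the same time units as service_time.
    """
    rng = random.Random(seed)
    t_arrival = 0.0
    server_free_at = 0.0
    total_latency = 0.0
    for _ in range(n):
        t_arrival += rng.expovariate(arrival_rate)  # exponential inter-arrivals
        start = max(t_arrival, server_free_at)      # queue if the server is busy
        server_free_at = start + service_time
        total_latency += server_free_at - t_arrival
    return total_latency / n
```

With a 1 ms service time the server's capacity is 1000 QPS; mean latency stays near 1 ms at 500 QPS but grows sharply at 950 QPS, and diverges as the arrival rate approaches capacity – which is exactly the extra axis (arrival rate) that would have to be grid-searched.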

All of this is doable, but at the cost of

  1. Making the numbers a lot harder to parse
  2. Making the simulations a lot more expensive – you have a whole new axis to grid search
  3. Making the code more complex
  4. Introducing a whole new set of assumptions (arrival distributions etc)

My feeling is that it's worth simplifying reality a bit, and keeping the benchmarks single-threaded.

affansyed commented 1 month ago

@erikbern I am not sure I understand why the arrival process is a concern here -- the proposal seems to be just measuring the throughput experienced during a single run; the main variation in the single-threaded case would only be (if my understanding is on point) due to points 2 and 3 of @filipecosta90's argument.

Now, to be fair, those variations are typically implementation-dependent; ann-benchmarks seems to value algorithmic comparison over implementation details (although implementations of the same algorithms do get compared). I would wager that could be a reason to keep it the way it is; but on the other hand, we don't know how small a change @filipecosta90's PR would have been (it might be just a small change to compute the throughput).

erikbern commented 1 month ago

It would be a pretty big change to have a more complex arrival process and concurrent workers. You would have to simulate arrivals (using several assumptions). In addition, instead of a simple relationship between latency and throughput, you would get a tradeoff curve, which makes things a lot more complicated than a single value (and a lot more computationally expensive – you'd have to simulate n times with different arrival rates). So for these reasons it's not something I think makes sense for ann-benchmarks!

affansyed commented 1 month ago

I guess I can't speak on behalf of @filipecosta90, but it felt like we could keep it single-threaded and change nothing else, just monitor the throughput experienced across all queries (i.e. total # of queries / run time). That metric alone might be more "realistic".

But on the flip side, I do see your point regarding a more holistic approach to solving it with concurrent workers. I commented as I found the observation above relevant and was just clarifying my understanding.