Throughput (exposed as Queries per second (1/s)) needs to be measured and not derived from latency (1.0 / attrs["best_search_time"])

filipecosta90 commented 1 year ago

In theory, if a system has a latency of 1 millisecond for a request, one might think the system can handle 1000 (1/0.001) requests per second, implying a direct correlation between latency and throughput (considering the throughput as the inverse of latency). However, in real-world scenarios, this correlation does not hold due to several reasons:

Concurrency: Many systems can process multiple requests simultaneously. This means that while the system may take 1ms to respond to a single request, it can also process hundreds or thousands of other requests within that same 1ms timeframe, assuming sufficient resources are available. We can argue that this is not much of a concern in this type of single client benchmarks but please read the points bellow.
Resource Saturation: A system may experience an increase in latency as the load increases due to the saturation of system resources. This could be due to CPU limits, memory, network bandwidth, or I/O capacity, among others. The system's throughput would level off and could even decrease under extreme load conditions, even while latency continues to rise. If we just focus on the best search time like we're doing now we're not really measuring the implications of a long running benchmark on the resulting throughput.

Queuing Delays: In many systems, requests are queued before processing. Even if a single request can be processed quickly, if many requests are lined up for processing, the overall latency will be higher. This can happen even if the system's throughput is high.
- Network and System Behavior: Latency can also be affected by network factors like propagation delays, transmission medium, routing, etc., or system-level factors like hardware architecture, software efficiency, etc. These factors might not affect the throughput in the same proportion, creating disparities between the two. Let's consider cloud deployments that allow for a burst of requests and then reduce the achievable number of requests per second. The reported throughput would not be representative the "common-case" for the entire benchamrk run.

For these reasons, to have a comprehensive understanding of a system's performance, we need to measure both latency and throughput. The current approach is too optimistic and not representative of the a system's performance over time -- we're doing long running benchmarks and only focusing on the best tini-tiny portion to deduce throughput. Only by considering both of these metrics can we understand how well a system responds to individual requests and how effectively it processes a large volume of requests over time.

With the above in mind I would like to propose a PR for this tool that keeps track of throughput from start to end and uses the median/common case value as the reported "Queries per second" value. Agree?

erikbern commented 1 year ago

Given that the benchmarks are single threaded and serving exactly one request at any time, I think throughput is exactly the inverse of latency?

I agree that this isn't exactly what people care about in a real world setting, but I think it would be quite complex to extend the benchmarks. You would have to make assumptions about the arrival process – both the overall rate and the distribution. So for instance wrt the arrival process, do you assume it's exponential or do you make some other assumption (bursty, like you mention).

You would have to introduce a whole new axis in the benchmark and plot latency vs arrival rate. Plotting the tradeoff between arrival rate and latency would show the classic behavior where latency spikes to infinity as the arrival rate goes asymptotically towards the upper capacity.

All of this is doable, but at the cost of

Making the numbers a lot more hard to parse
Making the simulations a lot more expensive – you have a whole new axis to grid search
Making the code more complex
Introduce a whole new set of assumptions (arrival distributions etc)

My feeling is that it's worth simplifying reality a bit, and keeping the benchmarks single-threaded.

affansyed commented 1 month ago

@erikbern I am not sure I understand why arrival process is of concern here -- the proposal seems like just measuring the throughput experienced during a single run; the main variation in the single threaded case would only be (if my understanding is on point) due to point 2 and 3 of @filipecosta90 argument.

Now , tbf, those variations are typically implementation dependent; ANN benchmark seem to value algorithmic comparison over the implementation details (although implementations of same algorithms do get compared). I would wager that could be a reason to keep it the way it is; but alternatively we dont know how small of change @filipecosta90 PR would have been (might be just a small change to compute the throughput).

erikbern commented 1 month ago

It would be a pretty big change to have a more complex arrival process and have concurrent workers. You would have to simulate arrivals (using several assumptions). In addition instead of a simple relationship between latency and throughput, you would get a tradeoff curve which makes things a lot more complicated than just a singular value (and a lot more computationally expensive – you'd have to simulate n times with different arrival rates). So for these reasons it's not something I think makes sense for ann-benchmarks!

affansyed commented 1 month ago

I guess I can't speak on behalf of @filipecosta90, but if it felt like we can keep it single threaded and change nothing else, just monitor the throughput experienced across all queries (i.e Total # of queries / run_time). That metric alone might be more "realistic".

But on the flip side, I do see your point regarding a more holistic approach to solving it with concurrent workers. I commented as I found the observation above relevant and was just clarifying my understanding.

erikbern / ann-benchmarks

Throughput (exposed as Queries per second (1/s)) needs to be measured and not derived from latency (1.0 / attrs["best_search_time"]) #423