bclavie / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.

Discrepancy in CPU Inference Latency: Cross-Encoder MiniLM Models vs. ColBERT #190

Open gaceladri opened 2 months ago

gaceladri commented 2 months ago

Greetings :wave:

I've been benchmarking CPU inference latency for various models and observed some significant differences. Specifically, I'm comparing the sentence_transformers 'cross-encoder/ms-marco-MiniLM-L-12-v2' against other models. The latency for top-10 re-ranking varies quite a bit, and I'm trying to understand whether this is expected behavior or whether there might be an issue with my setup. For clarity, here's a quick summary of the latencies I've recorded:

[attached image: table of recorded re-ranking latencies per model]

Could someone please shed some light on this? Is there a particular reason why the ColBERT model has over double the latency of MiniLM-L-12-v2? Any insights or suggestions for improving the inference speed of ColBERT on CPU would be greatly appreciated.
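
For context, here's roughly the shape of my timing harness (a minimal sketch: the query and documents below are placeholders rather than my real benchmark data, and I'm assuming RAGatouille's `rerank()` entry point):

```python
import time

from sentence_transformers import CrossEncoder
from ragatouille import RAGPretrainedModel

# Placeholder data -- the real benchmark uses my own corpus and queries.
query = "what is late interaction retrieval?"
docs = [f"candidate passage number {i}" for i in range(10)]  # top-10 rerank

# Cross-encoder reranking: every (query, doc) pair is scored jointly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cpu")
start = time.perf_counter()
cross_encoder.predict([(query, d) for d in docs])
print(f"MiniLM-L-12 rerank: {time.perf_counter() - start:.3f}s")

# ColBERT reranking via RAGatouille.
colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
start = time.perf_counter()
colbert.rerank(query=query, documents=docs, k=10)
print(f"ColBERTv2 rerank:   {time.perf_counter() - start:.3f}s")
```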

bclavie commented 2 months ago

I think this isn't shocking given how reranking with ColBERT works, though I'd expect it to be a bit quicker. The main factor at play here is model size: MiniLM-L-12 is just ~35M parameters (L-6 is around 22M for comparison), whereas ColBERTv2 is ~110M, a bit more than 3 times as big, which explains why it runs considerably slower.

The strength of ColBERT as a reranker, however, is that you can pre-compute document representations in advance, which you cannot do with cross-encoders. In such a set-up it'd run noticeably quicker than cross-encoder alternatives!
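
Roughly what that looks like (a minimal sketch, assuming the in-memory `encode()` / `search_encoded_docs()` helpers; the documents below are placeholders):

```python
from ragatouille import RAGPretrainedModel

colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# One-off cost, paid ahead of time: encode the candidate documents so their
# token-level representations are cached in memory.
documents = ["first candidate passage ...", "second candidate passage ..."]
colbert.encode(documents)

# At query time only the (short) query goes through the transformer; scoring
# is just MaxSim against the cached document embeddings, which is cheap.
results = colbert.search_encoded_docs(query="what is late interaction?", k=2)
print(results)
```

The point is that the heavy encoder work for the documents has already been paid before any query arrives, whereas a cross-encoder has to run the full model over every (query, document) pair at query time.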