Closed vijay1057 closed 6 months ago
The single pair you used was given by us as part of the demo we showed. So please understand: I have run that pair, and a lot more, across at least 25 different models before picking the best.
This is not how you test for precision or time performance; this is called cherry-picking. You have to evaluate a large sample: English pairs for the English-only models, and other languages for the multilingual model.
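To make the point concrete, here is a minimal evaluation sketch (pure Python; `score_fn` is a hypothetical stand-in for whichever reranker you are testing) that measures precision@1 and MRR over a labeled sample instead of a single pair:

```python
def evaluate(score_fn, samples):
    """Rank each query's passages with score_fn and aggregate
    precision@1 and MRR, instead of eyeballing one pair.

    samples: list of (query, passages, index_of_relevant_passage).
    score_fn(query, passage) -> float, higher = more relevant.
    """
    p_at_1 = 0.0
    mrr = 0.0
    for query, passages, gold in samples:
        ranked = sorted(range(len(passages)),
                        key=lambda i: score_fn(query, passages[i]),
                        reverse=True)
        rank = ranked.index(gold) + 1  # 1-based rank of the relevant passage
        p_at_1 += (rank == 1)
        mrr += 1.0 / rank
    n = len(samples)
    return p_at_1 / n, mrr / n
```

With a few hundred labeled samples drawn from your own domain, numbers like these say far more about a model than any single demo pair.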
While the multilingual model supports English, it is ideally suited for non-English languages. Why are you even trying English pairs with the multilingual model? If nano and small work great, use them for English.
Also, with the multilingual model you can do cross-lingual queries (e.g. an English query with Spanish passages); that might work for a few samples, but it is generally not recommended.
We wanted to offer the best blend of speed, lightness, and precision, so we chose the cross-encoders that had the best performance from their training on Microsoft's MS MARCO dataset.
I use nano and small in production; they work fine.
Thank you for prompt response Prithivi.
For English, our quick experiment showed the nano model provided good results in less time.
Hope I have provided the context in which we were using them. Once again, thank you for the response.
MS MARCO is largely based on Bing queries from Microsoft. So if your domain is even slightly more niche than general web-quality English text, it won't work well.
While these models will help most general-purpose use cases, I recommend you fine-tune your own cross-encoder on your data rather than use off-the-shelf rerank models, for the best reranking precision (or embedding recall, for that matter).
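For anyone heading down the fine-tuning route, a minimal, library-agnostic sketch of shaping in-domain data into the binary-labeled (query, passage) pairs that cross-encoder training typically consumes (the record layout here is an illustrative assumption, not any particular library's format):

```python
def build_pairs(records):
    """Flatten (query, positive, negatives) records into labeled pairs.

    Each record: (query, relevant_passage, [irrelevant_passages]).
    Returns (query, passage, label) tuples: label 1 for relevant, 0 otherwise.
    """
    pairs = []
    for query, positive, negatives in records:
        pairs.append((query, positive, 1))
        for neg in negatives:
            pairs.append((query, neg, 0))
    return pairs
```

Hard negatives (passages your retriever returns but which are not relevant) tend to teach a reranker far more than random negatives.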
If latency is your priority, using the medium model is not going to help even if it performed well: nano can do 9000 pairs/sec, while medium can do 330 pairs/sec.
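A quick back-of-the-envelope check of what those throughput figures mean per query (the only inputs here are the pairs/sec numbers quoted above):

```python
# Rough per-query latency from throughput: latency = pairs / (pairs per second).
NANO_PAIRS_PER_SEC = 9000   # figure quoted above
MEDIUM_PAIRS_PER_SEC = 330  # figure quoted above

def rerank_latency_ms(num_pairs, pairs_per_sec):
    """Milliseconds to score num_pairs query-passage pairs."""
    return 1000.0 * num_pairs / pairs_per_sec

# Reranking 100 retrieved candidates for one query:
print(round(rerank_latency_ms(100, NANO_PAIRS_PER_SEC), 1))    # ~11.1 ms
print(round(rerank_latency_ms(100, MEDIUM_PAIRS_PER_SEC), 1))  # ~303.0 ms
```

At 100 candidates per query, that is the difference between a barely noticeable overhead and a third of a second added to every request.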
In case you need advisory or consulting on a bespoke model effort, reach out via LinkedIn.
Either way I wish you all the best.
Hi! Thank you for open-sourcing a sleek & wonderful package. We performed a couple of tests and noticed that nano and small were giving good (expected) results, while medium (multilingual) was not.
Please find below the results of nano/small & medium for the example available in the readme page:

```python
query = "Tricks to accelerate LLM inference"
passages = [
    "Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. ",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. ",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
```

Results:

Nano/Small:

```python
[{'score': 0.9957617, 'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
 {'score': 0.9336851, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
 {'score': 0.50486594, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
 {'score': 0.3989764, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'},
 {'score': 0.05916641, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '}]
```

Medium:

```python
[{'score': 0.9666068, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "},
 {'score': 0.9641034, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
 {'score': 0.9625791, 'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
 {'score': 0.95415944, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '},
 {'score': 0.9465828, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}]
```

As we can notice, the medium scores are uniformly high and show little variation.
Please guide me if I have missed anything in how we are using them. Thank you.
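One general caveat worth keeping in mind here (an observation about cross-encoders broadly, not this package specifically): raw scores are not calibrated across different models, so the absolute magnitudes from medium and nano/small are not directly comparable; within one model, only the rank order matters. A min-max rescaling of the medium scores above makes the small relative differences visible:

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the relative spread is visible."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Medium-model scores from the results above.
medium = [0.9666068, 0.9641034, 0.9625791, 0.95415944, 0.9465828]
print([round(s, 3) for s in min_max(medium)])  # roughly [1.0, 0.875, 0.799, 0.378, 0.0]
```

The rescaled values show the model does discriminate between passages; whether that ordering is the right one is a separate question, which is exactly what a larger labeled sample would answer.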