PrithivirajDamodaran / FlashRank

Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cross-encoders and more. Created by Prithivi Da, open for PRs & Collaborations.
Apache License 2.0

Multi-lingual model results are not as expected #1

Closed · vijay1057 closed this issue 6 months ago

vijay1057 commented 6 months ago

Hi! Thank you for open-sourcing a sleek & wonderful package. We ran a couple of tests and noticed that nano and small were giving good (expected) results, while medium (multi-lingual) is not giving us good results.

Please find below the results of nano/small & medium for the example available in the README:

query = "Tricks to accelerate LLM inference"

passages = [
"Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
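For reference, a minimal sketch of the comparison loop I mean. The `Ranker` / `RerankRequest` usage and the model names below follow the current README and are assumptions on my part; the version that produced the results below seems to accept plain passage strings and return `{'score', 'passage'}` dicts, so the exact call may differ between versions:

```python
from flashrank import Ranker, RerankRequest

# Model names are assumptions based on the current README; the exact
# checkpoints behind "nano", "small" and "medium" may differ by version.
models = {
    "nano": "ms-marco-TinyBERT-L-2-v2",
    "small": "ms-marco-MiniLM-L-12-v2",
    "medium (multi-lingual)": "ms-marco-MultiBERT-L-12",
}

for label, model_name in models.items():
    ranker = Ranker(model_name=model_name, cache_dir="/tmp/flashrank")
    request = RerankRequest(
        query=query,
        passages=[{"id": i, "text": p} for i, p in enumerate(passages)],
    )
    results = ranker.rerank(request)  # passages with a 'score' field added
    print(label, [round(r["score"], 4) for r in results])
```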

Results:

Nano/Small:
[{'score': 0.9957617, 'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.9336851, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."},
{'score': 0.50486594, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.3989764, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'},
{'score': 0.05916641, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.'}]

Medium:
[{'score': 0.9666068, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."},
{'score': 0.9641034, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'},
{'score': 0.9625791, 'passage': 'Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'},
{'score': 0.95415944, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.'},
{'score': 0.9465828, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}]

As you can see, the scores from the medium model are all high and show very little variation between passages.

Please let me know if I missed anything in how I'm using it. Thank you.

PrithivirajDamodaran commented 6 months ago

I use nano and small in production; they work fine.

vijay1057 commented 6 months ago

Thank you for the prompt response, Prithivi.

For English, our quick experiment showed that the nano model provided good results in less time.

Hope I have provided the context in which we were using them. Once again, thank you for the response.

PrithivirajDamodaran commented 6 months ago

MS MARCO is largely based on Bing queries from Microsoft, so if your domain is even slightly more niche than general web-quality English text, it won't work well.

While these models will help with most general-purpose use cases, for the best reranking precision (or embedding recall, for that matter) I recommend fine-tuning your own cross-encoder on your data rather than relying on off-the-shelf rerank models.
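As a rough illustration (this is not part of FlashRank), a domain-specific cross-encoder can be fine-tuned with sentence-transformers' `CrossEncoder`; the base checkpoint, labels and hyperparameters below are illustrative assumptions, not a recipe from this repo:

```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Illustrative (query, passage) pairs with relevance labels from your own domain.
train_samples = [
    InputExample(texts=["Tricks to accelerate LLM inference",
                        "Introduce lookahead decoding: a parallel decoding algo ..."],
                 label=1.0),
    InputExample(texts=["Tricks to accelerate LLM inference",
                        "A step-by-step guide to setting up a Kubernetes cluster."],
                 label=0.0),
    # ... many more labelled pairs
]

# Start from a small MS MARCO cross-encoder and adapt it to the domain.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=100,
          output_path="my-domain-cross-encoder")
```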

If latency is your priority, the medium model is not going to help even if it performed well: nano can do 9000 pairs/sec, while medium can do 330 pairs/sec.
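If you want to check throughput on your own hardware, a rough timing sketch (reusing the assumed `Ranker` API from the earlier comment; numbers will vary with passage length and CPU):

```python
import time
from flashrank import Ranker, RerankRequest

def pairs_per_second(model_name, query, passages, repeats=10):
    """Rough throughput estimate: query-passage pairs scored per second."""
    ranker = Ranker(model_name=model_name, cache_dir="/tmp/flashrank")
    request = RerankRequest(
        query=query,
        passages=[{"id": i, "text": p} for i, p in enumerate(passages)],
    )
    ranker.rerank(request)  # warm-up: model load + first inference
    start = time.perf_counter()
    for _ in range(repeats):
        ranker.rerank(request)
    elapsed = time.perf_counter() - start
    return repeats * len(passages) / elapsed
```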

In case you need advisory or consulting on a bespoke model effort, reach out via LinkedIn.

Either way, I wish you all the best.