castorini / rank_llm

Repository for prompt-decoding using LLMs (GPT-3.5, GPT-4, Vicuna, and Zephyr)
http://rankllm.ai
Apache License 2.0

Thoughts about design philosophy of RankLLM #109

Open · lintool opened this issue 2 months ago

lintool commented 2 months ago

What is RankLLM? I can think of two obvious answers:

Approach 1. RankLLM is a fully-integrated layer on top of Anserini and Pyserini.

If this is the case, then we need "deep" integration with Pyserini, pulling it in as a dependency (perhaps with parts of it optional, etc.). Iteration would need to be coupled with Pyserini and would likely be slower.
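Concretely, "parts of it optional" could mean a packaging extra, so that users who bring their own candidates never pull in the JVM/Anserini stack. A rough sketch only; the package and dependency names here are illustrative, not a proposal for the actual setup:

```python
# setup.py sketch for Approach 1 (names and versions are placeholders):
# Pyserini becomes a first-class, but optional, dependency of RankLLM.
from setuptools import find_packages, setup

setup(
    name="rank-llm",
    version="0.0.0",  # placeholder
    packages=find_packages(),
    install_requires=[
        # core reranking stack (illustrative)
        "torch",
        "transformers",
    ],
    extras_require={
        # pulled in only with: pip install rank-llm[pyserini]
        "pyserini": ["pyserini"],
    },
)
```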

Approach 2. RankLLM is a lightweight general-purpose reranking library.

Basically, we can rerank anything... just give us something in this JSON format, and we'll rerank it for you. By the way, you can get the candidates from Pyserini; here's the command you run.
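As a strawman, the "bring your own data" payload could look something like this (shown as a Python literal for brevity; the field names are illustrative, not a committed schema):

```python
# Hypothetical rerank request: one entry per query, each carrying the
# first-stage candidates, however they were retrieved.
request = {
    "query": {"qid": "q1", "text": "how do neural rerankers work"},
    "candidates": [
        {"docid": "doc42", "score": 14.2, "doc": {"contents": "first-stage hit text ..."}},
        {"docid": "doc7", "score": 12.9, "doc": {"contents": "another candidate ..."}},
    ],
}
```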

In this case, RankLLM does not need to have Pyserini as a dependency. We just need shim code in Pyserini to get its output into the right format, and the same for Anserini directly.

Integration is not as tight - but this simplifies dependencies quite a bit...
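The Pyserini-side shim really is just a few lines of glue. A sketch under the illustrative schema above (the prebuilt index name is only an example):

```python
# Dump BM25 candidates from Pyserini into the rerank-request format sketched above.
import json

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

def to_rerank_request(qid: str, query: str, k: int = 100) -> dict:
    """Run first-stage retrieval and package the hits for a reranker."""
    candidates = []
    for hit in searcher.search(query, k=k):
        raw = json.loads(searcher.doc(hit.docid).raw())  # stored JSON document
        candidates.append(
            {"docid": hit.docid, "score": float(hit.score), "doc": {"contents": raw["contents"]}}
        )
    return {"query": {"qid": qid, "text": query}, "candidates": candidates}

with open("rerank_requests.jsonl", "w") as out:
    out.write(json.dumps(to_rerank_request("q1", "how do neural rerankers work")) + "\n")
```

An Anserini-side shim could do much the same starting from a TREC run file plus the raw corpus.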


Thoughts about these two approaches @ronakice @sahel-sh ?

sahel-sh commented 2 months ago

Copy-pasting comments from our Slack discussion:

Ronak: "yup these are two directions we can take it in. I am not sure what do you prefer @Sahel Sharifymoghaddam ? I think people probably want more in the Approach 2 direction. For me I always run baselines in Anserini/Pyserini so approach 1 is completely fine too;"

Sahel: "When I decided to keep it as a separate repo, I had option 2 in mind as well. That's why it has a pyserini retriever. I see pros and cons for both approaches. My main concern about the #1 is expanding rankllm. For example adding training and other types of ranking prompts like pointwise. I think having it separate might make it easier to expand. I think Pyserini on its own is large enough and expanding. But I also see for us as a lab, a cohesive retrieval and rerank can be an umbrella for everything, for example repllama and rankllama. I personally prefer #2 for easier maintenance and greater visibility. But I don't think we should decide based on that. The main question is: moving forward what would be the main usage of rankllm. If it is some basic retrieval with study of llms for ranking, it is fine as is.(I.e.a pyserini wrapper inside rankllm repo for an optional retrieval, or bring your own retrieved data like Ronak does, or heavily caching/storing retrieved results) If you think retrieval would be equally important to our users, then maybe keeping it in the same repo as pyserini guarantees better feature parity. Like some new retriever would be directly available for reranking too)"

sahel-sh commented 2 months ago

The current state of the design is available in these examples:

Comment from @lintool on the current design: I like calling Approach 2 "bring your own data". The current design is the worst of both worlds, in the sense that (1) it's difficult for us to maintain, and (2) the user doesn't know what to do.

ronakice commented 2 months ago

Just dropping thoughts here:

Quoting @sahel-sh: "My main concern about #1 is expanding RankLLM, for example adding training and other types of ranking prompts like pointwise. I think having it separate might make it easier to expand; Pyserini on its own is large enough and still expanding."

I'm not sure Approach 1 would hold it back in general (besides training). With training specifically, I think the dependency chart shared with Pyserini will be affected (training frameworks cycle through torch/transformers versions quickly, while Pyserini moves more slowly). Another issue with adding training is that we'd additionally have to re-benchmark our models whenever those changes are made, especially if we maintain 2CR pages.

Quoting @sahel-sh again: "I personally prefer #2 for easier maintenance and greater visibility, but I don't think we should decide based on that. The main question is: moving forward, what would be the main usage of RankLLM? If it is some basic retrieval with a study of LLMs for ranking, it is fine as is (i.e., a Pyserini wrapper inside the rank_llm repo for optional retrieval, or bring your own retrieved data like Ronak does, or heavy caching/storing of retrieved results). If you think retrieval would be equally important to our users, then maybe keeping it in the same repo as Pyserini guarantees better feature parity (e.g., a new retriever would be directly available for reranking too)."

I think retrieval is always going to be important to users (even in our pipelines), but at the end of the day, they might not use Pyserini for it. Academically, we do need these coupled well, and I'm sure the community will use it. Practically, I think people will mostly use something like LangChain/LlamaIndex/Vespa, which can interface with RankLLM to form a multi-stage system that way. At least that's what I think.

ronakice commented 2 months ago

Yup, I am not sure it is worse than Approach 2; it just carries some of the prerequisite baggage of Approach 1, which makes Approach 2 a bit annoying. I would say there's a lot to be done to make it easier and more accessible to use, simplifying some workflows, improving consistency, etc., but those can be worked on.

sahel-sh commented 2 months ago

I agree with @ronakice: as a retriever, I think hybrid search via LangChain, or even simply BM25 via LangChain, is as popular as Pyserini, if not more so. Decoupling RankLLM from Pyserini might increase its usability.

lintool commented 2 months ago

Seems like we're leaning toward Approach 2. I concur with this decision.