castorini / umbrela

Feature specs discussion board for umBRELA #1

Open UShivani3 opened 4 months ago

UShivani3 commented 4 months ago

I am starting this thread for feature spec discussion for umBRELA @lintool @ronakice.

Suggestions from my side:

ronakice commented 4 months ago

@UShivani3 can you give some demo usage here so @lintool is aware of the specifics of the framework so far? Some snippets.

UShivani3 commented 4 months ago

Yes, my bad!

Here are the snippets:

Setting up the model judge:

from umbrela.vicuna_judge import VicunaJudge

judge_vicuna = VicunaJudge("dl19-passage")

Passing qrel passages for evaluation:

input_dict = {
    "query": {"text": "how long is life cycle of flea", "qid": "264014"},
    "candidates": [
        {
            "doc": {
                "segment": "The life cycle of a flea can last anywhere from 20 days to an entire year. It depends on how long the flea remains in the dormant stage (eggs, larvae, pupa). Outside influences, such as weather, affect the flea cycle. A female flea can lay around 20 to 25 eggs in one day."
            },
            "docid": "4834547",
            "score": 14.971799850463867,
        },
    ]
}

judgments = judge_vicuna.judge(input_dict)

Output format for each judgment:

judgment = {
    "model": model_name,
    "query": query,
    "passage": passage,
    "prompt": prompt,
    "prediction": model_response,
    "judgment": relevance_label_after_parsing_model_response,
}
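
For downstream use, here is a quick sketch of consuming the output. It assumes judge() returns a list of these judgment dicts, one per candidate; the exact return shape is an assumption on my part, not a confirmed API.

# Sketch: assumes judge() returns a list of judgment dicts (one per candidate).
for judgment in judgments:
    # The "judgment" key holds the relevance label parsed from the model response.
    print(judgment["query"], judgment["judgment"])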

I have also added sample code using the OSLLMJudge class here: https://github.com/castorini/umbrela/blob/main/src/eval/test.py.

ronakice commented 4 months ago

@thakur-nandan can you give your thoughts on the design so far too?

thakur-nandan commented 4 months ago

Sure, thanks @UShivani3. Overall, I like the minimalistic code and the easy-to-use repository design. Both prompts look good. The installation instructions in the README are helpful.

One suggestion I have is to decouple the prompts from the LLM judge code. Otherwise the design will get complicated over time, since one would need to keep updating the base LLMJudge with newer prompts, as shown here: https://github.com/castorini/umbrela/blob/05ae4262d11849c5fb311c4043866e77b9140376/src/umbrela/llm_judge.py#L21

How I think we can restructure the design:
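
Roughly (the names below are placeholders to illustrate the direction, not a concrete API proposal): prompts could live in their own template objects and be injected into the judge, so the base LLMJudge never hardcodes prompt text.

# Placeholder sketch: prompts live in their own template objects and are
# injected into the judge, instead of being hardcoded in LLMJudge.
class PromptTemplate:
    def __init__(self, template: str):
        self.template = template

    def render(self, query: str, passage: str) -> str:
        return self.template.format(query=query, passage=passage)

class LLMJudge:
    def __init__(self, qrel: str, prompt: PromptTemplate):
        self.qrel = qrel
        # Adding a new prompt now means adding a new template,
        # not editing this base class.
        self.prompt = prompt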

@ronakice @UShivani3 I would be happy to hear your suggestions.

thakur-nandan commented 4 months ago

One more question: @UShivani3 what does the score in the input_dict signify? Is this a retrieval/reranking score?

Does it affect the LLMJudge response?