castorini / umbrela

Feature specs discussion board for umBRELA #1

Open UShivani3 opened 4 months ago

UShivani3 commented 4 months ago

I am starting this thread for feature spec discussion for umBRELA @lintool @ronakice.

Suggestions from my side:

ronakice commented 4 months ago

@UShivani3 can you give some demo usage here so @lintool is aware of the specifics of the framework so far? Some snippets.

UShivani3 commented 4 months ago

Yes, my bad!

Here are the snippets:

Setting up the model judge:

from umbrela.vicuna_judge import VicunaJudge

judge_vicuna = VicunaJudge("dl19-passage")

Passing qrel passages for evaluation:

input_dict = {
    "query": {"text": "how long is life cycle of flea", "qid": "264014"},
    "candidates": [
        {
            "doc": {
                "segment": "The life cycle of a flea can last anywhere from 20 days to an entire year. It depends on how long the flea remains in the dormant stage (eggs, larvae, pupa). Outside influences, such as weather, affect the flea cycle. A female flea can lay around 20 to 25 eggs in one day."
            },
            "docid": "4834547",
            "score": 14.971799850463867,
        },
    ]
}

judgments = judge_vicuna.judge(input_dict)

Output format for each judgment:

judgment = {
    "model": model_name,
    "query": query,
    "passage": passage,
    "prompt": prompt,
    "prediction": model_response,
    "judgment": relevance_label_after_parsing_model_response,
}
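
For downstream use, here is a quick sketch of consuming the output. It assumes judge() returns a list of these judgment dicts, one per candidate; the exact return shape is an assumption on my part, not a confirmed API.

# Sketch: assumes judge() returns a list of judgment dicts (one per candidate).
for judgment in judgments:
    # The "judgment" key holds the relevance label parsed from the model response.
    print(judgment["query"], judgment["judgment"])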

I have also added sample code using the OSLLMJudge class here: https://github.com/castorini/umbrela/blob/main/src/eval/test.py.

ronakice commented 4 months ago

@thakur-nandan can you give your thoughts on the design so far too?

thakur-nandan commented 4 months ago

Sure, thanks @UShivani3. Overall, I like the minimalistic code and the easy-to-use repository design. Both prompts look good. The installation instructions in the README are helpful.

One suggestion I have is to decouple the prompts from the LLM judge code. Otherwise the design will get complicated over time, since one would need to keep updating the base LLMJudge with newer prompts, as shown here: https://github.com/castorini/umbrela/blob/05ae4262d11849c5fb311c4043866e77b9140376/src/umbrela/llm_judge.py#L21

How I think we can restructure the design:
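
Roughly (the names below are placeholders to illustrate the direction, not a concrete API proposal): prompts could live in their own template objects and be injected into the judge, so the base LLMJudge never hardcodes prompt text.

# Placeholder sketch: prompts live in their own template objects and are
# injected into the judge, instead of being hardcoded in LLMJudge.
class PromptTemplate:
    def __init__(self, template: str):
        self.template = template

    def render(self, query: str, passage: str) -> str:
        return self.template.format(query=query, passage=passage)

class LLMJudge:
    def __init__(self, qrel: str, prompt: PromptTemplate):
        self.qrel = qrel
        # Adding a new prompt now means adding a new template,
        # not editing this base class.
        self.prompt = prompt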

@ronakice @UShivani3 I would be happy to hear your suggestions.

thakur-nandan commented 4 months ago

One more question: @UShivani3 what does the score in the input_dict signify? Is this a retrieval/reranking score?

Does it affect the LLMJudge response?