explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

Issue about tool call metrics. #1658

Open MattZ-99 opened 1 week ago

MattZ-99 commented 1 week ago

Hi all, it was a pleasant surprise to see the tool-use metrics arrive in Ragas v0.2.

I had already implemented several useful metrics of my own on top of the base Metric class, and now that the tool-use section has been added, I'd be glad to contribute them to the project.

As a newcomer here, I'd like to have a discussion before opening a pull request.

  1. Metrics for parallel tool calling.

Parallel function calling is now a common feature of current LLMs, where "parallel" means multiple independent, unordered function calls. For example, the Berkeley Function-Calling Leaderboard provides a dedicated "parallel" category.

The case below shows a parallel tool call for two pieces of information.

import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# Two independent tool calls serving one user request.
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)

Either ordering of the tool_calls should be acceptable here, but the current ToolCallAccuracy only supports one fixed order. That is, ToolCallAccuracy is not suitable for parallel calling.

scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

As for the solution, I have two ideas:

  1. Add another metric, 'ToolCallParallelAccuracy', which focuses on parallel tool calls.
  2. Add a new parameter to ToolCallAccuracy, 'ordered_tool_calls=True/False', indicating whether the tool calls must be ordered.

Which one do you think is more flexible?
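
For option 2, here is a rough sketch of the comparison logic (the ordered flag and this helper are hypothetical, not the current ragas API; argument scoring is simplified to exact match, and args are assumed to be flat dicts of hashable values):

from collections import Counter

def compare_tool_calls(pred, ref, ordered=True):
    # Flatten each tool call into a hashable (name, sorted-args) key.
    def key(tc):
        return (tc.name, tuple(sorted(tc.args.items())))

    if ordered:
        # Sequential case: the exact order must match.
        return float([key(t) for t in pred] == [key(t) for t in ref])
    # Parallel case: order is ignored, but multiplicity still counts.
    return float(Counter(map(key, pred)) == Counter(map(key, ref)))

With ordered=False, both orderings of the LA sample above would score 1.0, while a missing or duplicated call would still score 0.0.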

Besides this, I also have a small concern about the current ToolCallAccuracy.

# ref from _tool_call_accuracy:76~
# The predicted and reference name sequences are first checked for order...
sequence_aligned = int(
    self.is_sequence_aligned(tool_call_pred_sequence, tool_call_ref_sequence)
)

if pred_tool_calls:
    score = 0.0
    reference_tool_calls = sample.reference_tool_calls
    # ...but argument scores are then accumulated over every
    # (reference, prediction) pair rather than position by position.
    for ref_tool_call in reference_tool_calls:
        for pred_tool_call in pred_tool_calls:
            ...

Since the name sequences (tool_call_pred_sequence and tool_call_ref_sequence) are already checked for alignment, reference_tool_calls and pred_tool_calls should not then be compared through a double loop on top of that. This results in the following inconsistency:

sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0
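
One possible fix: once the name sequences have been checked for alignment, score the calls position by position instead of looping every reference against every prediction. A minimal sketch, with argument scoring simplified to exact match:

def pairwise_score(pred_tool_calls, reference_tool_calls):
    # Each prediction is scored only against the reference at the same
    # position, so swapped or duplicated calls cannot borrow credit.
    if len(pred_tool_calls) != len(reference_tool_calls):
        return 0.0
    matched = sum(
        p.name == r.name and p.args == r.args
        for p, r in zip(pred_tool_calls, reference_tool_calls)
    )
    return matched / len(reference_tool_calls)

Under this scoring, both samples above behave consistently (0.0 when the order is swapped), leaving order-insensitivity to an explicit parallel mode.
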
MattZ-99 commented 1 week ago

Update: [bug report]

Now I'm sure this is a bug. Below I list several samples whose evaluation scores are clearly wrong.

sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0
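
The 1.0 here comes straight from the double loop: every (reference, prediction) pair with a matching name contributes its argument score, so both predicted Shanghai calls are credited against the single Shanghai reference. A self-contained reproduction of the pattern (simplified exact-match argument scoring, not the exact ragas code):

def double_loop_score(pred, ref):
    # Sketch of the problematic pattern quoted earlier.
    score = 0.0
    for r in ref:
        for p in pred:
            if r.name == p.name:
                score += float(r.args == p.args)
    return score / len(ref)

pred = [ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})]
ref = [ToolCall(name="weather_check", args={"location": "New York"}),
       ToolCall(name="weather_check", args={"location": "Shanghai"})]
print(double_loop_score(pred, ref))  # 1.0 -- two hits on the Shanghai reference

The remaining samples below show the same machinery producing both false negatives and false positives.
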
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0
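
The same pattern explains this 1.0: the name-sequence alignment check passes because both sequences are just two weather_check calls, and in the double loop each reference finds a same-args prediction somewhere in the list. With the double_loop_score sketch from above:

pred = [ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})]
ref = [ToolCall(name="weather_check", args={"location": "New York"}),
       ToolCall(name="weather_check", args={"location": "Shanghai"})]
print(double_loop_score(pred, ref))  # 1.0 despite the swapped order
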
sample = [
    HumanMessage(content="What's the weather like in New York?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
    HumanMessage(content="So what about Shanghai?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]

sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0
shahules786 commented 1 week ago

Hey @MattZ-99, this is incredibly useful.

  1. Parallel tool calls: I had also thought about this while building the tool call metrics. In real-world scenarios, both parallel and sequential tool calls can occur within a single agent, so ideally the user should be able to evaluate both through a single interface.
  2. The order of tool calls is important when they are not parallel: tool_Y might consume the output of tool_X, in which case it is essential that tool_Y is only called after tool_X. There are many other such scenarios.
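
As a concrete illustration of point 2, here is a sketch of a sequential sample where the order must be enforced (the book_ticket tool and its arguments are made up for illustration; ToolMessage is from ragas.messages):

from ragas.messages import ToolMessage

# book_ticket consumes the output of match_check, so the reference
# order below is meaningful and must not be relaxed to a multiset.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book me a ticket for the next LA match."),
        AIMessage(content="...", tool_calls=[
            ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ]),
        ToolMessage(content="Next match: match_id=LA-123"),
        AIMessage(content="...", tool_calls=[
            ToolCall(name="book_ticket", args={"match_id": "LA-123"}),
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="book_ticket", args={"match_id": "LA-123"}),
    ],
)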