[Open] MattZ-99 opened this issue 1 week ago
Update: [bug report]
Now I'm sure the bug exists. Below are several samples whose evaluation scores are clearly wrong.
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0 -- Shanghai is checked twice and New York never, yet the score is perfect
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0 -- both expected calls are made, only their order differs from the reference, yet the score is zero
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0 -- the same kind of order swap as in the previous sample, but here it scores perfectly
sample = [
    HumanMessage(content="What's the weather like in New York?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
    HumanMessage(content="So what about Shanghai?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
scorer = ToolCallAccuracy()
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0 -- each turn calls the tool for the wrong city, yet the score is still perfect
Hey @MattZ-99, this is incredibly useful.
1) Parallel tool calls: I had also thought about this while building the tool-call metrics. In real-world scenarios, both parallel and sequential tool calls can occur within a single agent, and ideally the user should be able to evaluate both through a single interface.
2) The order of tool calls is important when the calls are not parallel, i.e., tool_Y might consume the output of tool_X, in which case it is important to ensure that tool_Y is only called after tool_X (see the sketch below). There are many other such scenarios.
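As a concrete illustration of point 2, a hedged sketch of such a sequential dependency, using the same sample schema as above; the tool name get_user_location and the "Seattle" argument are hypothetical and only for illustration:

# Sequential scenario: weather_check consumes get_user_location's output,
# so the order of the two calls must be preserved (unlike the parallel cases above).
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather like where I am right now?"),
        AIMessage(content="...", tool_calls=[
            ToolCall(name="get_user_location", args={}),
            ToolCall(name="weather_check", args={"location": "Seattle"})
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="get_user_location", args={}),
        ToolCall(name="weather_check", args={"location": "Seattle"})
    ]
)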
Hi all, it's a pleasant surprise to see the ToolUse metrics in Ragas v0.2. Building on the base Metric class, I have already implemented several useful metrics of my own, and after this update to the ToolUse section I'd be glad to contribute them to the open-source project. As I'm new here, I'd like to have a discussion before opening a pull request.
Parallel function calling is now a common feature of current LLMs, where "parallel" means multiple independent/unordered function calls. For example, the Berkeley Function-Calling Leaderboard provides a specialized "parallel" category. The case below is a parallel tool call requesting two pieces of information.
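A minimal sketch of such a parallel case, reconstructed from the LA sample that appears earlier in this thread:

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
        AIMessage(content="...", tool_calls=[
            ToolCall(name="weather_check", args={"location": "Los Angeles"}),
            ToolCall(name="match_check", args={"location": "Los Angeles"})
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
# Either ordering of weather_check and match_check should be accepted as correct here.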
Both orderings of the tool calls should be acceptable, while the current ToolCallAccuracy only supports ordered tool_calls. That is, ToolCallAccuracy is not suitable for parallel calling. As for the solution, I have two ideas:
Which one do you think is more flexible?
Besides, I also have a small concern about the current ToolCallAccuracy. Since each predicted tool call should be checked against its aligned reference, the reference_tool_calls and pred_tool_calls should not be matched with a double loop; doing so results in the following inconsistency:
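As an illustration of the kind of inconsistency meant, a simplified sketch of double-loop matching; this is my own approximation, not the actual Ragas implementation, and naive_double_loop_score is a hypothetical helper:

def naive_double_loop_score(pred_tool_calls, reference_tool_calls):
    score = 0.0
    for ref in reference_tool_calls:
        for pred in pred_tool_calls:
            # Every matching prediction adds to the score, so a duplicated
            # prediction is counted once per identical reference comparison.
            if ref.name == pred.name and ref.args == pred.args:
                score += 1.0
    return score / len(reference_tool_calls)

# With pred = [Shanghai, Shanghai] and ref = [New York, Shanghai], the Shanghai
# reference is credited twice, so the total is 2 / 2 = 1.0 even though the New York
# call is missing -- the same behavior as the first sample in the update above.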