allenai / ScienceWorld

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
https://sciworld.apps.allenai.org/
Apache License 2.0
199 stars 24 forks

Deterministic solution judging #71

Closed minhphd closed 3 months ago

minhphd commented 3 months ago

I have run into some issues. The dataset appears to be designed to benchmark the reasoning and problem-solving abilities of large language model agents, but its grading currently seems inflexible and unfair. The score distribution in the simulation looks imbalanced, and solutions are judged against a fixed, deterministic sequence rather than by more general criteria that would accept any appropriate solution. For example:

Task: Your task is to change the state of matter of water. First, focus on the substance. Then, take actions that will cause it to change its state of matter.

Solution: ['', 'teleport to kitchen', 'open cupboard', 'look in cupboard', 'pick up metal pot', 'move metal pot to sink', 'activate sink', 'deactivate sink', 'move metal pot to stove', 'activate stove', 'wait', 'wait', 'wait', 'deactivate stove', 'look at metal pot', 'focus on steam']

This solution boils the water correctly, but it never scored more than 5 points and ended at -100 points after the focus on action. Furthermore, the examine action is not properly documented in the paper. The tasks are also not clearly described: they seem to be aimed more at measurement and experimentation than at simply solving the task.

PeterAJansen commented 3 months ago

Hi @minhphd , thanks for your issue report.

I think the issue here is not with ScienceWorld, but with the agent not following the task instructions:

Task: Your task is to change the state of matter of water. **First, focus on the substance.** Then, take actions that will cause it to change its state of matter.

If the agent focuses on the water early in its trajectory (e.g. focus on water), the scorer attaches to that object and can monitor what happens to the water, including its transition from a liquid to a gas. The trajectory you've provided doesn't perform the focus action until the end, so the scorer is not attached to any object until then. When it is finally attached, the object is already in the wrong state (focus on steam), so the scorer marks the attempt as incorrect (-100).
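The attachment behaviour described above can be illustrated with a toy model. This is only a sketch, not ScienceWorld's scorer: the `Substance` and `ToyScorer` classes, the `run` helper, the `boil` and `focus` action names, and the 0/50/100 score values are all hypothetical. Only the -100 penalty for focusing on an object in the wrong state is taken from this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Substance:
    name: str
    state: str  # "liquid" or "gas" in this toy model


class ToyScorer:
    """Hypothetical stand-in for a focus-attached scorer (not ScienceWorld code)."""

    def __init__(self, start_state: str, target_state: str) -> None:
        self.start_state = start_state
        self.target_state = target_state
        self.tracked: Optional[Substance] = None
        self.failed = False

    def focus(self, obj: Substance) -> None:
        # Attach to the object; focusing on something that is not in the
        # expected starting state counts as an incorrect focus.
        self.tracked = obj
        if obj.state != self.start_state:
            self.failed = True

    def score(self) -> int:
        if self.failed:
            return -100  # penalty value mentioned in the thread
        if self.tracked is None:
            return 0  # nothing is being monitored (toy value)
        # 100 and 50 are arbitrary toy values for "done" and "in progress".
        return 100 if self.tracked.state == self.target_state else 50


def run(actions: List[str]) -> int:
    """Replay a list of toy actions and return the final toy score."""
    water = Substance("water", "liquid")
    scorer = ToyScorer(start_state="liquid", target_state="gas")
    for action in actions:
        if action == "boil":
            water.name, water.state = "steam", "gas"
        elif action == "focus":
            scorer.focus(water)
    return scorer.score()


early_focus = run(["focus", "boil"])  # scorer watches the liquid become a gas
late_focus = run(["boil", "focus"])   # scorer first sees an object already in the gas state
print(early_focus, late_focus)
```

In this toy model, the only difference between the two trajectories is when the focus action happens, which mirrors the difference between the submitted trajectory and one that follows the task instructions.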

To your broader concern about open-ended solution grading, the grading schemes were explicitly crafted to allow as wide a range of solutions as we could imagine. For example, the grading scorecard for this task is:

Goal sequence progress: 
Completed keys: 
----------------------------------------------------------------------------------------------------
Sequential Subgoals:
----------------------------------------------------------------------------------------------------
0   false                                   GoalFind    focus on substance
1   false                    GoalChangeStateOfMatter    substance is in a liquid state
2   false                    GoalChangeStateOfMatter    substance is in a gaseous state (or combusting)
----------------------------------------------------------------------------------------------------
Unordered and Optional Subgoals:
----------------------------------------------------------------------------------------------------
0   false                       GoalInRoomWithObject    be in same location as water
1   false               GoalObjectsInSingleContainer    have substance alone in a single container
2   false                 GoalActivateDeviceWithName    activate heater (stove)
3   false                 GoalActivateDeviceWithName    activate heater (blast furnace)
4   false                 GoalActivateDeviceWithName    activate heater (oven)
5   false                 GoalActivateDeviceWithName    activate heater (hot plate)
6   false        GoalSpecificObjectInDirectContainer    have lighter in inventory
7   false        GoalSpecificObjectInDirectContainer    move wood into fire pit
8   false                      GoalTemperatureOnFire    ignite wood
9   false                      GoalObjectInContainer    have object on heater (stove)
10  false                      GoalObjectInContainer    have object on heater (blast furnace)
11  false                      GoalObjectInContainer    have object on heater (oven)
12  false                      GoalObjectInContainer    have object on heater (hot plate)
13  false                      GoalObjectInContainer    have object on heater (fire pit)
14  false                    GoalTemperatureIncrease    heat object by at least 20C
----------------------------------------------------------------------------------------------------

This scorecard is agnostic to how the substance is heated and to what container it's in (indeed, one could even create firewood with the axe, build a campfire in the backyard, light the campfire with the lighter, and boil the water on that, and it should still work). Similarly, in early testing, the agent could focus on water and accidentally (using the lighter) burn the house down, and the scorecard would still mark this as successfully completing the task (though certainly non-traditionally, and I'd argue using a non-preferred solution). The scorecards can be accessed through the Python API using the get_goal_progress() function: https://github.com/allenai/ScienceWorld/blob/35a9500210eac78097d621eac90dbd6b90b35a6d/scienceworld/scienceworld.py#L441C9-L441C26
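For anyone who wants to inspect a scorecard programmatically, here is a minimal sketch that parses scorecard text into structured rows. It assumes the text follows the layout shown in this thread (an index, a true/false flag, a goal type, and a description, under headings ending in "Subgoals:"). The `parse_scorecard` helper and the `SAMPLE` string are hypothetical and are not part of the ScienceWorld API; in practice the input would be whatever get_goal_progress() returns, assuming it matches this layout.

```python
import re
from typing import Dict, List

# One scorecard row: index, completion flag, goal type, free-text description.
ROW = re.compile(r"^(\d+)\s+(true|false)\s+(\S+)\s+(.+)$")


def parse_scorecard(text: str) -> Dict[str, List[dict]]:
    """Group scorecard rows under their section heading (hypothetical helper)."""
    sections: Dict[str, List[dict]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("Subgoals:"):
            current = line[:-1]  # drop the trailing colon
            sections[current] = []
            continue
        match = ROW.match(line)
        if match and current is not None:
            idx, done, goal_type, desc = match.groups()
            sections[current].append(
                {
                    "index": int(idx),
                    "done": done == "true",
                    "type": goal_type,
                    "description": desc,
                }
            )
    return sections


# Shortened sample copied from the scorecard layout in this thread.
SAMPLE = """\
Sequential Subgoals:
--------------------
0   false                   GoalFind    focus on substance
1   false    GoalChangeStateOfMatter    substance is in a liquid state
Unordered and Optional Subgoals:
--------------------
0   false       GoalInRoomWithObject    be in same location as water
"""

parsed = parse_scorecard(SAMPLE)
pending = [row["description"] for row in parsed["Sequential Subgoals"] if not row["done"]]
print(pending)
```

Listing the sequential subgoals that are still pending makes it easy to see that the first required step is the focus action.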

More generally, requiring the agent to point out the objects it intends to change through some scientific process (using the focus action) is not ideal, but it's not easy to come up with alternative scoring mechanisms given the fidelity of the environment. For example, there's not just one water object in ScienceWorld, but many instances (in different locations), and having the agent point out which one it's working with (like a human saying "I'm going to boil this water") makes measuring both success and intentionality much easier.

Does that help answer your question? Please let me know if you run into any other issues.

minhphd commented 3 months ago

Thank you! This answered my question. Thanks, I should have read more into the code base. This is a great dataset!