allenai / ScienceWorld

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
https://sciworld.apps.allenai.org/
Apache License 2.0

[Question] Train/Test episodes for goal-oriented agents #53

Closed: yukioichida closed this issue 9 months ago

yukioichida commented 9 months ago

Hello,

I have a question with respect to the train and dev/test variations.

If I understood correctly, train variations have goals that differ from the dev and test variations, in order to measure generalization beyond the training knowledge. However, some goals require specific knowledge for which even humans depend on prior information about the element mentioned in the task. For instance, melting an unknown element might involve taking a different container, and the agent would have to execute a plan that it never manipulated/saw in the training variations.

Measuring out-of-distribution generalization with respect to objects and locations in the environment (e.g. objects in different rooms, different starting points) is an important task, but I am curious about evaluating agents on unseen goals. I am probably missing some point of your work, since I research goal-driven agents and I know that challenging environments are a large contribution to reasoning agents.

PeterAJansen commented 9 months ago

Thanks for your question!

The train, development, and test variations typically vary along a few dimensions: (a) the task-critical objects, (b) the agent's starting location, and (c) the specific non-critical items the environment is populated with (e.g. bookshelves, furniture, etc.).

Generally, (b) and (c) help make sure the agent is robust to small changes in the environment with respect to its ability to perform basic tasks like navigation and finding objects, while being resistant to different kinds of distractors.

For varying the task-critical objects (a), this typically looks like varying the specific items that the agent has to use to solve a task. For change-of-state tasks, this might be requiring the agent to freeze water in a training variation, and apple juice in a test variation. For the electrical circuits, it might be powering different kinds of components, or with different forms of energy. For the genetics tasks, it might be determining if different traits are dominant or recessive. For the inclined plane tasks, it might be determining whether different surfaces have more or less friction than one another.

In this way, in general, between train, development, and test variations, the high-level procedures tend to be very similar -- so you shouldn't notice any large differences, or training on wildly out-of-distribution procedures. Some of the change-of-state tasks do ablate some of the equipment in different variations (e.g. the stove might not work), and so the agent will have to find an alternate heating/cooling source, but that's about as difficult as the differences get between train/dev/test sets.
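To make this concrete, here is a minimal sketch of how one might inspect the goal text across the splits, assuming the Python API shown in the repository README (`ScienceWorldEnv`, `load`, `getVariationsTrain`/`getVariationsDev`/`getVariationsTest`, `getTaskDescription`); the exact signatures may differ slightly, and the task name used here is only illustrative:

```python
from scienceworld import ScienceWorldEnv

# Sketch only: constructor/method names are assumed from the repository README.
env = ScienceWorldEnv("", envStepLimit=100)

task = "freeze"          # illustrative task name; see env.getTaskNames() for the real list
env.load(task, 0, "")    # load any variation first so the split indices can be queried

splits = {
    "train": env.getVariationsTrain(),
    "dev":   env.getVariationsDev(),
    "test":  env.getVariationsTest(),
}

# Compare the goal text of the first few variations in each split.
for split_name, variation_indices in splits.items():
    for var_idx in variation_indices[:3]:
        env.load(task, var_idx, "")
        print(split_name, var_idx, env.getTaskDescription())
```

Printing the descriptions side by side should show the kind of variation described above (e.g. freezing water in one split and apple juice in another) rather than wholly different procedures.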

If you have more specific questions about differences, just let us know!

yukioichida commented 9 months ago

Hi @PeterAJansen, indeed the goals in train and dev/test are similar, and my question concerns how the agent can achieve the goal state without prior knowledge about it. In my current work, I found a limitation in dealing with unseen goals, since ScienceWorld provides dev/test goals that are similar to the goals seen in the training variations but not exactly the same. Perhaps we need to investigate how our approach can generalize beyond the task description and reason more about such similarities.

Thank you for your reply; it was very helpful. I will close this issue.

PeterAJansen commented 9 months ago

Oh, I think I see the deeper question. The agent should be able to use existing knowledge of elementary-school science and common-sense reasoning to interact with the environment and reach the goal states (i.e., knowledge of elementary science is requisite prior knowledge). If an agent doesn't have this background (e.g. a tabula rasa agent), it might be hard, though it's definitely possible that it could learn these concepts (e.g. what changes of state are) just from interaction with the environment. That may take a very large number of steps, though.
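As a rough illustration of what learning purely from interaction looks like at the API level, here is a sketch of a random-policy loop; the `reset`/`step` return shapes and the `getValidActionObjectCombinations` call are assumptions based on the repository README and baseline agents:

```python
import random

from scienceworld import ScienceWorldEnv

# Sketch only: API names and return shapes are assumed from the repository README.
env = ScienceWorldEnv("", envStepLimit=100)
env.load("boil", 0, "")            # "boil" is illustrative; see env.getTaskNames()
obs, info = env.reset()

for _ in range(100):
    # A tabula rasa agent: sample uniformly from the simulator's valid actions
    # (assumed to be exposed via getValidActionObjectCombinations() as strings).
    valid_actions = env.getValidActionObjectCombinations()
    action = random.choice(valid_actions)
    obs, reward, done, info = env.step(action)
    if done:
        break
```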

For the tasks where the agent interacts with unknown things (e.g. measuring the properties of unknown substances), these are always explicitly labeled so it's very clear what specific object the agent should use (e.g. "Unknown Substance A"). These unknown substance tasks are largely paired with other tasks that have named substances (e.g. "water"), to test whether the agent understands the procedure (e.g. how to measure the boiling point of a substance), or whether they're just retrieving that knowledge from their voluminous knowledge bases.
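For reference, a small sketch of listing the paired known/unknown-substance tasks and comparing their goal text; the exact task-name strings are assumptions here, so the sketch filters `getTaskNames()` by substring instead of hard-coding them:

```python
from scienceworld import ScienceWorldEnv

# Sketch only: API names are assumed from the repository README; task names are
# matched by substring because the exact identifiers may differ.
env = ScienceWorldEnv("", envStepLimit=100)

for name in env.getTaskNames():
    # "unknown-substance" also contains "known-substance", so this picks up
    # both halves of each known/unknown pair (if the names follow that pattern).
    if "known-substance" in name:
        env.load(name, 0, "")
        print(name)
        print("   ", env.getTaskDescription())
```

The descriptions of the unknown variants should name the labeled target explicitly (e.g. "unknown substance B"), matching the pairing described above.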

yukioichida commented 9 months ago

> Oh, I think I see the deeper question. The agent should be able to use existing knowledge of elementary-school science and common-sense reasoning to interact with the environment and reach the goal states (i.e., knowledge of elementary science is requisite prior knowledge). If an agent doesn't have this background (e.g. a tabula rasa agent), it might be hard, though it's definitely possible that it could learn these concepts (e.g. what changes of state are) just from interaction with the environment. That may take a very large number of steps, though.

Exactly. The goals might be similar, and I totally agree that agents should generalize beyond the text of the task description and understand the semantics of natural language. However, the goal state is not necessarily the same between train and test variations (e.g. "Your task is to boil water" != "Your task is to boil gallium"), and hence it becomes more challenging to deal with such tasks properly, since the agent lacks prior knowledge about the test-variation goal.

> For the tasks where the agent interacts with unknown things (e.g. measuring the properties of unknown substances), these are always explicitly labeled so it's very clear what specific object the agent should use (e.g. "Unknown Substance A"). These unknown substance tasks are largely paired with other tasks that have named substances (e.g. "water"), to test whether the agent understands the procedure (e.g. how to measure the boiling point of a substance), or whether they're just retrieving that knowledge from their voluminous knowledge bases.

Thanks for this explanation. Currently we focus only on the boiling, melting, and change-of-state-of-matter tasks, since most approaches obtain their lowest scores on them. We plan to look further into all of the other ScienceWorld tasks in the future, to handle unknown substances and reason over them.