allenai / ScienceWorld

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
https://sciworld.apps.allenai.org/
Apache License 2.0

Disallow new steps/actions if the isCompleted flag is set #31

Open PeterAJansen opened 1 year ago

PeterAJansen commented 1 year ago

Certain conditions cause the isCompleted flag to be set, for example a negative score in the task, which signifies task failure:

https://github.com/allenai/ScienceWorld/blob/10dd21a483aca06bbd591a5465353f719dfaffc3/scienceworld/scienceworld.py#L328

Right now we set the isCompleted flag to true, but if an agent doesn't check for it, the agent can still continue sending commands to the environment. There is a report that this might let the tasks be gamed (especially the forced-choice tasks), as the agent might be able to take steps that ultimately further increase the score. We should likely modify the step() code so that once the isCompleted flag is set, no further steps are processed, to prevent agents from reporting erroneously high scores in the future.
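As a rough illustration of the proposed guard, here is a minimal sketch. It is not the actual ScienceWorld code: `GuardedEnv`, `ToyEnv`, `EpisodeCompletedError`, the `(observation, score, is_completed)` return shape, and the toy score values are all assumptions made for the example.

```python
class EpisodeCompletedError(RuntimeError):
    """Raised when a step is attempted after the episode has completed."""


class ToyEnv:
    """Stand-in environment (hypothetical): the action 'fail' ends the task."""

    def __init__(self):
        self.score = 0

    def reset(self):
        self.score = 0
        return "start"

    def step(self, action):
        if action == "fail":
            self.score = -1  # toy value standing in for a task failure
            return "task failed", self.score, True
        self.score += 1
        return "ok", self.score, False


class GuardedEnv:
    """Hypothetical wrapper that refuses steps once the episode is completed."""

    def __init__(self, env):
        self.env = env
        self.is_completed = False

    def reset(self):
        # A fresh episode clears the completed state.
        self.is_completed = False
        return self.env.reset()

    def step(self, action):
        if self.is_completed:
            raise EpisodeCompletedError(
                "Episode is completed; call reset() before stepping again."
            )
        observation, score, is_completed = self.env.step(action)
        self.is_completed = is_completed
        return observation, score, is_completed
```

With this kind of guard, an agent that ignores the flag gets an error instead of quietly accumulating score after a failure.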

MarcCote commented 1 year ago

Should that be handled on the server side?

PeterAJansen commented 1 year ago

It sounds like a good idea to handle it on the server side.

It's also been suggested that we make this a flag that can be set, so folks can choose whether to continue running after task failure -- that should be easy enough to add, as long as we make sure that folks report this in their papers, and add some very obvious output somewhere to remind folks which mode they're using. :)
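A sketch of what such a flag might look like, with a loud reminder when the permissive mode is on. Everything here is hypothetical (the `allow_steps_after_completion` parameter, the `ConfigurableEnv` and `ToyEnv` classes, and the toy scores); it only illustrates the idea and is not the ScienceWorld API.

```python
import warnings


class ToyEnv:
    """Stand-in environment (hypothetical): the action 'fail' ends the task."""

    def __init__(self):
        self.score = 0

    def step(self, action):
        if action == "fail":
            self.score = -1  # toy value standing in for a task failure
            return "task failed", self.score, True
        self.score += 1
        return "ok", self.score, False


class ConfigurableEnv:
    """Hypothetical wrapper with an opt-in 'keep running after completion' mode."""

    def __init__(self, env, allow_steps_after_completion=False):
        self.env = env
        self.allow = allow_steps_after_completion
        self.is_completed = False
        self.last_score = 0
        if self.allow:
            # Very obvious output so the mode is hard to miss.
            warnings.warn(
                "Steps after task completion are ALLOWED; scores are not "
                "comparable to the default mode. Report this in your paper.",
                UserWarning,
            )

    def step(self, action):
        if self.is_completed and not self.allow:
            # Default mode: ignore the action and keep the score frozen.
            return "Episode completed; action ignored.", self.last_score, True
        observation, score, done = self.env.step(action)
        self.last_score = score
        # Once completed, stay completed, even in the permissive mode.
        self.is_completed = self.is_completed or done
        return observation, score, self.is_completed
```

Defaulting the flag to off keeps the strict behaviour for anyone who doesn't opt in explicitly.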

MarcCote commented 1 year ago

That seems weird. What is the need for ignoring task failure? Do they want a no-goal variation, i.e., just an exploration mode?

yuchenlin commented 1 year ago

Hey Marc and Peter,

I actually really want the feature for "ignoring task failure" so that we can have an env that enables agents to learn from their failures and acquire the knowledge to avoid such failures. I can manually set this to False, but it seems that I cannot change the negative score (-1) back to its original value on the server side, even though the game can be continued. Any suggestions?

More ideas on this issue: I think there are two types of task failure: 1) actions that cannot be reversed, which do need to be disallowed, and 2) failures whose cause is unclear. Say, in the task of boiling something, I found that if the agent picks up X so that X is in its inventory, and then tries to focus on X, it gets a negative score and the task ends. This might be a bit too strict for evaluation.

Thank you very much! :D

PeterAJansen commented 1 year ago

The issue with focusing is that it's a critical part of the evaluation for most tasks, and that removing the hard criterion that the agent has to focus on the right thing would allow an agent to quickly game the tasks. For example, if the task is to measure something, and focus on box A if it's greater than some threshold, and focus on box B if it's less than some threshold, the agents will quickly learn to just focus on box A then B and get 100% task performance in two steps. Adding the focus mechanism was a method of (a) ensuring that the actions are intentional, and (b) providing a method of scoring that doesn't rely on natural language generation, but instead uses an analog of a forced choice task. If you remove the failure of the forced choice task, then it's like being able to select every multiple choice answer on a multiple choice test. :)
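To make the forced-choice point concrete, here is a toy scorer (purely illustrative; `score_episode` and its score values are made up and are not how ScienceWorld computes scores). It compares the "focus on A, then B" policy with and without the hard failure on a wrong focus.

```python
def score_episode(focus_actions, correct_box, end_on_wrong_focus=True):
    """Toy scorer for a two-box forced-choice task (hypothetical).

    Returns 1 if the agent focuses on the correct box, -1 if a wrong
    focus ends the episode first, and 0 if it never focuses correctly.
    """
    for box in focus_actions:
        if box == correct_box:
            return 1
        if end_on_wrong_focus:
            # Hard criterion: the first wrong focus fails the task.
            return -1
    return 0


# A policy that never measures anything and just tries both boxes in order.
guess_both = ["A", "B"]

for answer in ("A", "B"):
    strict = score_episode(guess_both, answer, end_on_wrong_focus=True)
    lenient = score_episode(guess_both, answer, end_on_wrong_focus=False)
    print(f"correct={answer} strict={strict} lenient={lenient}")
```

Under the strict rule the guessing policy fails whenever the answer is B; under the lenient rule it gets full marks every time without doing the measurement.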

yuchenlin commented 1 year ago

Thank you very much for the explanations, Peter! :D Really appreciate it.

MarcCote commented 1 year ago

> Say, in the task of boiling something, I found that if the agent picks up X so that X is in its inventory, and then tries to focus on X, it gets a negative score and the task ends. This might be a bit too strict for evaluation.

That seems like an issue. The agent should be able to focus on the task object even if it is already in its inventory.

PeterAJansen commented 1 year ago

Hmm, I think in this case the object it's focusing on isn't a task object, right @yuchenlin? Could you copy/paste the playthrough, so we can figure out if there's an issue?