allenai / ScienceWorld

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.
https://sciworld.apps.allenai.org/
Apache License 2.0
213 stars 26 forks source link

Fix some non-determinism #15

Closed aphedges closed 2 years ago

aphedges commented 2 years ago

While using ScienceWorld, I discovered multiple places where the program output was non-deterministic, even though the seed was set. This made my testing more difficult because each run could give a different output, even when given the same input, producing very large diffs. All problems I found were caused by the fact that Scala's and Python's set implementations have non-deterministic iteration orders. I did not do an exhaustive search, but I fixed all the cases I found during my testing. My solution was either to use a list if set access isn't needed or to sort the set contents before iteration. If this kind of contribution is welcome, I might make future PRs if other instances cause me problems.

While I was at it, I made various fixes and improvements to building the JAR file, such as fixing a bug in the package.sh and resolving some sbt warnings. I updated the JAR itself in a separate commit in case you want to drop the commit and replace it with your own. I personally wouldn't trust a JAR file from a random person on the internet, so I completely understand if you don't, either.

All code I used for testing is in test_files.zip. It includes a test script (test_nondeterminism.py), console output before and after my fixes (output_before.txt and output_after.txt) and JSON output of actions to make non-determinism easier to spot (actions_sorted.json and actions_unsorted.json).

PeterAJansen commented 2 years ago

Hi Alex,

Thank you for this contribution!

I have worked a bunch to iron out the non-determinism in the environment (the simulator was originally designed to have a lot of random variability in it, so there has been a bit of work in making it deterministic. I think we have ironed out 99%+ of the issues).

For the two main sources of non-determinism in this commit (the order of the output text, e.g. "you see an apple and an orange" vs "you see an orange and an apple"), and the valid action ordering (e.g. ["eat apple", "eat orange"] vs ["eat orange", "eat apple"]), I intentionally kept these in to help encourage model robustness. But that was a decision I made entirely on my own and I'm very happy to be convinced it's better to be deterministic. I just worry about unintended consequences -- for example, I discovered an issue with another popular text environment with consistent deterministic output ordering/sorting, where the environment could be trivially solved by taking (something like) the 3rd action from the start then the 15th action from the end each time, and I'm hoping to avoid those issues as much as possible.

If the determinism in the output and valid action ordering are something we want in ScienceWorld, instead of sorting the lists, the technique I've been sometimes using is to shuffle the list using a random generator that's initialized on a seed that is the variation number for that environment. That way things are still repeatable, but they're not shuffled in alphabetical order, which might cause other opportunities for models to hijack regularities in the input. That might be an option here, too. The issue is that it's just slightly more complex, since the task variation/random generator/etc. isn't exposed all the way down to the object level, where a lot of this generation happens.

aphedges commented 2 years ago

I appreciate that you took the time to respond! Sorry for taking so long to get back to you. I've been focused on other tasks.

For the two main sources of non-determinism in this commit (the order of the output text, e.g. "you see an apple and an orange" vs "you see an orange and an apple"), and the valid action ordering (e.g. ["eat apple", "eat orange"] vs ["eat orange", "eat apple"]), I intentionally kept these in to help encourage model robustness. But that was a decision I made entirely on my own and I'm very happy to be convinced it's better to be deterministic. I just worry about unintended consequences -- for example, I discovered an issue with another popular text environment with consistent deterministic output ordering/sorting, where the environment could be trivially solved by taking (something like) the 3rd action from the start then the 15th action from the end each time, and I'm hoping to avoid those issues as much as possible.

This makes a lot of sense. In the models I've been working with, I saw one run that simply went down the list of actions in order for successive turns. If we used that kind of ordering in training, I feel it can definitely cause some problems with model robustness.

However, I think it might make sense to have a distinction between text output and structured output from the ScienceWorld API. The output text is likely not being parsed and is being passed into a model, so it makes more sense to have it randomized. The structured output, such as the list of actions, can be easily sorted or shuffled by an end user as they see fit. Before I made this PR, I just used a sorted() call on the output for consistency. I understand if this is something you don't want to take care of.

If the determinism in the output and valid action ordering are something we want in ScienceWorld, instead of sorting the lists, the technique I've been sometimes using is to shuffle the list using a random generator that's initialized on a seed that is the variation number for that environment. That way things are still repeatable, but they're not shuffled in alphabetical order, which might cause other opportunities for models to hijack regularities in the input. That might be an option here, too. The issue is that it's just slightly more complex, since the task variation/random generator/etc. isn't exposed all the way down to the object level, where a lot of this generation happens.

This sounds like a good compromise between determinism and robustness, but sorting is still needed for shuffling. For example, see the following: https://github.com/allenai/ScienceWorld/blob/22a70ce40e23ac9ca28e03f5b6f176a2922a8f50/simulator/src/main/scala/scienceworld/tasks/specifictasks/TaskFindLivingNonLiving.scala#L199

objsInRoom is an unordered set, so even having a random seed set produces a different output on each run because of differences in Random.shuffle's input.

If the solution is difficult, I completely understand if that won't be done.


I have two questions about moving this PR forward:

MarcCote commented 2 years ago

@aphedges I just merged your other PR with the build changes. Can you rebase your branch on the main one?

MarcCote commented 2 years ago

Also, I'm all for adding more determinism in the game generation and to control it with a dedicated RNG (like in TextWorld ;)).

aphedges commented 2 years ago

@MarcCote, thanks! As part of the rebase, I'll drop some of the nondeterminism fixes until later. We (at ISI East) have found it useful so far and don't want to eliminate it until it can be controlled by a seed.

aphedges commented 2 years ago

@MarcCote, it should be ready to merge

MarcCote commented 2 years ago

@aphedges can you open a separate PR or point me to the rest of the nondeterministic fixes.

aphedges commented 2 years ago

@MarcCote, sure! I'll need to write up some stuff about them first, so I don't expect to get to them for a week or two.

MarcCote commented 2 years ago

@PeterAJansen are you okay with adding the .sorted on the server side? It would had a bit of overhead on each .reset and .step calls. I'm thinking that this would impact speed/performance for your evolutionary algorithms.

PeterAJansen commented 2 years ago

Server side definitely makes sense, I think returning them from the API in the same order would be non-controversial even to me :)