Closed aphedges closed 2 years ago
Hi Alex,
Thank you for this contribution!
I have worked a bunch to iron out the non-determinism in the environment (the simulator was originally designed to have a lot of random variability in it, so there has been a bit of work in making it deterministic. I think we have ironed out 99%+ of the issues).
For the two main sources of non-determinism in this commit (the order of the output text, e.g. "you see an apple and an orange" vs "you see an orange and an apple"), and the valid action ordering (e.g. ["eat apple", "eat orange"] vs ["eat orange", "eat apple"]), I intentionally kept these in to help encourage model robustness. But that was a decision I made entirely on my own and I'm very happy to be convinced it's better to be deterministic. I just worry about unintended consequences -- for example, I discovered an issue with another popular text environment with consistent deterministic output ordering/sorting, where the environment could be trivially solved by taking (something like) the 3rd action from the start then the 15th action from the end each time, and I'm hoping to avoid those issues as much as possible.
If the determinism in the output and valid action ordering are something we want in ScienceWorld, instead of sorting the lists, the technique I've been sometimes using is to shuffle the list using a random generator that's initialized on a seed that is the variation number for that environment. That way things are still repeatable, but they're not shuffled in alphabetical order, which might cause other opportunities for models to hijack regularities in the input. That might be an option here, too. The issue is that it's just slightly more complex, since the task variation/random generator/etc. isn't exposed all the way down to the object level, where a lot of this generation happens.
I appreciate that you took the time to respond! Sorry for taking so long to get back to you. I've been focused on other tasks.
For the two main sources of non-determinism in this commit (the order of the output text, e.g. "you see an apple and an orange" vs "you see an orange and an apple"), and the valid action ordering (e.g. ["eat apple", "eat orange"] vs ["eat orange", "eat apple"]), I intentionally kept these in to help encourage model robustness. But that was a decision I made entirely on my own and I'm very happy to be convinced it's better to be deterministic. I just worry about unintended consequences -- for example, I discovered an issue with another popular text environment with consistent deterministic output ordering/sorting, where the environment could be trivially solved by taking (something like) the 3rd action from the start then the 15th action from the end each time, and I'm hoping to avoid those issues as much as possible.
This makes a lot of sense. In the models I've been working with, I saw one run that simply went down the list of actions in order for successive turns. If we used that kind of ordering in training, I feel it can definitely cause some problems with model robustness.
However, I think it might make sense to have a distinction between text output and structured output from the ScienceWorld API. The output text is likely not being parsed and is being passed into a model, so it makes more sense to have it randomized. The structured output, such as the list of actions, can be easily sorted or shuffled by an end user as they see fit. Before I made this PR, I just used a sorted()
call on the output for consistency. I understand if this is something you don't want to take care of.
If the determinism in the output and valid action ordering are something we want in ScienceWorld, instead of sorting the lists, the technique I've been sometimes using is to shuffle the list using a random generator that's initialized on a seed that is the variation number for that environment. That way things are still repeatable, but they're not shuffled in alphabetical order, which might cause other opportunities for models to hijack regularities in the input. That might be an option here, too. The issue is that it's just slightly more complex, since the task variation/random generator/etc. isn't exposed all the way down to the object level, where a lot of this generation happens.
This sounds like a good compromise between determinism and robustness, but sorting is still needed for shuffling. For example, see the following: https://github.com/allenai/ScienceWorld/blob/22a70ce40e23ac9ca28e03f5b6f176a2922a8f50/simulator/src/main/scala/scienceworld/tasks/specifictasks/TaskFindLivingNonLiving.scala#L199
objsInRoom
is an unordered set, so even having a random seed set produces a different output on each run because of differences in Random.shuffle
's input.
If the solution is difficult, I completely understand if that won't be done.
I have two questions about moving this PR forward:
Make getValidActionObjectCombinations() deterministic
) and c47f3b4af02bc6f5d6dac31fb290f3ed3fbb8a54 (Make "look around" output deterministic
) the only ones you object to? If so, would dropping the commits and recompiling be sufficient to get this PR merged?Make gold path for finding non-living object deterministic
) will be needed.@aphedges I just merged your other PR with the build changes. Can you rebase your branch on the main one?
Also, I'm all for adding more determinism in the game generation and to control it with a dedicated RNG (like in TextWorld ;)).
@MarcCote, thanks! As part of the rebase, I'll drop some of the nondeterminism fixes until later. We (at ISI East) have found it useful so far and don't want to eliminate it until it can be controlled by a seed.
@MarcCote, it should be ready to merge
@aphedges can you open a separate PR or point me to the rest of the nondeterministic fixes.
@MarcCote, sure! I'll need to write up some stuff about them first, so I don't expect to get to them for a week or two.
@PeterAJansen are you okay with adding the .sorted
on the server side? It would had a bit of overhead on each .reset
and .step
calls. I'm thinking that this would impact speed/performance for your evolutionary algorithms.
Server side definitely makes sense, I think returning them from the API in the same order would be non-controversial even to me :)
While using ScienceWorld, I discovered multiple places where the program output was non-deterministic, even though the seed was set. This made my testing more difficult because each run could give a different output, even when given the same input, producing very large diffs. All problems I found were caused by the fact that Scala's and Python's set implementations have non-deterministic iteration orders. I did not do an exhaustive search, but I fixed all the cases I found during my testing. My solution was either to use a list if set access isn't needed or to sort the set contents before iteration. If this kind of contribution is welcome, I might make future PRs if other instances cause me problems.
While I was at it, I made various fixes and improvements to building the JAR file, such as fixing a bug in the
package.sh
and resolving somesbt
warnings. I updated the JAR itself in a separate commit in case you want to drop the commit and replace it with your own. I personally wouldn't trust a JAR file from a random person on the internet, so I completely understand if you don't, either.All code I used for testing is in test_files.zip. It includes a test script (
test_nondeterminism.py
), console output before and after my fixes (output_before.txt
andoutput_after.txt
) and JSON output of actions to make non-determinism easier to spot (actions_sorted.json
andactions_unsorted.json
).