cloderic closed this issue 1 year ago.
@cloderic I don't think the following line in the sample producer correctly represents the environment's terminal condition (i.e. `done`):
```python
done = sample.trial_state == cogment.TrialState.ENDED
```
The environment and the sample producer can run asynchronously, which can end a trial before all of its samples are collected. In an experiment with Pong games, I manually set the terminal condition to trigger after 11 steps, but because of that line the sample producer reported that the environment had ended after only 3 steps. This produced incorrect results, because every computation in PPO with a replay buffer that involves the environment's terminal condition depends on `done` being correct. What do you think?
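To make concrete why a premature `done` corrupts training, here is a minimal sketch of generalized advantage estimation (GAE), a standard way PPO computes advantages. It is not the project's implementation: it assumes plain Python lists of per-step rewards, value estimates and `done` flags for a single rollout. A `done` flag stops bootstrapping from the next state, so a spurious terminal at step 3 hides any reward that arrives later in the trial.

```python
from typing import List


def compute_gae(
    rewards: List[float],
    values: List[float],
    dones: List[bool],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over one rollout (illustrative sketch).

    dones[t] is True when step t is terminal: the next state's value is then
    not bootstrapped and the advantage recursion is reset.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Value of the state after step t (last_value for the final step).
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is masked out on terminal steps.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Accumulate discounted TD errors, cut off at episode boundaries.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

With `rewards = [0.0] * 10 + [1.0]` and all value estimates at zero, correct flags (`done` only on step 11) give step 1 an advantage of about 0.54, while an extra `done` on step 3 drives the advantages of steps 1 to 3 to exactly zero.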
This will need investigation.
> potentially ending trials before all samples are collected.
This shouldn't happen; if it does, it's a bug in the environment, the orchestrator or the datastore.
Just so I understand: do you get the expected behavior when the trial ends "on its own" (i.e. when the environment ends it)?
Yes, I do. I manually set the final observation to zeros when the environment ends. However, the sample producer did not receive this final observation when `done` was true. FYI: I only ran a single trial for this test in order to debug the issue. I will investigate further.
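For a single-trial debugging run like the one described, a small sanity check over the collected samples can pinpoint where the trial end goes wrong. The sketch below is hypothetical: `Sample` and `check_trial_end` are illustrative stand-ins, not Cogment APIs, and the `ended` field stands in for `sample.trial_state == cogment.TrialState.ENDED`. Steps are assumed to be numbered from 1, and the environment is assumed to emit an all-zero final observation, as above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """Hypothetical, simplified stand-in for a sample-producer sample."""

    step: int  # step number, starting at 1
    ended: bool  # stands in for `sample.trial_state == cogment.TrialState.ENDED`
    observation: Sequence[float]


def check_trial_end(samples: List[Sample], expected_steps: int) -> List[str]:
    """Return a list of problems with how the trial end appears in the samples."""
    if not samples:
        return ["no samples received"]
    problems = []
    ended_at = [s.step for s in samples if s.ended]
    if not ended_at:
        problems.append("no sample is marked as ended")
    elif ended_at[0] != expected_steps:
        problems.append(f"first ended sample at step {ended_at[0]}, expected {expected_steps}")
    last = samples[-1]
    if last.step != expected_steps:
        problems.append(f"last sample is step {last.step}, expected {expected_steps}")
    if any(x != 0.0 for x in last.observation):
        problems.append("last sample does not carry the all-zero final observation")
    return problems
```

Run against the scenario in this thread (an 11-step trial reported as ended at step 3, without the zero observation), it reports three problems; a well-formed trial yields an empty list.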
The intended behavior is described here: https://cogment.ai/docs/guide/development-guide#trial-end
@ha: performance is not good yet; coordinate with @wduguay-air.
Goal: Train an AI agent using PPO with a replay buffer to achieve good performance on the Pong game in the PettingZoo environment.
Acceptance Criteria