Closed: AOAA96 closed this issue 2 months ago
@araffin sorry I am not clear on what I am missing from the checklist.
The provided code is not minimal or working (please check the link for an explanation) and you should solve the env checker warnings first.
Changing the action space to be between -1 and 1 resolved the issue.
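For anyone landing here later, the fix amounts to letting the agent act in [-1, 1] and mapping that range to the environment's native action range inside the env. A minimal sketch of that mapping (the function name and ranges are illustrative, not from the original code):

```python
import numpy as np

def rescale_action(action, low, high):
    # Linearly map an action from the agent's [-1, 1] range
    # to the environment's native [low, high] range.
    return low + 0.5 * (action + 1.0) * (high - low)

# Native range [0, 1], as in this issue:
# -1 -> 0.0, 0 -> 0.5, 1 -> 1.0
scaled = rescale_action(np.array([-1.0, 0.0, 1.0]), 0.0, 1.0)
```

With this mapping the declared `Box(-1, 1)` action space matches what TD3 actually outputs, so nothing gets silently clipped before it reaches the env.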
🐛 Bug
The element in index number 2 in the next observation is the action selected by the TD3 agent based on the current observation. For example, if the current observation is [0, 0.23, 0.45, 0.85], and the action is 0.55, the next observation would be something like [0.1, 0.25, 0.55, 0.80]. This works well when I test a step in the environment, and I have provided an image that demonstrates this. Check the image called "checking_env". Notice how the actions correctly correspond to the element in index number 2 in the next observation. I have color-coded for ease of reading.
However, the same is not observed in the replay buffer samples: the action selected by the agent does not correspond to the element at index 2 of the next observation. I believe this is somehow affecting the agent's learning. Check the image called "replaybuffer". The element at index 2 of the observation should also always be clipped between 0 and 1, yet the fourth action in the replay buffer is not clipped to zero in the fourth "next_observation".
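If it helps to make the discrepancy concrete: with SB3 you can pull a batch via `model.replay_buffer.sample(batch_size)` and check the invariant described above. The helper below is hypothetical (not part of SB3) and assumes plain NumPy arrays for actions and next observations:

```python
import numpy as np

def transitions_consistent(actions, next_observations, atol=1e-6):
    # The env dynamics described in this report say that index 2 of each
    # next observation must equal the stored action clipped to [0, 1].
    expected = np.clip(actions.ravel(), 0.0, 1.0)
    return np.allclose(next_observations[:, 2], expected, atol=atol)

# Toy batch: the third transition violates the invariant (0.7 != clip(-0.5)).
acts = np.array([[0.2], [0.9], [-0.5]])
next_obs = np.array([[0.0, 0.1, 0.2, 0.3],
                     [0.1, 0.2, 0.9, 0.4],
                     [0.2, 0.3, 0.7, 0.5]])
ok = transitions_consistent(acts, next_obs)
```

Running this kind of check on real buffer samples would pinpoint exactly which transitions disagree.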
To rule out noise as the cause of this discrepancy, the action_noise argument of the TD3 model was left as None.
I would love to share more code, but it will involve some sensitive information that I do not know yet if I can share.
Code example
Action and observation spaces from `__init__`.
`step` and `reset` methods.
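Since the original env code cannot be shared, here is a minimal stand-in consistent with the report: a 4-dim observation, a scalar action, the agent acting in [-1, 1], and the env rescaling internally (which is the change that closed the issue). All names and dynamics below are hypothetical:

```python
import numpy as np

class SketchEnv:
    """Hypothetical stand-in for the reporter's unshared custom env."""

    def __init__(self):
        # Agent acts in [-1, 1]; the env's native action range is [0, 1].
        self.action_low, self.action_high = 0.0, 1.0
        self.state = np.zeros(4, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(4, dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        # Map the [-1, 1] action to the native [0, 1] range.
        native = self.action_low + 0.5 * (action + 1.0) * (
            self.action_high - self.action_low
        )
        # Index 2 of the observation mirrors the executed action,
        # clipped to [0, 1], as described in the report.
        self.state[2] = np.clip(native, 0.0, 1.0)
        reward, done = 0.0, False
        return self.state.copy(), reward, done, {}
```

In a real setup this would subclass `gymnasium.Env` with `action_space = Box(-1, 1, shape=(1,))`, so the declared space matches what TD3 samples and stores in the replay buffer.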
Relevant log output / Error message
No response
System Info
No response
Checklist