Open onceagain8 opened 1 year ago
Hi, I think you're right. I trained the decision transformer in the maze2d-medium-dense-v1 environment and calculated the normalized score with `env.get_normalized_score(average return over 100 episodes)`. However, I obtained a score of 56, which does not align with the maximum score of 35 reported in the paper "QDT: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL". Have you calculated the expert score for maze2d-medium-dense-v1?
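For reference, d4rl's `get_normalized_score` applies a linear scaling between per-environment reference returns (random and expert), and papers then usually multiply the result by 100. A minimal sketch of that formula, with purely illustrative reference values (not d4rl's actual maze2d-medium-dense-v1 entries):

```python
def normalized_score(avg_return, ref_min, ref_max):
    """Mirrors d4rl's linear scaling: 0.0 at the reference minimum
    (random policy), 1.0 at the reference maximum."""
    return (avg_return - ref_min) / (ref_max - ref_min)

# Illustrative numbers only -- not d4rl's actual reference returns:
score = 100 * normalized_score(avg_return=56.0, ref_min=0.0, ref_max=224.0)
print(score)  # prints 25.0
```

A mismatch like 56 vs. 35 therefore usually comes from either the reference values used or from the raw return itself, which is why the expert-score computation below matters.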
Hi, I'm also attempting to calculate the normalized score with `env.get_normalized_score(average return over 100 episodes)` in the antmaze task, but I can't get the score reported in the paper. Have you found a solution to this issue?
Summary
Description:
Environment: maze2d
If you use the provided script `scripts/reference_scores/maze2d_controller.py` to calculate the score of the expert policy, it may yield inaccurate results. The WaypointController policy (the expert policy) only behaves correctly in the first episode; in subsequent episodes it is likely to fail to reach the goal in the maze2d environment.
Why does this happen?
The issue arises from the expert policy implemented in `d4rl/pointmaze/waypoint_controller.py`. Specifically, the `get_action` function serves as the action-selection mechanism for the expert policy, and it contains a check under which the waypoints are recalculated only when the endpoint (the goal) changes.
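That control flow can be sketched with a toy controller; this is an illustrative reconstruction of the logic just described, not the verbatim d4rl source, and `_new_target` here is a stand-in for the real waypoint planner:

```python
import math

class WaypointControllerSketch:
    """Illustrative sketch of the described control flow; NOT the
    verbatim d4rl source. `_new_target` stands in for the planner."""

    def __init__(self):
        self._target = (math.inf, math.inf)  # guarantees planning on the first call
        self._waypoints = []
        self.plans = 0  # counts planner invocations, for demonstration only

    def _new_target(self, start, target):
        self.plans += 1
        self._target = target
        self._waypoints = [target]  # placeholder for the real planned path

    def get_action(self, location, velocity, target):
        # Waypoints are recomputed ONLY when the goal moves. If env.reset()
        # leaves the goal unchanged, later episodes reuse the stale path
        # planned from the first episode's start state.
        if math.dist(self._target, target) > 1e-3:
            self._new_target(location, target)
        wx, wy = self._waypoints[0]
        return (wx - location[0], wy - location[1])  # toy proportional action
```

Calling `get_action` across episodes with an unchanged goal never re-enters `_new_target`, which is exactly the failure mode at issue here.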
Considering the code in `scripts/reference_scores/maze2d_controller.py`, the `self._new_target()` function is executed only at the beginning of the first episode, because `env.reset()` does not change the endpoint. Consequently, in subsequent episodes the waypoints are not recalculated; the waypoints from the initial trajectory are reused, and the "optimal" policy fails to reach the goal.

Experiment
After adding `env.render()` to `scripts/reference_scores/maze2d_controller.py`, I observed that the expert policy indeed fails to reach the target point. The video has been uploaded to Google Drive:
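One possible work-around, given the diagnosis above, is to invalidate the controller's cached goal after every `env.reset()` so the planner reruns each episode (equivalently, the script could recreate the controller per episode). A toy sketch with illustrative names, not the real script:

```python
import math

class ControllerSketch:
    """Toy stand-in for the waypoint controller; `plans` counts replans."""

    def __init__(self):
        self._target = (math.inf, math.inf)
        self.plans = 0

    def reset(self):
        # Forgetting the cached goal forces replanning on the next
        # get_action, even if env.reset() keeps the same goal location.
        self._target = (math.inf, math.inf)

    def get_action(self, location, target):
        if math.dist(self._target, target) > 1e-3:
            self.plans += 1  # stand-in for the real waypoint replanning
            self._target = target
        return (target[0] - location[0], target[1] - location[1])

ctrl = ControllerSketch()
for episode in range(3):
    ctrl.reset()  # the work-around: one controller reset per episode
    ctrl.get_action((0.0, 0.0), (3.0, 3.0))
print(ctrl.plans)  # prints 3: one replan per episode instead of one total
```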