Farama-Foundation / D4RL

A collection of reference environments for offline reinforcement learning
Apache License 2.0

[Question] Expert score for maze2d environment may be wrong #215

Open onceagain8 opened 1 year ago

onceagain8 commented 1 year ago

Summary

1. The reference (expert) score for the maze2d environments is computed incorrectly.
2. The incorrect score is a result of the expert policy not being invoked properly between episodes.
3. The true expert scores should be higher than the currently reported ones.

Description:

Environment: maze2d

If you use the provided script (scripts/reference_scores/maze2d_controller.py) to compute the expert policy's score, it can yield inaccurate results:

import gym
import numpy as np

import d4rl  # registers the maze2d environments with gym
from d4rl.pointmaze import waypoint_controller

# args (env_name, num_episodes) comes from the argparse setup in the original script.
env = gym.make(args.env_name)
env.seed(0)
np.random.seed(0)
controller = waypoint_controller.WaypointController(env.str_maze_spec)

ravg = []
for _ in range(args.num_episodes):
    s = env.reset()
    returns = 0
    for t in range(env._max_episode_steps):
        position = s[0:2]
        velocity = s[2:4]
        act, done = controller.get_action(position, velocity, env.get_target())
        s, rew, _, _ = env.step(act)
        returns += rew
    ravg.append(returns)
print(args.env_name, 'returns', np.mean(ravg))

The WaypointController (the expert policy) only behaves correctly in the first episode; in subsequent episodes it typically fails to reach the goal in the maze2d environment.

Why does this happen?

The issue arises from the expert policy implemented in d4rl/pointmaze/waypoint_controller.py. Specifically, the get_action function is the expert's action-selection mechanism, and it contains the following code snippet:

if np.linalg.norm(self._target - np.array(self.gridify_state(target))) > 1e-3: 
    #print('New target!', target, 'old:', self._target)
    self._new_target(location, target)

This means the waypoints are recomputed only when the target location changes.

Combined with the code in scripts/reference_scores/maze2d_controller.py, this implies that self._new_target() is executed only at the beginning of the first episode, because env.reset() does not change the target. In every subsequent episode the waypoints are never recomputed; instead, the waypoints planned for the first trajectory are reused, so the supposedly optimal policy fails to reach the goal.
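One simple way around this (a minimal sketch of a possible fix, using the same objects as the script above; it is not necessarily the exact modification behind the numbers reported below) is to force the controller to re-plan its waypoints at every reset:

ravg = []
for _ in range(args.num_episodes):
    s = env.reset()
    # Re-plan waypoints from the new start position instead of silently
    # reusing the waypoints computed during the previous episode.
    controller._new_target(s[0:2], env.get_target())
    returns = 0
    for t in range(env._max_episode_steps):
        act, done = controller.get_action(s[0:2], s[2:4], env.get_target())
        s, rew, _, _ = env.step(act)
        returns += rew
    ravg.append(returns)

Constructing a fresh WaypointController at the start of each episode should achieve the same effect.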

Experiment

After adding env.render() to scripts/reference_scores/maze2d_controller.py, I observed that the expert policy indeed fails to reach the target point. A video has been uploaded to Google Drive:

https://drive.google.com/file/d/13OF_z3hBAzcxX5upg6byVZrteWnJeFao/view?usp=sharing.

After modifying the code, I re-evaluated the expert policy across the different environments. The results are presented below:

env_name              maze2d-umaze-v1   maze2d-medium-v1   maze2d-large-v1
expert policy (new)   223.48            420.48             551.23
expert policy (old)   161.86            277.39             273.99
HamedDi81 commented 11 months ago

Hi, I think you're right. I trained a Decision Transformer on the maze2d-medium-dense-v1 environment and calculated the normalized score with env.get_normalized_score(average return over 100 episodes). However, I obtained a score of 56, which does not align with the maximum score of 35 reported in the paper "QDT: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL". Have you calculated the expert score for maze2d-medium-dense-v1?
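For anyone comparing these numbers: env.get_normalized_score rescales the raw return against fixed reference values (the REF_MIN_SCORE / REF_MAX_SCORE constants in d4rl/infos.py). A rough, illustrative sketch of the relationship:

# Illustrative sketch, mirroring how d4rl normalizes returns; the actual
# reference values live in d4rl/infos.py (REF_MIN_SCORE / REF_MAX_SCORE).
def normalized_score(raw_return, ref_min_score, ref_max_score):
    return (raw_return - ref_min_score) / (ref_max_score - ref_min_score)

Papers usually report 100 * normalized_score. If the expert reference (ref_max_score) underestimates the true expert return, as the table above suggests for maze2d, normalized scores come out inflated and can even exceed 100.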

zhyaoch commented 8 months ago

> Hi, I think you're right. I trained a Decision Transformer on the maze2d-medium-dense-v1 environment and calculated the normalized score with env.get_normalized_score(average return over 100 episodes). However, I obtained a score of 56, which does not align with the maximum score of 35 reported in the paper "QDT: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL". Have you calculated the expert score for maze2d-medium-dense-v1?

Hi, I'm also attempting to calculate the normalized score with env.get_normalized_score(average return over 100 episodes) on the antmaze tasks, but I can't reproduce the scores reported in the paper. Have you found a solution to this issue?