Hi there
I think there is a bug in the antmaze-umaze-diverse datasets (versions 0 and 2) relating to the rewards/terminal states.
For all of the other datasets there is a reward/terminal of 1 whenever the ant is near the goal (within a radius of 0.5, I believe), so each successful trajectory contains a cluster of 1s. For antmaze-umaze-diverse, however, there is only a single reward/terminal of 1 at the step where the ant reaches the goal.
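As a quick sanity check of that 0.5 radius, something like the following can be used. This is a rough sketch, assuming the first two observation entries are the ant's (x, y) position and that the environment exposes its goal as `env.target_goal` (which is where the "Target Goal" printout below appears to come from):

```python
import gym
import numpy as np
import d4rl

env = gym.make('antmaze-umaze-v0')
dataset = d4rl.qlearning_dataset(env)
goal = np.array(env.target_goal)  # assumption: the env exposes its goal this way

# Distance from the ant's (x, y) position to the goal after each transition
dists = np.linalg.norm(dataset["next_observations"][:, :2] - goal, axis=1)
rewarded = dataset["rewards"] == 1
print("distance range when reward == 1:", dists[rewarded].min(), dists[rewarded].max())
```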
This can be seen more clearly by examining the rewards in each dataset.
```python
import gym
import numpy as np
import d4rl

# Load each antmaze environment and count the transitions with reward == 1
env = gym.make('antmaze-umaze-v0')
dataset = d4rl.qlearning_dataset(env)
print("UMaze", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-umaze-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("UMazeDiverse", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-medium-play-v0')
dataset = d4rl.qlearning_dataset(env)
print("MediumPlay", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-medium-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("MediumDiverse", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-large-play-v0')
dataset = d4rl.qlearning_dataset(env)
print("LargePlay", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-large-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("LargeDiverse", np.sum(dataset["rewards"] == 1))
```
```
Target Goal: (0.4838825322097567, 8.732528317500655)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.87it/s]
UMaze 8727
Target Goal: (1.0575017737777725, 8.70337244474823)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.82it/s]
UMazeDiverse 36
Target Goal: (21.137351890943883, 20.475762531903587)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.83it/s]
MediumPlay 9787
Target Goal: (20.652403557325318, 21.17031934473855)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 7.02it/s]
MediumDiverse 1959
Target Goal: (32.36249313845059, 24.87809887647287)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.96it/s]
LargePlay 12517
Target Goal: (32.10479442488056, 24.957476192143)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.99it/s]
LargeDiverse 6189
```
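To make the cluster structure explicit, one can also count contiguous runs of 1s rather than individual 1s. A sketch, where `dataset` is any of the `qlearning_dataset` results from above:

```python
# Count contiguous runs of reward == 1 in the flat rewards array.
ones = (dataset["rewards"] == 1).astype(int)
# A run starts at index 0 if the array begins with a 1, and wherever a 1 follows a 0.
num_clusters = ones[0] + np.sum(np.diff(ones) == 1)
print("reward clusters:", num_clusters)
```

If the behaviour is as described, the other datasets should show far fewer clusters than individual 1s, while for antmaze-umaze-diverse the two counts should roughly coincide.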
You can see there are far fewer `rewards == 1` transitions (i.e. the ant reaching the goal) for antmaze-umaze-diverse than for any of the other datasets.
Is this a bug, or is it meant to be this way? Presumably the former, so that all the datasets are consistent.
Many thanks