Farama-Foundation / D4RL

A collection of reference environments for offline reinforcement learning
Apache License 2.0

[Bug Report] antmaze-umaze-diverse-v0/v2 - reward/terminals inconsistent with other antmaze tasks #199

Open AlexBeesonWarwick opened 1 year ago

AlexBeesonWarwick commented 1 year ago

Hi there

I think there is a bug with the antmaze-umaze-diverse datasets (versions 0 and 2) relating to the rewards/terminal states.

For all other datasets there is a reward/terminal of 1 whenever the ant is near the goal (within a distance of 0.5, I believe). This means there are clusters of 1s in each trajectory. However, for antmaze-umaze-diverse there is only a single reward/terminal of 1 when the ant reaches the goal.

This can be seen more clearly by examining the rewards in each dataset.

```python
# Load environments and count transitions with reward == 1 in each dataset
import gym
import numpy as np
import d4rl

env = gym.make('antmaze-umaze-v0')
dataset = d4rl.qlearning_dataset(env)
print("UMaze", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-umaze-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("UMazeDiverse", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-medium-play-v0')
dataset = d4rl.qlearning_dataset(env)
print("MediumPlay", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-medium-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("MediumDiverse", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-large-play-v0')
dataset = d4rl.qlearning_dataset(env)
print("LargePlay", np.sum(dataset["rewards"] == 1))

env = gym.make('antmaze-large-diverse-v0')
dataset = d4rl.qlearning_dataset(env)
print("LargeDiverse", np.sum(dataset["rewards"] == 1))
```

```
Target Goal: (0.4838825322097567, 8.732528317500655)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.87it/s]
UMaze 8727
Target Goal: (1.0575017737777725, 8.70337244474823)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.82it/s]
UMazeDiverse 36
Target Goal: (21.137351890943883, 20.475762531903587)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.83it/s]
MediumPlay 9787
Target Goal: (20.652403557325318, 21.17031934473855)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 7.02it/s]
MediumDiverse 1959
Target Goal: (32.36249313845059, 24.87809887647287)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.96it/s]
LargePlay 12517
Target Goal: (32.10479442488056, 24.957476192143)
load datafile: 100%|██████████| 8/8 [00:01<00:00, 6.99it/s]
LargeDiverse 6189
```

You can see there are far fewer transitions with rewards == 1 (i.e. reaching the goal) for antmaze-umaze-diverse.
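If the goal-proximity convention described above is right (reward 1 within a 0.5 radius of the goal), the other datasets' rewards can be reproduced from the ant's xy positions. Here is a minimal sketch of that relabeling; `relabel_antmaze_rewards` is a hypothetical helper written for illustration, not part of D4RL, and the 0.5 radius is an assumption from the discussion above:

```python
import numpy as np

def relabel_antmaze_rewards(xy, goal, radius=0.5):
    """Sparse goal-reaching reward: 1.0 whenever the ant's xy position is
    within `radius` of `goal`, else 0.0.

    xy:   (N, 2) array of ant positions (the first two observation dims).
    goal: (x, y) target position reported by the environment.
    """
    dist = np.linalg.norm(np.asarray(xy) - np.asarray(goal), axis=1)
    return (dist <= radius).astype(np.float32)
```

Applying something like this to the umaze-diverse observations and comparing against the stored rewards would show directly whether that dataset follows a different convention from the rest.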

Is this a bug, or is it meant to be this way? Presumably the former, so that all datasets are consistent.

Many thanks

quantumiracle commented 1 year ago

This is the same in the original Berkeley dataset; try:

```
pip install d4rl@git+https://github.com/rail-berkeley/d4rl@d842aa194b416e564e54b0730d9f934e3e32f854
```