Closed daihuiao closed 1 year ago
Hi there, I think this is because the D4RL datasets contain trajectories without 'timeout' limits, which differs from the default gym env (usually T=1000). This helps the agent learn where the true terminals are.
For more details, please refer to the D4RL repo.
Thank you for your reply.
I printed the reward information in the dataset, but I got a particularly huge return, "Max return: 3780163.00, min: -6.61", and the trajectory length was 768445. Is there something wrong with the dataset I used? (env_name is walker2d-medium-expert-v2)
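A return and trajectory length that large usually mean the rewards were summed over the whole flat dataset without splitting it into episodes. A minimal sketch of per-trajectory returns, assuming the standard D4RL field names (`rewards`, `terminals`, `timeouts` from `env.get_dataset()`); the function and synthetic data below are illustrative, not part of D4RL itself:

```python
import numpy as np

def trajectory_returns(rewards, terminals, timeouts):
    """Split flat D4RL-style arrays into per-trajectory returns.

    An episode ends wherever either `terminals` or `timeouts` is set;
    without this split, summing `rewards` end-to-end produces a single
    huge "return" spanning the entire dataset.
    """
    returns, total, steps = [], 0.0, 0
    for r, done in zip(rewards, np.logical_or(terminals, timeouts)):
        total += r
        steps += 1
        if done:
            returns.append(total)
            total, steps = 0.0, 0
    if steps:  # trailing partial trajectory, if any
        returns.append(total)
    return returns

# Synthetic example: two episodes, the first ending via terminal,
# the second via timeout.
rewards = np.array([1.0, 2.0, 3.0, 4.0])
terminals = np.array([0, 0, 1, 0], dtype=bool)
timeouts = np.array([0, 0, 0, 1], dtype=bool)
print(trajectory_returns(rewards, terminals, timeouts))  # [6.0, 4.0]
```

With real data, `max(returns)` and `min(returns)` computed this way should land in the normal walker2d score range rather than in the millions.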