Farama-Foundation / Minari

A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
https://minari.farama.org

[Question] Please, state clearly in the documentation and dataset definition if in a time step "r_0" is consequence of "a_0" #74

Open jamartinh opened 1 year ago

jamartinh commented 1 year ago

Question

Hi, please state clearly in the documentation and the dataset definition whether, within a time step, "r_0" is the consequence of "a_0".

With previous offline RL libraries, there has been some confusion in this respect.

With the standard in RL being $(s, a, r, s')$, one assumes that $r$ is the consequence of applying action $a$ in state $s$.

If it is not, please state that clearly, because then $r(s, a)$ should be $r_1$ and not $r_0$.
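
To make the ambiguity concrete, here is a minimal rollout sketch (plain Gymnasium, nothing Minari-specific) showing the two possible indexing conventions:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # s_0

observations, actions, rewards = [obs], [], []
for t in range(5):
    action = env.action_space.sample()  # a_t, chosen in s_t
    obs, reward, terminated, truncated, info = env.step(action)
    actions.append(action)
    # Convention A: this reward is r_t = r(s_t, a_t), so r_0 is the consequence of a_0.
    # Convention B: this reward is r_{t+1}, so the first stored reward would be r_1.
    rewards.append(reward)
    observations.append(obs)            # s_{t+1}
    if terminated or truncated:
        break
env.close()
```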

Thanks !

balisujohn commented 1 year ago

I agree that's good to mention.

It is implied in the code block shown on the dataset standards page: https://minari.farama.org/main/content/dataset_standards/

But I think further clarification wouldn't hurt, so I'll make a PR.

jamartinh commented 1 year ago

Thanks @balisujohn, now I am even more confused.

For me, this is déjà vu from when I was working on the D3RLpy offline RL library. The code:

https://github.com/Farama-Foundation/Minari/blob/7d1682961039cabd4aae54a1ddb55ed64e877f1e/minari/data_collector/data_collector.py#L182-L193

With this data collector, the action and the reward it produces are on the same "row"; however, the state in which the action was taken is not in the same "row".
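
As I read that code, a logged episode looks roughly like this (a hypothetical illustration of my understanding, with made-up keys, not Minari's actual internals):

```python
# Hypothetical illustration (keys are made up, not Minari's actual ones).
episode_buffer = [
    {"observation": "s_0", "action": None,  "reward": None},   # from env.reset()
    {"observation": "s_1", "action": "a_0", "reward": "r_0"},  # from env.step(a_0)
    {"observation": "s_2", "action": "a_1", "reward": "r_1"},  # from env.step(a_1)
]
# a_0 and its reward r_0 share a row, but that row holds s_1;
# s_0, the state in which a_0 was taken, sits in the previous row.
```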

Also, in previous discussions on these kinds of datasets, we concluded that the "original" D4RL datasets were in the format that is actually used in the replay buffers implemented in almost all RL libraries:

$$(s, a, r, s', \text{terminated}, \text{truncated}, \text{info})$$

That is, a "full iteration", not just the output of a single env.step.

So in just one "row" we have the state, the action taken in that state, the corresponding reward for taking that action in that state, the subsequent state (required by on-policy learning), the terminal flags, the info, etc.

This is basically the format of a replay buffer, the format one expects from a dataset that describes a control task intended for use with RL.
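
In code, one such "row" would be something like this (a sketch; the names are only illustrative):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One replay-buffer "row": everything needed to learn from a single env.step."""
    state: Any        # s
    action: Any       # a, taken in s
    reward: float     # r = r(s, a)
    next_state: Any   # s'
    terminated: bool
    truncated: bool
    info: dict
```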

In the documentation: is $t$ starting from 0? or from 1?

What is the value of $r_0$ ?

As said, thanks again, but please take ALL THE CARE with this issue, since it is confusing for people and also a "hidden" source of bad training. People can make severe mistakes when using the data for training by assuming something that is not actually the case.

rodrigodelazcano commented 1 year ago

Hi @jamartinh. The datasets have the structure you are looking for: the data collector stores the action and its resulting reward on the same "row", and the state in which the action was taken is ALSO in the same "row". The first state that is recorded is the one returned by env.reset(), thus the first timestep is t=0. You can also see that the last state of an episode is stored, since each episode's observations array has one extra element compared to the rewards or actions arrays.
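
So recovering (s, a, r, s') tuples from a stored episode is just a matter of shifting the observations array by one. Roughly (sketch code; the dataset id is only a placeholder, and I'm assuming a flat observation space):

```python
import minari

dataset = minari.load_dataset("door-human-v1")  # placeholder dataset id

for episode in dataset.iterate_episodes():
    obs, actions, rewards = episode.observations, episode.actions, episode.rewards
    assert len(obs) == len(actions) + 1 == len(rewards) + 1  # extra final observation

    s, s_next = obs[:-1], obs[1:]   # s_t and s_{t+1}
    # rewards[t] is the reward for taking actions[t] in s[t]
    transitions = list(zip(s, actions, rewards, s_next,
                           episode.terminations, episode.truncations))
```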

@balisujohn shared the code to convert an episode's data to the (s,a,r,s') format. However, I agree that we should update the documentation to make this clearer, or add an extra array with s' to each episode. Sorry for the confusion.