Frame and action offset

In generate.py, the model appears to condition on the action at the current timestep when denoising the current frame. In the sample data (e.g. snippy-chartreuse-mastiff-f79998db196d-20220401-224517.chunk_001 from the VPT contractor dataset), I believe the action at index t is the action the contractor took after seeing frame t, so its effect would only be visible in frame t + 1. Is that right?

In other words, the current frame seems to be conditioned on the action taken after it, when it should be conditioned only on actions up to and including the last one taken before it. At inference time this would add one timestep (about 50 ms) of extra latency before an action takes effect.
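
To make the offset concrete, here is a minimal sketch of the alignment I would have expected. It is not code from this repo: the function name `shift_actions_for_conditioning`, the `(T, action_dim)` array layout, and the use of an all-zero row as a "no-op" action for the first frame are all my own assumptions for illustration.

```python
import numpy as np


def shift_actions_for_conditioning(actions: np.ndarray) -> np.ndarray:
    """Shift actions by one step so that row t holds the action taken after frame t - 1.

    Hypothetical helper, not part of open-oasis.
    Assumes `actions` has shape (T, action_dim), where actions[t] is the
    action the contractor took after seeing frame t.
    """
    shifted = np.zeros_like(actions)
    # Frame 0 has no preceding action; an all-zero row stands in for a no-op (assumption).
    shifted[1:] = actions[:-1]
    return shifted


# Toy example: 4 timesteps, 1-dimensional action with values 1..4.
actions = np.arange(1, 5, dtype=np.float32).reshape(4, 1)
shifted = shift_actions_for_conditioning(actions)

# Pairing as I read the data:  frame t is paired with actions[t] (the action taken after frame t).
# Pairing I would expect:      frame t is paired with shifted[t] == actions[t - 1].
for t in range(len(actions)):
    print(f"frame {t}: same-index action={actions[t, 0]}, shifted action={shifted[t, 0]}")
```

If the same-index pairing is what was used in training, then applying a shift like this only at inference would not help, since the model would have learned the other convention. That is why I'm asking about the training setup below.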

I'm not 100% confident in these findings. Assuming the analysis is correct, is the issue limited to running generate.py on this specific example, or does it reflect how observations were paired with actions when training the Oasis models in general?

etched-ai / open-oasis, issue #20