etched-ai / open-oasis

Inference script for Oasis 500M
MIT License

Frame and action offset #20

jxiong21029 closed 2 weeks ago

jxiong21029 commented 3 weeks ago

In generate.py, the model appears to condition on the action at the current timestep when denoising the current frame. In the sample data (e.g. snippy-chartreuse-mastiff-f79998db196d-20220401-224517.chunk_001 from the VPT contractor dataset), I believe the action at index t is the one the contractor took after seeing frame t, so its effect would only be visible in frame t + 1.

In other words, when generating the current frame, you seem to be conditioning on the action taken after that frame, whereas the frame should only be conditioned on actions up to and including the last one taken before it. At inference time, this would add one extra timestep (~50 ms) of latency before an action takes effect.
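To make the offset concrete, here is a toy sketch. It is not code from generate.py: the `rollout` function, the integer "frames", and the variable names are all hypothetical, and the only assumption carried over is that the action at index t is taken after seeing frame t.

```python
# Toy illustration only: hypothetical names, not code from generate.py.
# frames[t] is what the contractor sees; actions[t] is what they do next.

def rollout(start, actions):
    """Toy world: a frame is an integer position, and each action moves it."""
    frames = [start]
    for a in actions:
        frames.append(frames[-1] + a)  # actions[t] produces frames[t + 1]
    return frames

actions = [0, 1, 0, -1]
frames = rollout(0, actions)

# Change visible in frame t relative to frame t - 1, for t >= 1.
deltas = [frames[t] - frames[t - 1] for t in range(1, len(frames))]

# Pairing frame t with actions[t - 1] accounts for every visible change.
causal_pairs = [(t, actions[t - 1]) for t in range(1, len(frames))]

# Pairing frame t with actions[t] is shifted by one step, so the action
# only shows up in the frame after the one it is paired with.
same_index_pairs = [(t, actions[t]) for t in range(1, len(actions))]
```

In this toy setup, `deltas` matches the actions in `causal_pairs` exactly, while the actions in `same_index_pairs` are one step ahead of the changes they are paired with.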

I'm not 100% confident about these findings. Assuming the analysis is correct, is this issue limited to running generate.py with this specific example, or does it reflect how observations were paired with actions when the Oasis models were trained in general?

julian-q commented 2 weeks ago

Thanks @jxiong21029 - I think we were missing the "all zeros" action that we usually concatenate to the beginning of the sequence. Fixed!
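For readers following along, a minimal sketch of what prepending an "all zeros" action could look like, assuming actions are stored as one vector per timestep. The function name `align_actions` and the list-of-lists layout are hypothetical, and this is not the actual change made in the repository.

```python
# Hypothetical helper, not the repository's code: shift actions by one step
# by putting an all-zeros action in front of the sequence.

def align_actions(actions, num_frames):
    """Return actions aligned so that index t holds the action taken
    before frame t; index 0 holds an all-zeros action."""
    if not actions:
        raise ValueError("need at least one action to infer the action size")
    null_action = [0.0] * len(actions[0])
    shifted = [null_action] + [list(a) for a in actions]
    # Keep one action per frame; any trailing action has no frame to affect.
    return shifted[:num_frames]

raw_actions = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
aligned = align_actions(raw_actions, num_frames=3)
```

With this alignment, the first frame is conditioned on the all-zeros action, and each later frame t is conditioned on the action recorded at index t - 1 of the raw data.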