etched-ai / open-oasis

Inference script for Oasis 500M
MIT License
1.51k stars 126 forks

Training and inference context lengths #19

Open jxiong21029 opened 2 weeks ago

jxiong21029 commented 2 weeks ago

Hello,

Some information needed to use the open-source Oasis model effectively is missing:

  1. What is the maximum sequence length that the oasis500m model was trained on?
  2. Was the model trained with masking strategies like sliding window attention and/or transformer-XL style recurrence?
  3. How do you handle context length at inference time? Do you discard the oldest tokens in the KV cache once a maximum time horizon is reached, or do you stop generating once the KV cache is full?
  4. In the provided generate.py script, the noise schedule seems to apply a uniform noise level to all context tokens, approximately min(current step noise level, 300), rather than e.g. the pyramid scheduler described in the original diffusion forcing paper. Was this noise schedule selected heuristically or was it tuned specifically for this project, and is it the same schedule used for the live demo?

Thanks!

julian-q commented 1 week ago

Thanks for your detailed questions, @jxiong21029!

  1. 32 frames
  2. No, just trained on plain 32-frame sequences
  3. Yep we discard the tokens past the latest 32 frames!
  4. We experimented with different noise schedules and chose what works best. It seems that using a constant noise level for the context works well when you want to fully denoise each new frame one at a time. (As opposed to some use cases of Diffusion Forcing where you progressively denoise future frames.)