To use the open-source Oasis model effectively, some necessary information is missing from the documentation:
What is the maximum sequence length that the oasis500m model was trained on?
Was the model trained with masking strategies like sliding window attention and/or transformer-XL style recurrence?
How do you handle context length at inference time? Do you discard the oldest tokens in the KV cache after a maximum time horizon, or do you stop generating once the KV cache is full?
In the provided `generate.py` script, the noise schedule appears to apply a uniform noise level to all context tokens, approximately min(current step noise level, 300), rather than e.g. the pyramid scheduler described in the original Diffusion Forcing paper. Was this schedule selected heuristically or tuned specifically for this project, and is it the same schedule used in the live demo?
Yep we discard the tokens past the latest 32 frames!
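A minimal sketch of the rolling context this implies — the newest 32 frames are kept and anything older is dropped. The frame objects and the 100-step loop here are illustrative placeholders, not the actual `generate.py` logic:

```python
from collections import deque

MAX_CONTEXT_FRAMES = 32  # window size stated in the answer above

# A deque with maxlen drops the oldest entry automatically once full,
# mirroring how KV-cache entries for old frames would be discarded.
context = deque(maxlen=MAX_CONTEXT_FRAMES)

for step in range(100):
    new_frame = f"frame_{step}"  # placeholder for a generated latent frame
    context.append(new_frame)

assert len(context) == MAX_CONTEXT_FRAMES
assert context[0] == "frame_68"  # frames 0..67 have been discarded
```

In practice the KV cache for the attention layers would be truncated alongside the frame buffer, so memory and compute stay bounded regardless of rollout length.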
We experimented with different noise schedules and chose what works best. It seems that using a constant noise level for the context works well when you want to fully denoise each new frame one at a time. (As opposed to some use cases of Diffusion Forcing where you progressively denoise future frames.)
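A hedged sketch of the constant-context schedule described above: every context frame receives the same noise level, clamped at 300, while only the newest frame is denoised through the full schedule. The function name and values are illustrative assumptions, not the actual `generate.py` implementation:

```python
import numpy as np

MAX_CONTEXT_NOISE = 300  # clamp value mentioned in the question


def context_noise(current_step_noise: int, num_context_frames: int) -> np.ndarray:
    """Uniform noise level for all context frames: min(current noise, 300)."""
    level = min(current_step_noise, MAX_CONTEXT_NOISE)
    return np.full(num_context_frames, level)


# Early in denoising (high step noise), the context is clamped at 300:
assert context_noise(999, 4).tolist() == [300, 300, 300, 300]
# Late in denoising, the context follows the current step's noise level:
assert context_noise(50, 4).tolist() == [50, 50, 50, 50]
```

This contrasts with the pyramid schedule from the Diffusion Forcing paper, where noise levels vary across future frames; a flat context level suits the use case here of fully denoising one new frame at a time.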