To use the open-source Oasis model effectively, some necessary information is missing from the documentation:
What is the maximum sequence length that the oasis500m model was trained on?
Was the model trained with masking strategies like sliding window attention and/or transformer-XL style recurrence?
How do you handle context length at inference time? Do you discard the oldest tokens in the KV cache after a maximum time horizon, or do you stop generating once the KV cache is full?
In the provided `generate.py` script, the noise schedule appears to apply a uniform noise level to all context tokens, approximately min(current step noise level, 300), rather than e.g. the pyramid scheduler described in the original Diffusion Forcing paper. Was this schedule selected heuristically or tuned specifically for this project, and is it the same schedule used in the live demo?
Yep we discard the tokens past the latest 32 frames!
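A minimal sketch of the rolling context this implies — the newest 32 frames are kept and anything older is dropped. The frame objects and the 100-step loop here are illustrative placeholders, not the actual `generate.py` logic:

```python
from collections import deque

MAX_CONTEXT_FRAMES = 32  # window size stated in the answer above

# A deque with maxlen drops the oldest entry automatically once full,
# mirroring how KV-cache entries for old frames would be discarded.
context = deque(maxlen=MAX_CONTEXT_FRAMES)

for step in range(100):
    new_frame = f"frame_{step}"  # placeholder for a generated latent frame
    context.append(new_frame)

assert len(context) == MAX_CONTEXT_FRAMES
assert context[0] == "frame_68"  # frames 0..67 have been discarded
```

In practice the KV cache for the attention layers would be truncated alongside the frame buffer, so memory and compute stay bounded regardless of rollout length.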
We experimented with different noise schedules and chose what works best. It seems that using a constant noise level for the context works well when you want to fully denoise each new frame one at a time. (As opposed to some use cases of Diffusion Forcing where you progressively denoise future frames.)
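A hedged sketch of the constant-context schedule described above: every context frame receives the same noise level, clamped at 300, while only the newest frame is denoised through the full schedule. The function name and values are illustrative assumptions, not the actual `generate.py` implementation:

```python
import numpy as np

MAX_CONTEXT_NOISE = 300  # clamp value mentioned in the question


def context_noise(current_step_noise: int, num_context_frames: int) -> np.ndarray:
    """Uniform noise level for all context frames: min(current noise, 300)."""
    level = min(current_step_noise, MAX_CONTEXT_NOISE)
    return np.full(num_context_frames, level)


# Early in denoising (high step noise), the context is clamped at 300:
assert context_noise(999, 4).tolist() == [300, 300, 300, 300]
# Late in denoising, the context follows the current step's noise level:
assert context_noise(50, 4).tolist() == [50, 50, 50, 50]
```

This contrasts with the pyramid schedule from the Diffusion Forcing paper, where noise levels vary across future frames; a flat context level suits the use case here of fully denoising one new frame at a time.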