Alpha-VLLM / Lumina-T2X

Lumina-T2X is a unified framework for Text to Any Modality Generation

What does scale_watershed do in the precompute_freq_cis function? #73

Open yjhong89 opened 1 week ago

yjhong89 commented 1 week ago

Hi. Thanks for sharing this great work!

I wonder what the role of scale_watershed is in https://github.com/Alpha-VLLM/Lumina-T2X/blob/7bc7d7d70a20a262b4f04e873497f58f722aa224/lumina_next_t2i/models/model.py#L921 ?

ChrisLiu6 commented 1 week ago

In short, it is a watershed with respect to the diffusion time step: before it, the position embedding is linearly scaled, and after it, the position embedding is NTK-scaled.

More details: to make a model trained at 1K resolution generate images at 1.5K or higher resolutions, an extrapolation of the position embedding (i.e., RoPE in Lumina) is needed. We find that linear RoPE scaling leads to good global structure and composition, but nearby pixels tend not to be harmonious; in contrast, NTK scaling produces good local texture, but the global structure is often unreasonable. Therefore, we combine the two: linear scaling is applied in the initial diffusion steps to define the global composition (intuitively, like drawing a draft), and we then switch to NTK scaling for high-quality texture. It follows the same intuition as the method introduced in Sec. 2.2 of the Lumina-Next paper but usually behaves more stably.
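For reference, here is a minimal sketch of the two scaling schemes being contrasted (the function name `rope_freqs` and the two factor arguments are illustrative, not the repo's exact code): linear scaling divides the rotary frequencies, which is equivalent to compressing the positions, while NTK-aware scaling enlarges the rotary base `theta` instead.

```python
import torch

def rope_freqs(dim: int, end: int, theta: float = 10000.0,
               linear_factor: float = 1.0, ntk_factor: float = 1.0) -> torch.Tensor:
    # Standard RoPE frequency table with two extrapolation knobs:
    #   linear_factor > 1 divides the frequencies (compresses positions),
    #   ntk_factor > 1 enlarges the rotary base `theta` instead.
    theta = theta * ntk_factor
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) / linear_factor
    t = torch.arange(end, dtype=torch.float32)
    return torch.outer(t, freqs)  # (end, dim // 2) rotation angles
```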

The implementation of this method is very simple: https://github.com/Alpha-VLLM/Lumina-T2X/blob/7bc7d7d70a20a262b4f04e873497f58f722aa224/lumina_next_t2i/models/model.py#L944-L952
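Conceptually, the watershed just decides which of the two factors is active at the current denoising timestep. Below is a hedged, self-contained sketch of that switch (the `_sketch` suffix marks it as illustrative; it follows the shape of the linked code but is simplified and not a drop-in replacement):

```python
import torch

def precompute_freqs_cis_sketch(dim: int, end: int, theta: float = 10000.0,
                                scale_factor: float = 1.0,
                                scale_watershed: float = 1.0,
                                timestep: float = 1.0) -> torch.Tensor:
    # Assumes timesteps run from 0 at the start of sampling towards 1;
    # flip the comparison if your sampler uses the opposite convention.
    if timestep < scale_watershed:
        # Early steps: linear scaling -> coherent global composition.
        linear_factor, ntk_factor = scale_factor, 1.0
    else:
        # Later steps: NTK scaling -> sharper local texture.
        linear_factor, ntk_factor = 1.0, scale_factor
    theta = theta * ntk_factor
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) / linear_factor
    t = torch.arange(end, dtype=torch.float32)
    angles = torch.outer(t, freqs)
    # Complex-valued freqs_cis, as in Llama-style RoPE implementations.
    return torch.polar(torch.ones_like(angles), angles)
```

With, say, scale_watershed = 0.3, roughly the first 30% of the sampling trajectory would use linear scaling and the rest NTK scaling (an illustrative value, not a recommendation from the authors).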