Alpha-VLLM / Lumina-T2X

Lumina-T2X is a unified framework for Text to Any Modality Generation

What does scale_watershed do in the precompute_freq_cis function? #73

Open yjhong89 opened 1 week ago

yjhong89 commented 1 week ago

Hi. Thanks for sharing this great work!

I wonder what the role of scale_watershed is in https://github.com/Alpha-VLLM/Lumina-T2X/blob/7bc7d7d70a20a262b4f04e873497f58f722aa224/lumina_next_t2i/models/model.py#L921 ?

ChrisLiu6 commented 1 week ago

In short, it is a watershed with respect to the diffusion time step: before it, the position embedding is linearly scaled, and after it, the position embedding is NTK-scaled.

More details: to make a model trained at 1K resolution generate images at 1.5K or higher resolutions, an extrapolation of the position embedding (i.e., RoPE in Lumina) is needed. We find that linear RoPE scaling leads to good global structure and composition, but nearby pixels tend not to be harmonious; in contrast, NTK scaling produces good local texture, but the global structure is often unreasonable. Therefore, we combine the two: linear scaling is applied in the initial diffusion steps to define the global composition (intuitively, like drawing a draft), and we then switch to NTK scaling for high-quality texture. It follows the same intuition as the method introduced in Sec. 2.2 of the Lumina-Next paper but usually behaves more stably.
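For reference, here is a minimal sketch of the two scaling schemes being contrasted (the function name `rope_freqs` and the two factor arguments are illustrative, not the repo's exact code): linear scaling divides the rotary frequencies, which is equivalent to compressing the positions, while NTK-aware scaling enlarges the rotary base `theta` instead.

```python
import torch

def rope_freqs(dim: int, end: int, theta: float = 10000.0,
               linear_factor: float = 1.0, ntk_factor: float = 1.0) -> torch.Tensor:
    # Standard RoPE frequency table with two extrapolation knobs:
    #   linear_factor > 1 divides the frequencies (compresses positions),
    #   ntk_factor > 1 enlarges the rotary base `theta` instead.
    theta = theta * ntk_factor
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) / linear_factor
    t = torch.arange(end, dtype=torch.float32)
    return torch.outer(t, freqs)  # (end, dim // 2) rotation angles
```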

The implementation of this method is very simple: https://github.com/Alpha-VLLM/Lumina-T2X/blob/7bc7d7d70a20a262b4f04e873497f58f722aa224/lumina_next_t2i/models/model.py#L944-L952
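Conceptually, the watershed just decides which of the two factors is active at the current denoising timestep. Below is a hedged, self-contained sketch of that switch (the `_sketch` suffix marks it as illustrative; it follows the shape of the linked code but is simplified and not a drop-in replacement):

```python
import torch

def precompute_freqs_cis_sketch(dim: int, end: int, theta: float = 10000.0,
                                scale_factor: float = 1.0,
                                scale_watershed: float = 1.0,
                                timestep: float = 1.0) -> torch.Tensor:
    # Assumes timesteps run from 0 at the start of sampling towards 1;
    # flip the comparison if your sampler uses the opposite convention.
    if timestep < scale_watershed:
        # Early steps: linear scaling -> coherent global composition.
        linear_factor, ntk_factor = scale_factor, 1.0
    else:
        # Later steps: NTK scaling -> sharper local texture.
        linear_factor, ntk_factor = 1.0, scale_factor
    theta = theta * ntk_factor
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim)) / linear_factor
    t = torch.arange(end, dtype=torch.float32)
    angles = torch.outer(t, freqs)
    # Complex-valued freqs_cis, as in Llama-style RoPE implementations.
    return torch.polar(torch.ones_like(angles), angles)
```

With, say, scale_watershed = 0.3, roughly the first 30% of the sampling trajectory would use linear scaling and the rest NTK scaling (an illustrative value, not a recommendation from the authors).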