cloneofsimo / minRF

Minimal implementation of scalable rectified flow transformers, based on SD3's approach
Apache License 2.0

Unused parameters #4

Open zaptrem opened 3 weeks ago

zaptrem commented 3 weeks ago

https://github.com/cloneofsimo/minRF/blob/261859e8b89a4cf5ab7eb35b4a4ffd8037c35ea1/advanced/mmdit.py#L161 https://github.com/cloneofsimo/minRF/blob/72feb0c87d435e9f9d220f34f348ed66c0b6ccec/advanced/mmdit.py#L86

These aren't used in the last layer and should be moved inside an `if not last` check. Unused parameters make some distributed algorithms slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
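
For reference, a minimal sketch of that fix, with hypothetical names (the actual identifiers in `advanced/mmdit.py` differ): create the modulation layers only when the block is not the last one, so DDP never tracks parameters that receive no gradient.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal sketch (hypothetical names): modulation params are created only
    when the block will actually use them, so DDP never sees unused parameters."""

    def __init__(self, dim: int, is_last: bool = False):
        super().__init__()
        self.is_last = is_last
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        if not is_last:
            # Only registered when gradients will actually flow through it.
            self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if not self.is_last:
            # cond: (B, dim) -> shift/scale: (B, 1, dim), broadcast over tokens
            shift, scale = self.adaln(cond).unsqueeze(1).chunk(2, dim=-1)
            h = h * (1 + scale) + shift
        return x + self.mlp(h)
```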

Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?
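
For comparison, here is roughly what the SD3-style conditioning path looks like: the vector fed to every block's AdaLN modulation is the timestep embedding plus a projection of the pooled text embedding. Names and dimensions below are illustrative, not taken from this repo.

```python
import torch
import torch.nn as nn

class CombinedCond(nn.Module):
    """Sketch of SD3-style conditioning (illustrative names/dims): the AdaLN
    conditioning vector is the timestep embedding plus a projection of the
    pooled text embedding."""

    def __init__(self, dim: int, freq_dim: int = 256, pooled_text_dim: int = 768):
        super().__init__()
        self.t_embed = nn.Sequential(
            nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.y_embed = nn.Sequential(
            nn.Linear(pooled_text_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, t_freq: torch.Tensor, pooled_text: torch.Tensor) -> torch.Tensor:
        # t_freq: sinusoidal timestep features, pooled_text: pooled text encoder vector.
        return self.t_embed(t_freq) + self.y_embed(pooled_text)
```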

Edit 2: Also also, did your muP optimization lead that far from a 1e-4 learning rate? Can you share the results of your hparam search?

cloneofsimo commented 2 weeks ago

Ah yes, you are correct.

> Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?

I just don't find the CLIP embedding useful when I run inference with it. Kinda my personal thing.

As for the learning rate: because muP divides the global learning rate by the input dimension, it's actually more like 1e-4 in practice for the fat layers. For biases or the input layer it's much larger, which is the rationale behind muP.
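
To make that scaling concrete, here is a toy per-parameter-group sketch of the rule (not the repo's actual muP setup): matrix-like weights get the base learning rate divided by fan-in, while biases and other vector-like parameters keep the full base rate.

```python
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float = 1e-2):
    """Toy illustration of the scaling rule (not the repo's actual muP setup):
    matrix-like weights get base_lr / fan_in, biases and other vector-like
    parameters keep the full base_lr."""
    groups, vector_like = [], []
    for p in model.parameters():
        if p.ndim >= 2:
            fan_in = p.shape[1]  # input dimension of the weight matrix
            groups.append({"params": [p], "lr": base_lr / fan_in})
        else:
            vector_like.append(p)
    if vector_like:
        groups.append({"params": vector_like, "lr": base_lr})
    return groups

# With a hypothetical base_lr of 0.1, a 1024-wide hidden weight ends up at
# roughly 1e-4, while biases stay at 0.1:
# optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=0.1))
```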