cloneofsimo / minRF

Minimal implementation of scalable rectified flow transformers, based on SD3's approach
Apache License 2.0

Unused parameters #4

Open zaptrem opened 3 weeks ago

zaptrem commented 3 weeks ago

https://github.com/cloneofsimo/minRF/blob/261859e8b89a4cf5ab7eb35b4a4ffd8037c35ea1/advanced/mmdit.py#L161 https://github.com/cloneofsimo/minRF/blob/72feb0c87d435e9f9d220f34f348ed66c0b6ccec/advanced/mmdit.py#L86

These aren't used in the last layer and should be moved inside an `if not last` check. Unused parameters make some distributed algorithms slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
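
For reference, a minimal sketch of that fix, with hypothetical names (the actual identifiers in `advanced/mmdit.py` differ): create the modulation layers only when the block is not the last one, so DDP never tracks parameters that receive no gradient.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal sketch (hypothetical names): modulation params are created only
    when the block will actually use them, so DDP never sees unused parameters."""

    def __init__(self, dim: int, is_last: bool = False):
        super().__init__()
        self.is_last = is_last
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        if not is_last:
            # Only registered when gradients will actually flow through it.
            self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if not self.is_last:
            # cond: (B, dim) -> shift/scale: (B, 1, dim), broadcast over tokens
            shift, scale = self.adaln(cond).unsqueeze(1).chunk(2, dim=-1)
            h = h * (1 + scale) + shift
        return x + self.mlp(h)
```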

Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?
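
For comparison, here is roughly what the SD3-style conditioning path looks like: the vector fed to every block's AdaLN modulation is the timestep embedding plus a projection of the pooled text embedding. Names and dimensions below are illustrative, not taken from this repo.

```python
import torch
import torch.nn as nn

class CombinedCond(nn.Module):
    """Sketch of SD3-style conditioning (illustrative names/dims): the AdaLN
    conditioning vector is the timestep embedding plus a projection of the
    pooled text embedding."""

    def __init__(self, dim: int, freq_dim: int = 256, pooled_text_dim: int = 768):
        super().__init__()
        self.t_embed = nn.Sequential(
            nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.y_embed = nn.Sequential(
            nn.Linear(pooled_text_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, t_freq: torch.Tensor, pooled_text: torch.Tensor) -> torch.Tensor:
        # t_freq: sinusoidal timestep features, pooled_text: pooled text encoder vector.
        return self.t_embed(t_freq) + self.y_embed(pooled_text)
```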

Edit 2: Also also, did your muP optimization lead that far from a 1e-4 learning rate? Can you share the results of your hparam search?

cloneofsimo commented 2 weeks ago

Ah yes, you are correct.

> Edit: Also, (unless I misread your code) you seem to only put the timestep embedding in the AdaLN scale/shift thingy, but the SD3 paper also puts a vector made from the image description in there. Did you find the former worked better?

I just don't find the CLIP embedding useful when I run inference with it. Kinda my personal thing.

As for the learning rate: because muP divides the global learning rate by the input dimension, it's actually more like 1e-4 in practice for the fat layers. For biases or the input layer it's much larger, which is the rationale behind muP.
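
To make that scaling concrete, here is a toy per-parameter-group sketch of the rule (not the repo's actual muP setup): matrix-like weights get the base learning rate divided by fan-in, while biases and other vector-like parameters keep the full base rate.

```python
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float = 1e-2):
    """Toy illustration of the scaling rule (not the repo's actual muP setup):
    matrix-like weights get base_lr / fan_in, biases and other vector-like
    parameters keep the full base_lr."""
    groups, vector_like = [], []
    for p in model.parameters():
        if p.ndim >= 2:
            fan_in = p.shape[1]  # input dimension of the weight matrix
            groups.append({"params": [p], "lr": base_lr / fan_in})
        else:
            vector_like.append(p)
    if vector_like:
        groups.append({"params": vector_like, "lr": base_lr})
    return groups

# With a hypothetical base_lr of 0.1, a 1024-wide hidden weight ends up at
# roughly 1e-4, while biases stay at 0.1:
# optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=0.1))
```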