Birch-san / sdxl-play

BSD 3-Clause "New" or "Revised" License

Just want to get in touch #5

Closed TimothyAlexisVass closed 11 months ago

TimothyAlexisVass commented 11 months ago

I think you do interesting things. I've read your blog and, apparently, even played your very old games online many years ago.

Birch-san commented 11 months ago

that's nice to know 🙂 like a lot of Flash developers, I more or less had to stop making games when Adobe and Apple killed Flash. I tried a bit to make Web games, but the authoring tools just weren't good enough. I mostly like making games with complex engines or with concepts people haven't seen before. but the addictive nature of the games back then has evolved into a bit of a social problem now. if I make a game again, I will need to find a concept that respects the user's attention and time.

I saw your blog post today. nice analysis!

it's worth knowing that the SD and SDXL scaling_factor (0.18215 and 0.13025 respectively) are kinda… wrong. multiplying latents by this value will not give you a good approximation of a standard Gaussian. what's actually needed is a per-channel scale-and-shift.

OpenAI noticed this, although they computed the scale-and-shift a bit awkwardly (they apply the wonky scalar first, then compensate with a correction on top).
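to illustrate the idea: here's a minimal sketch of a per-channel scale-and-shift, contrasted with the single scalar SD/SDXL use. the `latent_mean`/`latent_std` numbers are made up for illustration; the real values are dataset-dependent, as discussed below.

```python
import torch

# hypothetical per-channel statistics of the latent distribution
# (illustrative numbers only, NOT measured values)
latent_mean = torch.tensor([0.5, -0.2, 0.1, -0.3])  # one value per channel
latent_std = torch.tensor([5.0, 4.5, 6.0, 5.5])

def normalize(latents: torch.Tensor) -> torch.Tensor:
    """Per-channel scale-and-shift: maps latents toward a standard Gaussian.
    Expects latents of shape [B, C, H, W]."""
    mean = latent_mean.view(1, -1, 1, 1)
    std = latent_std.view(1, -1, 1, 1)
    return (latents - mean) / std

def denormalize(latents: torch.Tensor) -> torch.Tensor:
    """Inverse of normalize(); apply before handing latents to the VAE decoder."""
    mean = latent_mean.view(1, -1, 1, 1)
    std = latent_std.view(1, -1, 1, 1)
    return latents * std + mean

# contrast with the single scalar SD/SDXL apply to every channel equally:
# latents = latents * 0.18215  # SD 1.x scaling_factor
# latents = latents * 0.13025  # SDXL scaling_factor
```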

This will have made it unnecessarily hard for SD to learn the latents. Perhaps it is in some part responsible for the yellow-bias you're seeing. but I have also heard that that's an aesthetic preference (apparently Android phones used a warm colour temperature for their screens at some point, because users prefer warm images, presumably skin tones).

collaborators and myself have measured the true scale-and-shift of SDXL latents, and it is dataset dependent. oxford-flowers has a different distribution to imagenet-1k.
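the measurement itself is simple once you have a batch of VAE-encoded latents from the dataset you care about; this sketch just reduces over the batch and spatial dims (the `vae.encode` line in the comment is the usual diffusers-style call, shown only as a usage hint):

```python
import torch

def measure_scale_and_shift(latents: torch.Tensor):
    """Estimate per-channel shift (mean) and scale (std) of encoded latents.
    `latents`: VAE-encoded latents, shape [N, C, H, W], drawn from the
    dataset whose distribution you want to match (e.g. oxford-flowers
    vs imagenet-1k will give different answers)."""
    # reduce over batch and spatial dims, keeping the channel dim
    mean = latents.mean(dim=(0, 2, 3))
    std = latents.std(dim=(0, 2, 3))
    return mean, std

# usage sketch (hypothetical):
# latents = vae.encode(images).latent_dist.sample()
# mean, std = measure_scale_and_shift(latents)
```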

it is interesting to see the theory that channel 3 may be taking more of the responsibility of encoding structure than the other channels. some of the clues here may be useful for deciding how to approach dynamic thresholding.

During inference, the values in the tensor begin with min < -30 and max > 30, and by the time of decoding the min/max boundary is around -4 to 4. At higher guidance_scale, the spread between min and max is larger.

looking at min/max can attach too much importance to outliers. you might be able to get more representative information about the ranges by looking at the per-channel 99th percentiles. this will still be image-dependent though (an image with a sky will be more invested in the cyan channel). you'll also want to see how it develops with each sampling step.
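something like this is what I mean; a small sketch using `torch.quantile` to get per-channel lower/upper percentiles, which you could log at every sampling step:

```python
import torch

def per_channel_quantiles(latents: torch.Tensor, q: float = 0.99):
    """Per-channel lower/upper q-quantiles of latents [B, C, H, W].
    Less sensitive to outliers than min/max."""
    b, c, h, w = latents.shape
    # flatten everything except the channel dim: [C, B*H*W]
    flat = latents.permute(1, 0, 2, 3).reshape(c, -1)
    lo = torch.quantile(flat, 1 - q, dim=1)
    hi = torch.quantile(flat, q, dim=1)
    return lo, hi

# e.g. call this on the denoised latents at each step to see how
# the per-channel ranges evolve over sampling:
# lo, hi = per_channel_quantiles(denoised_latents)
```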

the center_tensor thing reminds me a bit of the "mean drift" I was investigating:
https://twitter.com/Birchlabs/status/1632539251890966529
I found that centering the denoised latents on their means at each sampling step changed the colour temperature and added more high-frequency detail. maybe it influenced a channel responsible for texture. this biased the UNet into predicting more flowers/foliage.
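for concreteness, the centering I mean looks roughly like this (I'm assuming per-channel spatial means here; a single global mean is another option):

```python
import torch

def center_on_mean(denoised: torch.Tensor) -> torch.Tensor:
    """Subtract each channel's spatial mean from the denoised latents,
    as a per-step correction for mean drift. Shape: [B, C, H, W]."""
    return denoised - denoised.mean(dim=(2, 3), keepdim=True)
```

applied to the denoised prediction on every sampling step, before the sampler re-noises for the next step.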

regarding soft_clamp_tensor… I think it's hard to reason about thresholds, since what's normal in a noisy timestep can be abnormal in a low-noise timestep. but yes, I've tried setting per-channel boundaries in my attempts at dynamic thresholding, and it can help, but it's too much to configure and the boundaries will be image-dependent (like do you expect a sky to be present in the image).
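one way to express a per-channel soft clamp is with tanh; this is my own sketch, not necessarily what `soft_clamp_tensor` does. small values pass through almost unchanged, large values are smoothly compressed toward the per-channel bound:

```python
import torch

def soft_clamp(latents: torch.Tensor, bound: torch.Tensor) -> torch.Tensor:
    """Smoothly compress latents [B, C, H, W] toward per-channel limits.
    `bound`: positive per-channel limits, shape [C]. Output magnitude
    never exceeds the bound; values well inside it are barely affected."""
    b = bound.view(1, -1, 1, 1)
    return b * torch.tanh(latents / b)
```

the per-channel `bound` is exactly the part that's hard to configure: what's a sane bound depends on the timestep and on the image content.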

overall, I think you've found something that's useful for interactively mastering images. changing colours whilst the image is still latent does seem to help keep the colour choices globally coherent. it's interesting that it affects the structure of the image though. maybe you need to wait for a later timestep before you start modifying latents.
good work!

Birch-san commented 11 months ago

yup, saw that; thanks. 🙂 Birchlabs is probably the best way to cite me (matches my Twitter and the website). all the ML stuff on Birchlabs is by me (Alex), but yes that has caused confusion before. Jamie mostly works on React Native / Nativescript stuff and software for studying Japanese. but the name dates back to when we made games together.

I'm not familiar with Z-score or DBSCAN.

looks fairly consistent (cooler colour temperature), but I think the white point isn't consistent; some images are so bright that I think they lose dynamic range.

yup, added you on Discord and on LinkedIn. I mainly use Discord; LinkedIn's messenger hasn't been reliable for me.