Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Compound Concepts/Linguistic Grouping #222

Open Void2258 opened 1 year ago

Void2258 commented 1 year ago

A big issue with many generations is 'prompt bleeding'. Descriptors meant to be attached to one concept get applied to the whole image. For example, red car produces a car, which may be red, but the red also shows up throughout the image, not just on the car (on clothing, hair, eyes, and background objects that are not cars). While colors are the most common case, there are plenty of other situations where the inability to group words into a single concept causes problems. For example, attempting to make a dress made of vines will tend to put vines everywhere in the image, and possibly not on the dress at all, as the model does not have much training data with dresses and vines together.

The issue is that SD does not naturally understand noun and adjective attachment (i.e. it does not know that a single concept is stated with multiple words in a particular sequence, and treats words individually), and there is no way to indicate this grouping manually (parentheses have been used to group for emphasis, but not to group syntactically). To avoid breaking existing prompts, I propose {red car} to indicate that the red should apply only to the car and not to any other part of the image. This would also allow for more precise and detailed images, as it would become possible to specify different colors and styles for different parts of the image without cross influence, and to force grouping of concepts that the model might otherwise not understand as belonging together.

Example: Photo of a {{plate mail armored} rat}, standing on a {red car}, {modern street} with {{blue banners} flapping} in the wind, {cyberpunk crowd} walks past... This prompt would be hard to keep from becoming chaotic under the current system: the red, being mentioned early, would tend to go everywhere in the image rather than just sticking to the car (possibly even overriding the blue on the banners, unless you weight the blue, which then might overtake the red, and so on back and forth), while the plate mail armor could have difficulty being mapped to the rat (as this is not something common in the training set, most models would tend to produce an armored human and a rat separately).
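
To make the proposed syntax concrete, here is a rough sketch of how nested {...} groups could be parsed into a tree before anything is handed to the model. It only covers the parsing step. How a group would actually be kept from bleeding into the rest of the image is not specified in this proposal and is not attempted here. The names `Group`, `parse_prompt`, and `groups` are made up for illustration and are not part of the stablediffusion codebase.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class Group:
    """One {...} group (hypothetical type): plain-text pieces and nested groups."""
    children: List[Union[str, "Group"]] = field(default_factory=list)

    def text(self) -> str:
        """Flatten the group back to plain words, dropping the braces."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


def parse_prompt(prompt: str) -> Group:
    """Parse a prompt with nested {...} groups into a tree.

    The returned root stands for the whole prompt; ungrouped text sits
    directly under it as plain strings.
    """
    root = Group()
    stack = [root]          # innermost open group is stack[-1]
    buf: List[str] = []     # characters collected since the last brace

    def flush() -> None:
        if buf:
            stack[-1].children.append("".join(buf))
            buf.clear()

    for pos, ch in enumerate(prompt):
        if ch == "{":
            flush()
            group = Group()
            stack[-1].children.append(group)
            stack.append(group)
        elif ch == "}":
            flush()
            if len(stack) == 1:
                raise ValueError(f"unmatched '}}' at position {pos}")
            stack.pop()
        else:
            buf.append(ch)

    flush()
    if len(stack) != 1:
        raise ValueError("unclosed '{' in prompt")
    return root


def groups(node: Group, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (nesting depth, flattened text) for every group, outer before inner."""
    for child in node.children:
        if isinstance(child, Group):
            yield depth, child.text()
            yield from groups(child, depth + 1)
```

For the start of the example prompt, `list(groups(parse_prompt("Photo of a {{plate mail armored} rat}, standing on a {red car}")))` gives `[(0, 'plate mail armored rat'), (1, 'plate mail armored'), (0, 'red car')]`, so the armor is attached to the rat as one unit and the red car stays a separate group. Calling `.text()` on the root returns the prompt without braces, which is what existing pipelines would see.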

Void2258 commented 1 year ago

Cutoff modules essentially do something similar to what this is about, but in a very complex, more limited, and unintuitive way.