AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Feature Request]: custom depth2img map #5683

Open loboere opened 1 year ago

loboere commented 1 year ago

Is there an existing issue for this?

What would your feature do?

It would be interesting to be able to use a custom depth map for depth2img, for example one generated with programs like Blender or Maya, for better fidelity

Proposed workflow

an option to upload or drag in your own depth map

Additional information

No response

ClashSAN commented 1 year ago

you can do that with an extension: https://github.com/Extraltodeus/depthmap2mask/issues/17

AnonymousCervine commented 1 year ago

I've been playing with hacking this sort of functionality manually in the OG Stable Diffusion 2 code and have been having a blast with it. I thoroughly endorse this as a worthwhile thing to be able to support doing.

(My code is too messy to hand to anyone; it just loads depth from .EXRs filled with linear depth data and shunts their data in, in place of the generated MiDaS depth-maps, with some scaling code to make them vaguely the right shape. I don't think I've even done the math correctly, for all that the results are nice. And, of course, I'm working with the base 2.0 repo, not this one.)

@ClashSAN

Unless I am mistaken, the depthmap2mask extension operates fundamentally differently from the depth2img model?

The former is (to my understanding) a cute hack where you take depth information and convert it to an image to use as a regular mask to regular img2img—quite useful if, say, you want to change only the background or the foreground of an image!

By contrast, depth2img takes the depth info from MiDaS, converts it to a 64x64 float array normalized to the range -1 to 1, and feeds it to the depth2img model itself. It's thereby trained to maintain approximate structure, but not necessarily any particular portion (and indeed it can be set to retain none) of the colour/pixel info of the original image.

(Notably, from my experiments, even though the SD authors did not seem to have this in mind, one can give it totally-mismatching data—the depth from one image and the colour from another—and it will actually produce surprisingly lucid results from what little testing I've done so far on landscape/interior design scenes)
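
To make the above concrete, here's a minimal sketch (not my actual code, which is a mess) of what "shunt a custom depth map in" amounts to: read a linear-depth EXR, squash it down to the 64x64 latent resolution, and normalize it to [-1, 1] like the reference pipeline does with the MiDaS output. The `imageio` reader and the inverse-depth toggle are assumptions on my part; whether linear or inverse depth is the "correct" thing to feed it is exactly the bit of math I'm not sure I've gotten right.

```python
import imageio.v3 as iio  # assumes an EXR-capable imageio plugin is installed
import torch
import torch.nn.functional as F

def load_custom_depth(path, target=(64, 64), use_inverse_depth=True):
    """Rough sketch: turn a linear-depth EXR into a 1x1x64x64 tensor in [-1, 1]."""
    depth = iio.imread(path)                    # H x W (x C) float array
    if depth.ndim == 3:
        depth = depth[..., 0]                   # keep a single channel
    depth = torch.from_numpy(depth).float()[None, None]  # -> 1 x 1 x H x W

    if use_inverse_depth:
        # MiDaS predicts something disparity-like, so inverting linear depth
        # *may* be closer to what the model saw during training (unverified).
        depth = 1.0 / depth.clamp(min=1e-6)

    # Downsample to the latent resolution (64x64 for a 512x512 generation).
    depth = F.interpolate(depth, size=target, mode="bicubic", align_corners=False)

    # Per-image normalization to [-1, 1], mirroring what's done to the MiDaS map.
    d_min, d_max = depth.amin(), depth.amax()
    return 2.0 * (depth - d_min) / (d_max - d_min + 1e-8) - 1.0
```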

AnonymousCervine commented 1 year ago

Following up to give a very rough idea of what this could enable / what it looks like, taken from my own work.

(For context, I'm trying to make animations using this process, which is reasonably quixotic because of how SD responds to small perturbations in input—but the animation itself does incidentally show pretty well how closely it does-or-doesn't follow the depth data it's given!)

Note that my own code is definitely slightly-off in places, so there might be some stuff here that's sub-optimal in terms of the result.

**WARNING: Potentially eye-searing videos / epilepsy warning(!)**

Prompt: `"The Factory", Octane Render` (deliberately chosen to be vaguely similar to and compatible with, but still distinct from, the subject matter in the depth images, to see whether structural suggestion from the depth or subject suggestion from the text would dominate the result)

**Top-right** is the depth frames from a Blender scene (normalized to black and white for the purposes of this video).

**Top-left** is those depth frames + the prompt. Without an input img in the img2img, everything defaults to grey-ish if it's not nudged by the prompt. Dunno if that's an intrinsic limitation or if I'm doing it wrong (both are quite possible!)

**Bottom-left** is the img2img input, used in **Bottom-right** to give lighting guidance (and composition guidance, esp. for faraway depth values, e.g. through the windows, though that's not seen much in this scene). One will notice SD comes up with something more plausible when the sun is coming through the *windows* rather than through the *walls* like it does in the second video! But it tries its best to make sense of it anyway.

https://user-images.githubusercontent.com/118340091/207307225-700be7ba-c22b-4956-9e9b-f74902dec6f0.mp4

https://user-images.githubusercontent.com/118340091/207306886-0cc30042-5769-4ccf-bb99-99584a292ad8.mp4

ClashSAN commented 1 year ago

wow!! I haven't understood everything going on here; I haven't actually used the model.

So, usually with img2img you'd want a slight variation of your user-submitted input picture. You can specify your seed so your output draws some of its nature from an existing picture you found in SD.

Your demo shows 2 user inputs: depthmap, and user-submitted picture. Will the third (existing SD picture as random or fixed seed) factor have any effect?

I know existing img2img is merging 2 things, user-submitted picture and existing SD picture.

Is that bottom left picture an existing (fixed-seed) SD image, or user-submitted image?

AnonymousCervine commented 1 year ago

@ClashSAN

First, sorry for the belated response. I was going to render something to illustrate a point, but before I got to that my GPU broke(!), and it of course took a bit to figure out what I was replacing it with, and then... on the day I was setting up my new GPU my internet broke (and the telecommunications repair-people don't move too quickly, in those last couple weeks of the year, it seems).

Anyway!

To clarify: The bottom-left picture is a user-submitted image (though in this case it happens to be one previously made in SD); it's used as the input picture for the outputs in the bottom-right video.

As far as seeds go, everything here (that is, both the top-left and bottom-right sequences) is on the same seed, just with the depth input and image input varied. (So the bottom-right video is influenced by three things: the source image, the depth data, and the seed. By contrast, the top-left is only influenced by two things: the seed and the depth data.)

(Seeds' effects are clear as one would expect, even if not shown above. For comparison, here's the setup for the first frame of the first movie in the bottom-right above, but on some different seeds:

A few images:

![image](https://user-images.githubusercontent.com/118340091/210188386-723b7769-32b5-4d0d-9e83-871d1189d0d2.png)
![image](https://user-images.githubusercontent.com/118340091/210188405-01aed832-1c9a-4012-aaa0-61f38bcc261e.png)
![image](https://user-images.githubusercontent.com/118340091/210188265-4985bf94-e27b-41bd-99d5-515dd05c191a.png)

You'll notice for instance that they don't all share the movie seed's salmon/peach coloured interpretation of how to redraw the sunset scene)

As for the rest:

> So, usually with img2img [...] you can specify your seed so your output draws some of its nature from an existing picture you found in SD.

Without really trying it, I'm unsure how well this works in practice, since the depth2img model isn't the same model, and I'm not totally sure it does all the same stuff with the same seed? And/or if it does, it also takes a lot of instruction from the depth info (which makes it harder for me to eyeball just looking at a couple of pics).

If you started with something you made with depth2img, mind, sure. But, honestly, if it's small variations you're after, in any case it's probably more on-brand to use depth2img in the vanilla way (i.e. have it generate a depth map with MiDaS and let it use that to preserve macro features)? Having a custom map is more useful for a mix-and-match sort of affair.

I may try to whip together an example of what a more-normal person might use it for, later.

AnonymousCervine commented 1 year ago

On a technical level, I'm probably capable of implementing this if no one else wants to?

What I'm most unclear on isn't the stuff at the heart of the matter, but instead:

A) how it would fit into the current UI (noting also that the options to input or output a depth map only make sense when this one model is in use)
B) whether the demand for it is esoteric enough that it should be an extension anyway

Terapixel commented 1 year ago

Hi @AnonymousCervine! Your examples with "The Factory" are amazing!

> On a technical level, I'm probably capable of implementing this if no one else wants to?

It would be great to have an extension for custom depth maps. I've been waiting for it since the depth model was added to the A1111 WebUI. It would be very useful for 3D artists like me, because a depth map from 3D software has much more detail than an automatically generated map from an image.

> A) how it would fit into the current UI (noting also that the options to input or output a depth map only make sense when this one model is in use)

I think it can be just one additional element in the script section of the img2img tab: an image uploader for depth. So we upload the color reference image, the mask, and the other settings from the original UI, and the additional depth image through the image uploader in the script section.
It would also be great if it worked for Batch img2img.

And in my opinion, a warning text saying "only works with 512-depth-ema.ckpt" is enough.

AnonymousCervine commented 1 year ago

Alright, I'll take a shot at it!

> Because a depth map from 3D software has much more detail

Just know if you're doing this (er... once it's possible) that the depth map gets shrunk to 64x64 before being fed to the algorithm. (That's mostly okay in my limited testing—like, you'll notice those thin desk-legs generally didn't disappear—in effect, it's seemingly quite capable of reasonably extrapolating details from a fuzzy, scaled-down depth image. But still.)

Terapixel commented 1 year ago

> Alright, I'll take a shot at it!

Thanks!!!!! I'll be waiting for it!

> depth map gets shrunk to 64x64 before being fed to the algorithm

I didn't know that :( But your results with custom depth are already more detailed than the autogenerated ones (for example, I took your generated "factory" image as input to depth2img, and a lot of the desk legs are gone). Also, custom depth from 3D software should work better for depth consistency in image sequences.

AnonymousCervine commented 1 year ago

@Terapixel @loboere

Okay, I've made an extension (let me know if anything breaks, if something is obviously missing from what you expected, etc.):

https://github.com/AnonymousCervine/depth-image-io-for-SDWebui

The way I've implemented this is somewhat ugly code-wise. I think that's necessary for a sane implementation right now; I'll try and look into whether there are any reasonable changes, either to my code or to the main repo's, that might make it less awful.
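
For the curious, the general shape of a selectable img2img script with an extra image input is roughly the sketch below. This is not the extension's actual code; the interesting part, swapping the uploaded depth in for the MiDaS estimate during conditioning, is exactly the part that's omitted, because that's where the ugliness lives.

```python
import gradio as gr
from modules import scripts
from modules.processing import process_images

class Script(scripts.Script):
    def title(self):
        return "Custom depth input (sketch)"

    def show(self, is_img2img):
        # Only offer this in the img2img tab's script dropdown.
        return is_img2img

    def ui(self, is_img2img):
        # One extra element in the script section: an image uploader for depth.
        depth_image = gr.Image(label="Depth map", type="pil")
        return [depth_image]

    def run(self, p, depth_image):
        # The real version would convert depth_image to the 64x64 conditioning
        # tensor and substitute it for the MiDaS estimate before sampling runs;
        # that substitution is omitted here.
        return process_images(p)
```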

AnonymousCervine commented 1 year ago

...I've fixed a bug that was in the first release.

Aside from that: tentatively, I may need to raise a separate issue for something this extension has revealed to me, which I wasn't sure was even an issue before: the way depth2img is set up in the main repo right now does not interact ideally with inpainting.

Okay, so...

A small rambling with images attached:

Consider the following input image: a regular txt2img generation of a boy and an, um, "what the heck is that thing":

![download (33)](https://user-images.githubusercontent.com/118340091/212168865-55caf920-ea83-4fe2-8648-c6c898e34d1e.png)

And its corresponding MiDaS-estimated depth...

![randommidasdepthimage](https://user-images.githubusercontent.com/118340091/212168748-2227a003-751d-4f5c-86f6-9b2775e7fa5e.png)

...and an inpaint mask drawn over to try and replace the mystery creature with something else:

![download (30)](https://user-images.githubusercontent.com/118340091/212169306-6a7debe9-66c8-4e9e-9f14-95fd610ab2d3.png)

Okay, so, ideally, we would want depth2img to *use the depth info of the original image* to do inpainting, because that's seemingly the entire original motivation of the depth2img model (to keep compositional information when redrawing). So I punch in "A boy and his cute, giant dog [bunch of style-words]" and hit generate, and the default implementation gives me:

![download (31)](https://user-images.githubusercontent.com/118340091/212169910-399fd454-1fba-4b12-88e5-6f496141cf03.png)

Um? That looks nice and all, but is that based on the depth data?

![randommidasdepthimage](https://user-images.githubusercontent.com/118340091/212168748-2227a003-751d-4f5c-86f6-9b2775e7fa5e.png)

That... no, not really, it seems to be ignoring it. Well, maybe it's using it if I squint...?

Okay, but maybe that's just an inherent limitation for some reason. So, I plug in my override with the MiDaS depth, otherwise on the same seed and prompt and everything:

![image](https://user-images.githubusercontent.com/118340091/212170331-d8bfe8e3-6e61-477f-b9b7-4fd20c285fd7.png)

Why, hello there, shape-that-I-recognize! You *can* do it! Good boy!

...Anyway, I haven't followed the path the inpaint operation takes through the code at all at the time of this writing, but it does seem to be messing up somewhere. It feels... kind of like a bug? Will mull over it, may post an issue later.

**EDIT** An example where I masked off more of the image to give it fewer composition hints from the remaining image, where the contrast in results is WAY more obvious:

![image](https://user-images.githubusercontent.com/118340091/212171752-773ad8c2-fc69-477b-9a57-72e8e7963007.png)

![image](https://user-images.githubusercontent.com/118340091/212171796-38b88104-9b3d-414c-9b31-bba8a3f94ace.png)

(although lol that second image looks ominous. I said "a boy and his cute dog", SD, not "a nightmare of a boy and his terrifying dog"...! XD)

Terapixel commented 1 year ago

@AnonymousCervine Thank you a lot! It works!! Did some quick tests. I'll try something else later!

https://user-images.githubusercontent.com/7255910/212202283-147f09f7-d241-4feb-ab3f-8bda8277eeb9.mp4

AnonymousCervine commented 1 year ago

@Terapixel:

Belatedly: Thank you for sharing! I'm glad if it was at all helpful.

Again, if you find any problems or note anything obviously missing from a good workflow, do feel free to mention it in the repo for the extension. (I'm sorry that batching wasn't in the initial release, incidentally. Did you... put those video frames in by hand? If so, I'm doubly sorry; if not, I'd be interested in knowing what you used!)

Terapixel commented 1 year ago

> @Terapixel:
>
> Belatedly: Thank you for sharing! I'm glad if it was at all helpful.
>
> Again, if you find any problems or note anything obviously missing from a good workflow, do feel free to mention it in the repo for the extension. (I'm sorry that batching wasn't in the initial release, incidentally. Did you... put those video frames in by hand? If so, I'm doubly sorry; if not, I'd be interested in knowing what you used!)

Yes, your script is very helpful! And yes, I put those video frames in by hand, and I still do this in other projects)) So batch is my dream! It would also be really helpful if, along with the depth batch, you could add batch upload for mask images. A1111 doesn't have this feature (it would also be helpful if mask batch worked with any model, not only depth).

AnonymousCervine commented 1 year ago

@Terapixel

FYI, batching should now be implemented (though not for inpaint yet; that's still on my radar. Feel free to open a feature request on the extension's repo for that if you want to prod me. Feedback on whether it's working for your needs or not is welcome, also!)

joeyism commented 1 year ago

@Terapixel I'm trying to get it to work, but I can't figure it out. Where did you put in the color reference for the generation?

enn-nafnlaus commented 1 year ago

IMHO, while this script is great, I think the notion is fundamentally flawed. Since we can only run one script at a time, this means that we can't provide custom depth maps to be used by other scripts, which IMHO is a pretty big limitation. The depth map should go onto the main txt2img and img2img pages, not in a script.

Again, IMHO.

AnonymousCervine commented 1 year ago

@enn-nafnlaus

Honestly, even as the author of the script in question, I generally agree with that assessment.

And in the first place there's ample reason for it to be a first-class feature of sd-webui (that is, in this repo, rather than in an extension). It is, at the very least, an obvious consequence of a model released by Stability AI. (I do, accordingly, believe it's appropriate that this issue is still open!)

I just didn't feel familiar enough with sd-webui to sort out how to handle the not-insignificant portion of the UI that would need to be modified (only when a depth2img model is selected, and pretty much anywhere that a prompt could be entered, maybe even including the training tabs in a perfect world).

And since then I've added some limited batching functionality to the extension, which complicates things even a quibble more, of course! (Though that functionality could be separated, in a pinch.)

Oh, AND there's the tricky bit where I'm not-at-all confident it would play nicely with all scripts. Outpainting, for example, might well be a disaster out of the box. (Although inpainting probably already breaks on some settings for the same reason). But, enough should work to make it desirable.

Anyway!

The halfway-version would probably be me reworking the extension to jam the depth replacement input into various places in the base UI (as you describe), rather than having it as a script, which I think is all technically-supported by the extensions mechanism (that definitely wouldn't make things cleaner, but it would seem to solve the script-incompatibility problem).
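
(In webui terms, a rough sketch of that halfway version, assuming the current extension API: the script stops being a dropdown selection and instead returns `scripts.AlwaysVisible` from `show()`, puts its inputs in an accordion on the main tabs, and does its work in the `process()` hook so it can coexist with whatever script is actually selected. Skeleton only, with the actual depth substitution left out:)

```python
import gradio as gr
from modules import scripts

class DepthOverride(scripts.Script):
    def title(self):
        return "Custom depth override (sketch)"

    def show(self, is_img2img):
        # AlwaysVisible: rendered on the main txt2img/img2img pages instead of
        # occupying the one-at-a-time script dropdown slot.
        return scripts.AlwaysVisible

    def ui(self, is_img2img):
        with gr.Accordion("Custom depth", open=False):
            enabled = gr.Checkbox(label="Enable custom depth", value=False)
            depth_image = gr.Image(label="Depth map", type="pil")
        return [enabled, depth_image]

    def process(self, p, enabled, depth_image):
        if not enabled or depth_image is None:
            return
        # Here the custom depth would be stashed on `p` (or swapped into the
        # conditioning) so the normal pipeline, and any selected script, picks
        # it up. Omitted; that's the part I'd still have to make less awful.
```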

enn-nafnlaus commented 1 year ago

Thanks for the response! Well, if you ever do manage to do it, know that it'll be cheered :)

loboere commented 1 year ago

check this out https://github.com/lllyasviel/ControlNet

enn-nafnlaus commented 1 year ago

> check this out https://github.com/lllyasviel/ControlNet

That doesn't look like an Automatic1111 extension? And thus no way to use Automatic1111 scripts with it?

loboere commented 1 year ago

It is not an extension; it is depth2img and more for SD 1.5! The interface code is easy; it would be good to add a custom upload for each of these modalities.