basujindal / stable-diffusion

Optimized Stable Diffusion modified to run on lower GPU VRAM
Other
3.14k stars 469 forks source link

Fix #111: add helpful generation metadata #123

Open sowbug opened 2 years ago

sowbug commented 2 years ago

Fixes #111. EXIF wasn't exactly the right scheme; rather, XMP is recognized across both PNG and JPEG. An open issue is that https://github.com/commonsmachinery/tinyxmp is currently buggy (which is why I'm using my own fork), and it doesn't have a pip installer (there's an open issue for that - https://github.com/commonsmachinery/tinyxmp/issues/2).

But this seems to work, so I think it's a step forward to merge it. Thanks.

bakkot commented 2 years ago

I have a proposal at https://github.com/lstein/stable-diffusion/issues/266 for structuring metadata consistently across forks; would you consider adopting it? If there's other information you think should be included in the metadata I'm happy to update my spec.

fat-tire commented 2 years ago

This :point_up: as we're trying to standardize it across various apps/UIs -- the goal being to not only read metadata from image created by any app/UI but to be able to cross-share json between them as well.

sowbug commented 2 years ago

@bakkot see https://github.com/sd-webui/stable-diffusion-webui/discussions/283#discussioncomment-3586799. I like what you've done with structured metadata. I also believe there's value in having an imperfect solution today. I hope the community pursues both approaches. I don't have the cycles to invest in the "what," but my PR does help with the "how" by providing simple and portable code that anyone is welcome to adapt to whatever standardization emerges (assuming it uses XMP).

sowbug commented 2 years ago

(repeating some points from that other thread) I'm excited that Google Photos and Flickr already index the XMP Description field for searching (try https://flickr.com/search/?text=psychedelic%20mushrooms%20plms), and render it in their UI. This is why I hope that no matter which approach the community takes, we also stuff as much useful metadata as possible into an existing field like XMP Description.

Screenshot from 2022-09-07 08-28-16

Screenshot from 2022-09-07 08-26-26

AltoRetrato commented 2 years ago

Is there a specific reason for using JSON to serialize the data?

I'm currently storing metadata in JPEG images using IPTC (with https://github.com/james-see/iptcinfo3), but saving all parameters (even argv[0]) in the "caption/abstract", so you could copy and paste it into the command prompt to run it. Snippet:

parser.add_argument("--iptc", type=bool, nargs="?", default=True, help="save IPTC data to output image (only for JPG)")
# ...
fn = os.path.join(sample_path, "seed_" + str(opt.seed) + "_" + f"{base_count:05}.{opt.format}")
Image.fromarray(x_sample.astype(np.uint8)).save(
    fp = fn,
    quality = 95,
)
if opt.iptc and opt.format == "jpg":
    info = IPTCInfo(fn, force=True)
    cmd  = " ".join([f'"{x}"' if " " in x else x for x in sys.argv])
    info["caption/abstract"] = f"{cmd}{f' --seed {opt.seed}' if '--seed' not in cmd else ''}"[:2000]
    info["supplemental category"] = ["stable diffusion"]
    info.save(options="overwrite")

Result (as seen in Irfanview): image

I wonder if using this format (reproducing exactly how the parameters are used in the input of the program) would be better, because it would make it easier to reproduce the original image (instead of doing a lot of copy-and-paste and editing)...

sowbug commented 2 years ago

Is there a specific reason for using JSON to serialize the data?

The reason I picked JSON is that I didn't expect people to use the data as a single atomic unit, but rather to pick and choose the interesting arguments in their own invocations of txt2img.py. For one thing, it's dangerous to encourage people to triple-click/copy/paste into shell, because someone might throw something malicious in there like rm -rf /;. For another, there are certain arguments that really should be customized, such as outdir, and that could leak private data (outdir=/home/my_real_name_and_wifi_password/outputs). The data's there (with sensitive parts removed), but it's not urging you to blindly paste it into your shell.

I also want to include the checkpoint # of the model (e.g., ckpt-1.4 etc.). That's pretty important to a perfect reproduction of a given image. But it's not part of the command-line parameters; it's an aspect of the environment/configuration.

Finally, I wanted a format that was both human- and machine-readable. argv isn't really hard for a machine to parse, but JSON is a better format if you're thinking of it as a serialization format rather than a faithful representation of the exact command-line used to generate the image.

All that said, I don't have really strong feelings about it. As long as the data is embedded in the output images, I'm happy.

fat-tire commented 2 years ago

Funny, I've never heard of IPTC before. I think the idea of using json was that it was a near universal structured format that's human readable and can easily be imported/exported to anything in almost any language-- I guess XML is too old and wordy?

Also I think there is a need to include the hash of the weights model, as well as information about the actual app/repository/script being used, including its location, that wouldn't be included necessarily and could break if you tried to use the same prompt on a forked but popular version of stable diffusion. Or vice-versa, if you wanted to use a specialized prompt upstream you would need to know that you cannot and moreover where to get the correct application. So you really want to be able to include as much about what went into making the image as you can, and that may include GPU type, app info, the specific version of the model, etc. Which isn't likely contained in the prompt arguments.

I was originally thinking this could be done as a straight-up URI, but was convinced that would be too long and unwieldy. Anyway, I proposed an in-progress json schema here, just to get started.

Cheers!

AltoRetrato commented 2 years ago

For one thing, it's dangerous to encourage people to triple-click/copy/paste into shell, because someone might throw something malicious in there like rm -rf /;.

True! Removing argv[0] might help a bit, but such exploits will surely be done, anyway (no matter where or how program parameters are shared): "Hey, try this funny prompt: "Beautiful face, cyborg, cursing, %&#$";rm -rf /" - that could be in a JSON file and end up in a shell prompt...

I also want to include the checkpoint # of the model (e.g., ckpt-1.4 etc.). That's pretty important to a perfect reproduction of a given image. But it's not part of the command-line parameters; it's an aspect of the environment/configuration.

Agreed! But not only the model, there are so many forks/versions of stable diffusion available that fully describing the env/config is difficult (and that's one of the reasons for me to include argv[0] in the metadata).

Finally, I wanted a format that was both human- and machine-readable. argv isn't really hard for a machine to parse, but JSON is a better format if you're thinking of it as a serialization format rather than a faithful representation of the exact command-line used to generate the image.

In the end, JSON might be a better format for many reasons, even more if it could be directly used as the input parameters to the program: that would eliminate many of the risks of parameters exploitation. You could simply "load the image", either via GUI or command line, and add/change the parameters you want. So, adding this functionality (directly reading parameters from an image file) would also be important.

All that said, I don't have really strong feelings about it. As long as the data is embedded in the output images, I'm happy.

I think the worst part is HOW to embed the metadata: EXIF, IPTC and XMP are only a few of the standards, with varying degrees of support. I'm using IPTC just because my default image viewer supports it, but I'm not sure if it works as well with other programs / platforms.

fat-tire commented 2 years ago

For one thing, it's dangerous to encourage people to triple-click/copy/paste into shell, because someone might throw something malicious in there like rm -rf /;.

+1 and this is why sanitizing input will be super important...

True! Removing argv[0] might help a bit, but such exploits will surely be done, anyway (no matter where or how program parameters are shared): "Hey, try this funny prompt: "Beautiful face, cyborg, cursing, %&#$";rm -rf /" - that could be in a JSON file and end up in a shell prompt...

This is true of anyone copy/pasting prompts, so the more you can clean up all input data, the better

I also want to include the checkpoint # of the model (e.g., ckpt-1.4 etc.). That's pretty important to a perfect reproduction of a given image. But it's not part of the command-line parameters; it's an aspect of the environment/configuration.

The json schema being discussed includes a hash of and link to the model card/file, fwiw.

In the end, JSON might be a better format for many reasons, even more if it could be directly used as the input parameters to the program: that would eliminate many of the risks of parameters exploitation. You could simply "load the image", either via GUI or command line, and add/change the parameters you want. So, adding this functionality (directly reading parameters from an image file) would also be important.

That's exactly one fo the goals-- to try to reduce as much attack surface as possible... but I do acknowlege the security challenges, no matter which method is used-- and it will be up to anything handling this json to be extraordinarily careful.