IrisRainbowNeko / DreamArtist-sd-webui-extension

DreamArtist for Stable-Diffusion-webui extension
Apache License 2.0

Any successful result replication? #18

Open bycloudai opened 1 year ago

bycloudai commented 1 year ago

Hey guys, I'm just wondering whether anyone has successfully replicated the single-image embedding and recreated results similar to 7eu7d7's? So far I've had no luck testing it myself.

Training the embedding takes around 2.5 hours on my 3090 for 8000 steps, and the results only faintly resemble that one training image.

78Alpha commented 1 year ago

Haven't had any great results myself, but that is testing on my own style, which seems to be adjacent to most of the models.

It's odd that you're at 2.5 hours; I'm at 40 minutes per 4000 steps.

For replication it may also matter whether xformers was enabled for textual inversion; if it was, that alone may make a 1-to-1 reproduction impossible.

Edit: adding comparison images.

Original: [image: 00001-0-Tau Karma Good Square (20220605063437)]

What it generates: [image: 00014-1909859057]

JaredS215 commented 1 year ago

I've tried with 15 images and with 5 images and haven't had any success getting it to learn a person. I think there are far too many variables with no explanation of what they should be set to.

henryvii99 commented 1 year ago

The embedding can influence image generation, but it fails to replicate my waifu. It probably needs more testing.

zhupeter010903 commented 1 year ago

I used 22 images, but they are basically one image flipped and cropped to different sizes, with some small variations, for example with or without a hat. I got decent results using 6 vectors for both the positive and negative prompts, learning rate 0.0025, CFG scale 3, reconstruction loss weight 1, negative lr weight 1, a custom prompt template that does not use filewords, image size 384x384, and 10000 steps.
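
Collected in one place, that run looks roughly like this (the keys just paraphrase the UI labels; this is not a config format the extension actually reads):

```python
# Settings reported above, gathered for easy copying.
# Key names paraphrase the webui/DreamArtist UI fields; they are not real config keys.
settings = {
    "vectors_per_token": (6, 6),          # positive, negative
    "learning_rate": 0.0025,
    "cfg_scale": 3,
    "reconstruction_loss_weight": 1,
    "negative_lr_weight": 1,
    "prompt_template": "custom, no [filewords]",
    "image_size": (384, 384),
    "max_steps": 10000,
}
```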

JaredS215 commented 1 year ago

> I used 22 images, but they are basically one image flipped and cropped to different sizes, with some small variations, for example with or without a hat. I got decent results using 6 vectors for both the positive and negative prompts, learning rate 0.0025, CFG scale 3, reconstruction loss weight 1, negative lr weight 1, a custom prompt template that does not use filewords, image size 384x384, and 10000 steps.

I tried without filewords in the template, and it copied the backgrounds so heavily that sometimes it would just generate a landscape without any people at all. That was after 6000 steps or so, before I canceled.

zhupeter010903 commented 1 year ago

> > I used 22 images, but they are basically one image flipped and cropped to different sizes, with some small variations, for example with or without a hat. I got decent results using 6 vectors for both the positive and negative prompts, learning rate 0.0025, CFG scale 3, reconstruction loss weight 1, negative lr weight 1, a custom prompt template that does not use filewords, image size 384x384, and 10000 steps.
>
> I tried without filewords in the template, and it copied the backgrounds so heavily that sometimes it would just generate a landscape without any people at all. That was after 6000 steps or so, before I canceled.

My image has a simple white background, so maybe that's why. Also, 7eu7d7 once mentioned that this algorithm works best with clip skip set to 1 instead of 2, which most people seem to use nowadays. I've always used clip skip 2, but maybe you can check that.

I should also mention that even though the negative prompt seems to improve general quality, it can also cause mosaic patterns or distorted shapes in the background, and distorted hands in my example.

Edit: I should add that I skipped filewords only because it always threw an error when I tried to use them.

bycloudai commented 1 year ago

I tried training with a single image; weirdly, I don't see any obvious visual improvements.

Every parameter here is the same except for using AnythingV3 as the model: clip skip 1, AnythingV3 VAE, no hypernetwork, xformers on, reconstruction on.

The prompt template file is: [name] with purple eyes and purple hair wearing a purple kimono outfit standing in a field of flowers with a purple sword in her hand and a purple butterfly flying around her, by Masaaki Sasamoto

The training image is the character Raiden Shogun from Genshin Impact.

My preview results every 500 steps: https://imgur.com/a/3qe6Fb3 and my loss curve: https://imgur.com/a/unEBSD2

Nothing resembles the training image; most notably, the subject is standing in the training image but sitting in most previews.

I'm attempting to replicate Nahida now, since working on my own image has failed several times.

zhupeter010903 commented 1 year ago

> I tried training with a single image; weirdly, I don't see any obvious visual improvements.
>
> Every parameter here is the same except for using AnythingV3 as the model: clip skip 1, AnythingV3 VAE, no hypernetwork, xformers on, reconstruction on.
>
> The prompt template file is: [name] with purple eyes and purple hair wearing a purple kimono outfit standing in a field of flowers with a purple sword in her hand and a purple butterfly flying around her, by Masaaki Sasamoto
>
> The training image is the character Raiden Shogun from Genshin Impact.
>
> My preview results every 500 steps: https://imgur.com/a/3qe6Fb3 and my loss curve: https://imgur.com/a/unEBSD2
>
> Nothing resembles the training image; most notably, the subject is standing in the training image but sitting in most previews.
>
> I'm attempting to replicate Nahida now, since working on my own image has failed several times.

This is just my experience, but you probably shouldn't include any feature you want the model to learn in the template.
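
For illustration, the difference I mean looks something like this (the [name] placeholder is the webui's standard template token; the exact wording is only an example):

```text
Too descriptive - the features end up in the template instead of the embedding:
  [name] with purple eyes and purple hair wearing a purple kimono, standing in a field of flowers

Minimal - everything you want learned is left to [name]:
  a portrait of [name]
```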

knoopx commented 1 year ago

Tried it today on myself and got similar results: some generations would just render a landscape, and others show a lot of identity loss (3/6 vectors, 0.005 lr, 3000 steps).

henryvii99 commented 1 year ago

For those who can generate valid images in the training logs but fail to replicate them in txt2img: say your file is called zzzz1234, then your prompt is "art by zzzz1234", not just the file name. You can see this prompt during training.
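
In other words, with the hypothetical name above, the txt2img prompt should mirror the template line used during training:

```text
art by zzzz1234    <- matches what the previews were generated with
zzzz1234           <- embedding alone; may not reproduce the previews
```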

bycloudai commented 1 year ago

I tried training with a single image again, in an attempt to replicate the ani-nahida results that 7eu7d7 shows in the README. I used the exact same image for training and named this embedding ani-greengod.

Every parameter is the same except for using AnythingV3 as the model: clip skip 1, AnythingV3 VAE, no hypernetwork, xformers on, reconstruction on. I also followed @zhupeter010903's suggestion and used an empty template.

The results definitely don't look like the ani-nahida.pt that 7eu7d7 provided.

My preview results for every 500 steps: https://imgur.com/a/LyQqYmM and the results of my replication attempt: https://imgur.com/a/nUxRnjh

Here's a comparison using only the TI, without any other text prompt: [image: xy_grid-0057-50-5-None-DDIM-3360683758-6569e224-20221116151144]

cookedfuwan commented 1 year ago

Same for me: 5000 steps and all I got is something that resembles the girl, but none of the detail shown in the README. If this isn't going to work, that's honestly quite sad.

bladedsupernova commented 1 year ago

> Same for me: 5000 steps and all I got is something that resembles the girl, but none of the detail shown in the README. If this isn't going to work, that's honestly quite sad.

Guys, I've gotten it to work great, even for the Mario video game case and cool aliens by H.R. Giger, giving me similar images!

Take a look at my results; I have very good ones, although I think all I can post here is safe-for-work images. Anyway, here is one set for show:
Input pic: https://ibb.co/hZrs0jq
Similar images: https://ibb.co/dkkW0qy
Very HQ images, kind of similar: https://ibb.co/nC85xGp

All you do is go here: https://www.kaggle.com/code/miolovers1/stable-diffusion-automatic1111 and replace the upper !git clone line with this: !git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui (the link to the app appears in the second-to-last cell, believe it or not).

bladedsupernova commented 1 year ago

OK, see the comment above; I had to edit it a lot, sorry. Let me know if you have seen this comment, in case you read it before the edits.

Oh, and the link appears in the second-to-last cell, believe it or not; it brings you to the app!

78Alpha commented 1 year ago

README settings row for ani-g: ani-g | animefull-latest | | 3, 10 | 1500 | 0.003 | 5

The original image in the GitHub repo: [image: 00000-0-g]

My closest result to the final output: [image: 00001-3129058567]

Stuck in a painting style for all of them.

Following the "S*, light blue hair, forest, blue butterfly, cat ears, flowers, dress" example, using the same model as well.

bladedsupernova commented 1 year ago

See my message above, my results look good.

bladedsupernova commented 1 year ago

One of my best HD 3D ones (3D!), made with img2img from 2D: https://ibb.co/XXpP9mN. I am adding prompts to get results.

What's interesting is that it does seem to be using the images; see how her hands are up the same way in two pics? Maybe that's just because the parameters are similar, though. But you can also get it heavily modified; notice she is quite different in another photo, with different clothes, etc.

bycloudai commented 1 year ago

@bladedsupernova do you mind sharing your exact parameters in training the embeddings?

IrisRainbowNeko commented 1 year ago

The original instructions are quite rough, so you may not be using DreamArtist correctly. Try following the new instructions.

zhupeter010903 commented 1 year ago

> The original instructions are quite rough, so you may not be using DreamArtist correctly. Try following the new instructions.

Is there any suggested value for the reconstruction loss weight and negative lr weight?

Edit: and also for template files?

IrisRainbowNeko commented 1 year ago

> > The original instructions are quite rough, so you may not be using DreamArtist correctly. Try following the new instructions.
>
> Is there any suggested value for the reconstruction loss weight and negative lr weight?
>
> Edit: and also for template files?

Reconstruction loss weight and negative lr weight can both be set to 1.0. In fact, you can get decent results without reconstruction in most cases; adding reconstruction is much slower and the improvement is not very large.

For template files, it is better to use the version without filewords.
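
For intuition, here is a rough sketch of how I read those two knobs; this is only an illustration of the idea, not the extension's actual code, and all names are made up:

```python
import torch

def training_step(denoise_loss, rec_loss, pos_emb, neg_emb, lr,
                  rec_weight=1.0, neg_lr_weight=1.0):
    """Illustrative SGD step showing where the two weights could enter.

    denoise_loss: the usual latent-diffusion loss
    rec_loss:     extra reconstruction loss on the decoded image
                  (only computed when 'Train with reconstruction' is on)
    """
    loss = denoise_loss + rec_weight * rec_loss   # rec_weight scales the extra term
    loss.backward()
    with torch.no_grad():
        pos_emb -= lr * pos_emb.grad                      # positive embedding: base lr
        neg_emb -= (lr * neg_lr_weight) * neg_emb.grad    # negative embedding: scaled lr
        pos_emb.grad = None
        neg_emb.grad = None
    return loss.item()
```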

bladedsupernova commented 1 year ago

> @bladedsupernova do you mind sharing your exact parameters in training the embeddings?

I didn't train anything to get those great results; I simply used the link I gave, that's all. I set the steps to 150, play a bit with the CFG around 7, and put in a long prompt and negative prompt.

IrisRainbowNeko commented 1 year ago

> I tried training with a single image again, in an attempt to replicate the ani-nahida results that 7eu7d7 shows in the README. I used the exact same image for training and named this embedding ani-greengod.
>
> Every parameter is the same except for using AnythingV3 as the model: clip skip 1, AnythingV3 VAE, no hypernetwork, xformers on, reconstruction on. I also followed @zhupeter010903's suggestion and used an empty template.
>
> The results definitely don't look like the ani-nahida.pt that 7eu7d7 provided.
>
> My preview results for every 500 steps: https://imgur.com/a/LyQqYmM and the results of my replication attempt: https://imgur.com/a/nUxRnjh

My ani-nahida and ani-nahida-neg are trained on animefull-latest, but training with AnythingV3.0 should also give good results. You can refer to the new instructions.

bycloudai commented 1 year ago

@7eu7d7 someone who trained some really good TIs actually suggested that AnythingV3 doesn't do as well as a merged model of 0.2 anime-full and 0.8 WD, so I will try both. Just to make sure: when you said no filewords, I'm assuming the prompt template file is empty? Thanks for your great work!
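
For anyone unfamiliar with that kind of mix: it is just a weighted sum of the two checkpoints, conceptually like the webui's Checkpoint Merger tab. A rough standalone sketch (the file names and the 0.2/0.8 ratio are only examples taken from the comment above):

```python
import torch

W = 0.8  # 0.2 * anime-full + 0.8 * WD

a = torch.load("animefull-latest.ckpt", map_location="cpu")["state_dict"]
b = torch.load("wd-v1-3.ckpt", map_location="cpu")["state_dict"]

merged = {}
for k, v in a.items():
    if k in b and v.dtype.is_floating_point:
        merged[k] = (1 - W) * v + W * b[k]   # per-tensor linear interpolation
    else:
        merged[k] = v                        # keep tensors that don't match (or aren't float)

torch.save({"state_dict": merged}, "anime02_wd08_merged.ckpt")
```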

IrisRainbowNeko commented 1 year ago

> no filewords

Just use style.txt or subject.txt. Training with filewords may exclude the described features.
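
For reference, the stock webui templates differ roughly like this (paraphrased, not copied verbatim; [name] is replaced by the embedding name and [filewords] by each image's caption):

```text
subject.txt style (no filewords):
  a photo of a [name]
  a rendering of a [name]

subject_filewords.txt style:
  a photo of a [name], [filewords]
  a rendering of a [name], [filewords]
```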

tsukimiya commented 1 year ago

@7eu7d7 I still have a few questions as I have yet to achieve much that can be called a success.

  1. Is the prefix attached to the Embedding Name in the sample to avoid duplication with other nouns and to give a unique noun?
  2. In the sample ani-nahida, embedding length is set to (3, 6) and cfg scale to 3. Is there a guideline for determining these values?
  3. What are the advantages of enabling the "Train with reconstruction" option?

Thanks for developing these great features.

kou201 commented 1 year ago

> > no filewords
>
> Just use style.txt or subject.txt. Training with filewords may exclude the described features.

Hi, does this conclusion apply to the original TI?

Because I've seen similar conclusions in https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/1528#discussioncomment-4044422

It's a very counter-intuitive conclusion, which makes it hard to believe, but through my own tests I found that it seems to be correct.

Does this mean that almost all tutorials nowadays are wrong about filewords...?

bycloudai commented 1 year ago

I had a decent replication run; however, the results don't look like the Nahida training image.

This is the .pt that 7eu7d7 provided: [image: xy_grid-0064-30-7-None-Euler a-1662428442-6569e224-20221117183202]

This is the .pt that I generated: [image: xy_grid-0065-30-7-None-Euler a-1662428442-6569e224-20221117183224]

I followed the exact instructions that @7eu7d7 suggested; the only thing that differs is probably the model it was trained on. I speculate the difference might come from the base model, so I will do it again.

Also, is the Nahida training image 512x512, or did you use the uncropped 1417x1417 image? Currently I am using the 1417x1417 version that I downloaded from this GitHub repo.

zhupeter010903 commented 1 year ago

> > > no filewords
> >
> > Just use style.txt or subject.txt. Training with filewords may exclude the described features.
>
> Hi, does this conclusion apply to the original TI?
>
> Because I've seen similar conclusions in https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/1528#discussioncomment-4044422
>
> It's a very counter-intuitive conclusion, which makes it hard to believe, but through my own tests I found that it seems to be correct.
>
> Does this mean that almost all tutorials nowadays are wrong about filewords...?

From my understanding, filewords should describe the attributes and elements of the corresponding image that you do not want the TI to learn. For example, if you want to learn a character, and you have an image of the character sitting on a meadow in front of a forest, then you can include "sitting, meadow, forest" in the filewords.

In a simple experiment, I used one image cropped to different sizes as the dataset, and my filewords were simply "portrait", "cowboy shot", "full body", etc., describing the portion of the character visible in each crop; this outperformed the plain subject.txt case.
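
For anyone unsure where filewords come from: as far as I know, the webui reads them from a .txt file with the same name as each image (falling back to the filename itself), so a dataset along these lines would feed those tags in as [filewords] (paths and names are just an example):

```text
train/
  charA_01.png
  charA_01.txt    contains: sitting, meadow, forest
  charA_02.png
  charA_02.txt    contains: standing, white background
```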

bycloudai commented 1 year ago

OK, so it turns out that the model used for training and the model used for evaluation both matter a lot.

I trained a new embedding based on animefull-latest, called ani-dendro-nai.

This is the result when I use animefull-latest with the ani-dendro-nai embedding: [image: xy_grid-0067-30-7-None-Euler a-1662428442-925997e9-20221118020129]

This is the result when I use AnythingV3.0 with the ani-dendro-nai embedding: [image: xy_grid-0073-30-7-None-Euler a-1662428442-6569e224-20221118025945]

From what I observed, the embedding is better trained on animefull-latest rather than AnythingV3.0, but when inferencing with the embedding, AnythingV3.0 gives a much better result.

Still can't replicate the very high-quality Nahida that 7eu7d7 got, though. But I can confirm that the choice of model for training TI embeddings probably applies to DreamArtist too.

Omegastick commented 1 year ago

I think we need some more information to exactly replicate @7eu7d7's results (I'm trying to get the Nahida one right at the moment):

EDIT: I left a few training runs going today.

Without edits (just ani-nahida|ani-nahida-neg in the prompt) there were a handful of decent results, even a few of "README quality" if you cherry-pick. None of them were editable, though. Adding ", light purple hair, ice, city in background, snowflakes, night, ice wings", for example, degrades image quality significantly without changing the content of the image.

Here's the base settings I used for all the runs:

model: animefull-latest
initialization text: girl
vectors per token: 3, 6
learning rate: 0.0025
negative lr weight: 1
batch size: 1
prompt template file: subject.txt
width/height: 512
vae: None
steps: 8000

All images below were made with these settings: Steps: 20, Sampler: Euler a, CFG scale: 6, Size: 512x512, Clip skip: whatever they were trained with.

Clip skip 2: nahida_clip_2

Clip skip 1: nahida_clip_1

Clip skip 1 had some kind of okay results at 3500 steps: nahida_clip_1_3500_768

I also tried two runs with reconstruction loss, but both failed (started producing images of inanimate objects) around 1500 steps.

I ran the clip skip 1 run twice, to check for consistency, and the results were pretty much the same between the two runs, so there probably isn't much luck involved.

I'm curious as to why my results are so different to @bycloudai's. VAE maybe?

Omegastick commented 1 year ago

> > > no filewords
> >
> > Just use style.txt or subject.txt. Training with filewords may exclude the described features.
>
> Hi, does this conclusion apply to the original TI?
>
> Because I've seen similar conclusions in https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/1528#discussioncomment-4044422
>
> It's a very counter-intuitive conclusion, which makes it hard to believe, but through my own tests I found that it seems to be correct.
>
> Does this mean that almost all tutorials nowadays are wrong about filewords...?

Yes and yes.

It might not apply to hypernetworks, perhaps? I'm not sure why, though.

wansf3 commented 1 year ago

I've tried many times using the tutorial settings but am not even close. Please explain in more detail: what does your dataset look like, and everything around VAE, clip skip, and xformers; also, did you edit anything in style.txt or just use the vanilla one? Thanks @7eu7d7

Omegastick commented 1 year ago

Did a training run with EMA (positive) at 0.97 from the latest commits and there's a bit of editability.

ani-nahida: tmp3ygam_85.png

ani-nahida, light blue hair: tmpij3_7xir.png

Didn't have much luck with other changes, though.

VAE is off for these previews.

I'm gonna do a run with EMA (negative) at 0.97 too later.

EDIT: EMA (positive): 0.97 + EMA (negative): 0.97 was no good: tmpcr0uqm36

IrisRainbowNeko commented 1 year ago

> Still can't replicate the very high-quality Nahida that 7eu7d7 got, though. But I can confirm that the choice of model for training TI embeddings probably applies to DreamArtist too.

The Nahida images I provided were actually trained on animefull-latest with inference on Anything 3.0. This does work better.

IrisRainbowNeko commented 1 year ago

> I think we need some more information to exactly replicate @7eu7d7's results (I'm trying to get the Nahida one right at the moment):

I trained with clip skip 1 on animefull-latest and did not load a VAE. Inference is on Anything 3.0.

The new version adds EMA training; setting it to 0.97 or 0.95 may make training more stable.
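
For anyone unfamiliar with the term, EMA here is an exponential moving average applied to the trained weights. A generic sketch of the idea is below; this is the standard formulation, not necessarily the exact convention the extension uses:

```python
import torch

def ema_update(ema_params, live_params, decay=0.97):
    """Blend the smoothed copy toward the live weights after each optimization step."""
    with torch.no_grad():
        for e, w in zip(ema_params, live_params):
            e.mul_(decay).add_(w, alpha=1.0 - decay)
```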

IrisRainbowNeko commented 1 year ago

> I've tried many times using the tutorial settings but am not even close. Please explain in more detail: what does your dataset look like, and everything around VAE, clip skip, and xformers; also, did you edit anything in style.txt or just use the vanilla one? Thanks @7eu7d7

I trained with clip skip 1 on animefull-latest, did not load a VAE, and had xformers disabled. Inference is on Anything 3.0. There are no changes in style.txt.

bycloudai commented 1 year ago

I am still struggling to exactly replicate what @7eu7d7 did, but this is what I have so far, for people who want to get some decently high-quality results out of DreamArtist. There are a lot of little things you need to get right, and the training takes HOURS, so set everything up correctly; otherwise you'll make mistakes like mine, where I had to retrain many times because I had the wrong model or had the VAE on. (@7eu7d7, master, please correct me and teach me the ways to replicate your result.)

Here are my most recent and most successful replication comparisons:

[image: xy_grid-0103-30-9-None-Euler a-3908260091-6569e224-20221119170553]

ani-dendro-afl-sty & ani-dendro-afl-sbj are trained with the following:

The parameters I used when evaluating the embeddings:

The PTF made some visible differences, but I couldn't come to any concrete conclusion as to what is different and why. Here is a more detailed comparison:

Subject.txt: [image: xy_grid-0099-30-9-None-Euler a-3908260091-6569e224-20221119165655]
Style.txt: [image: xy_grid-0098-30-9-None-Euler a-3908260091-6569e224-20221119165239]

It is also interesting that my replication probably looks best at around 3500~4000 steps. I have no idea why.

It is also important to have the CORRECT corresponding -neg TI in the negative text prompt, as it will make the result slightly more consistent and coherent.

Without -neg: [image: xy_grid-0096-30-9-None-Euler a-3908260091-6569e224-20221119163625]
With -neg: [image: xy_grid-0099-30-9-None-Euler a-3908260091-6569e224-20221119165655]

Using the wrong corresponding -neg will make your image look like this, oversharpened or thick-lined in some cases: [image: xy_grid-0101-30-9-None-Euler a-3908260091-6569e224-20221119170420]
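
Concretely, the pairing looks like this (the extra tags are only illustrative; the point is that the -neg embedding in the negative prompt must match the positive one by name):

```text
Prompt:          masterpiece, best quality, ani-dendro-afl-sbj
Negative prompt: ani-dendro-afl-sbj-neg, lowres, bad anatomy
```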

Remember to use the VAE WHILE inferencing, not while training, as it will make your result look more vibrant.
Without VAE: [image: xy_grid-0086-30-9-None-Euler a-3908260091-6569e224-20221119153719]
With VAE: [image: xy_grid-0088-30-9-None-Euler a-3908260091-6569e224-20221119154230]

IrisRainbowNeko commented 1 year ago

> please correct me and teach me the ways to replicate your result

Excellent experimental investigation!

I actually trained Nahida without reconstruction, and EMA was set to 1.0, which turns it off. Due to the randomness of latent noise and prompt selection during training, it may be difficult to get the exact same result twice; even with the same random seeds, results may differ across devices. However, multiple experiments should be able to reproduce similar results.
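
For anyone chasing down that nondeterminism, a generic "seed everything" sketch is below; it is not something the extension exposes as a single switch, and even with all of this, different GPUs and cuDNN kernels can still produce slightly different numbers:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for reproducibility on a single machine:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```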

rabidcopy commented 1 year ago

I've had some success with style training so far, but only if I don't use the -neg embedding in the negative prompt; I seem to get better and more predictable results using just the positive embedding. (I tried unchecking creation of the negative embedding entirely, but training then errors out when it can't find a negative embedding, around line 370 of cptuning.py I believe.)

davizca commented 1 year ago

> > please correct me and teach me the ways to replicate your result
>
> Excellent experimental investigation!
>
> I actually trained Nahida without reconstruction, and EMA was set to 1.0, which turns it off. Due to the randomness of latent noise and prompt selection during training, it may be difficult to get the exact same result twice; even with the same random seeds, results may differ across devices. However, multiple experiments should be able to reproduce similar results.

I don't know, I'm trying a concept-art picture on SD 1.4 and it's getting nowhere...

Settings (for 1 image):
- name: test1
- init: angel
- vectors and neg: 3-6, overwrite
- txt2img prompt: masterpiece, best quality, test1, angel
- negative prompt: test1-neg, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry
- Steps 30, Euler A, batch size 4, 512x512
- lr: 0.003 (also tried 0.0025 <- looks even worse)
- HN lr: 0.00001
- reconstruction scale: 5 (changed to 3 <- looking worse)
- style.txt
- 10000 steps max (tried 20000, nothing)
- EMAs 0.95-0.95
- 25 steps x 2

Can you explain the red-style example settings to me, or the dog one? Just to make sure I'm doing this right.

bladedsupernova commented 1 year ago

I'm attempting to train in the Gradio app (see the screenshot below), but I don't understand where I get my directory from; I'm very lost here, really.

https://ibb.co/8XzTZRB

Also, there is no negative prompt embedding on the first tab at the left, nor a reconstruction slider to set to max.

Also, if I click Create Embedding, is that it, and I just have to wait a few hours for it to train on the one input pic I want variations of? Then I use the tag word I made to trigger her in generated pics?

TiagoTiago commented 1 year ago

> I'm attempting to train in the Gradio app (see the screenshot below), but I don't understand where I get my directory from; I'm very lost here, really.
>
> https://ibb.co/8XzTZRB
>
> Also, there is no negative prompt embedding on the first tab at the left, nor a reconstruction slider to set to max.
>
> Also, if I click Create Embedding, is that it, and I just have to wait a few hours for it to train on the one input pic I want variations of? Then I use the tag word I made to trigger her in generated pics?

Are you running the latest version of Auto?

bladedsupernova commented 1 year ago

> Are you running the latest version of Auto?

I'm using this, version 54: https://www.kaggle.com/code/miolovers1/stable-diffusion-automatic1111

RainehDaze commented 1 year ago

Been trying to get this to work all week, and I've gotten nothing in all that time that's remotely usable. I've tried single images, I've tried multiple images, I've tried images with cleaned backgrounds, and images without them; images without any filewords to guide it, and images with filewords describing what's in it, and the usual textual inversion practice of filewords describing what's not part of the desired subject. I've tried to learn a style or a character. EMA set to 1, or EMA set to 0.97. Lots of tokens or no tokens. Xformers off, full precision, no VAE, multiple different models, reconstruction on or off.

Every single result has been basically random. It finds some initial loss value, and just bounces around that the entire time. I've seen training with a stable loss of 0.08, but the output has nothing in common with the subject. It's baffling.

Edit: It seems that it might be a bug? I used one of the more specific fixes to get the embedding button to work, but only when I changed to the jjolton repo did I start to get some viable results. Going to keep testing.

Edit 2: Seems like it's working better than ever now. It actually appears compatible with the new xformers version (which was bugged before), which helps even more. The one strange thing is that very large effective batches mean zero learning, but more normal batch sizes (e.g. effectively 32) actually work. And, obviously, single-image generation still works.