Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Bug]: Sampling gone wrong after latest updates #376

Open djp3k05 opened 2 months ago

djp3k05 commented 2 months ago

What happened?

Using the same project with exactly the same settings, I was getting weird/bad samples. I could not understand why the training quality suddenly changed, until, as a last resort, I tried to generate an image in A1111 using exactly the same seed, the same prompt, and the same sampler. Here is the sample in OneTrainer and the sample in A1111: [image]

So OneTrainer sampling no longer produces the same quality. And this is very misleading... I thought that something had gone very wrong during training, but it was not the training's fault, it is the sampling's fault.

EDIT: to clarify: the first image was generated by OneTrainer at step 703 while training. At this step I saved the .safetensors file, loaded it in A1111, and generated an image using the same seed, prompt and other settings.

Before the updates made in the last few days, the sampling from OneTrainer worked fine... I was getting the same images while sampling as from the models loaded in A1111.

What did you expect would happen?

Fix the sampling...

Relevant log output

No response

Output of pip freeze

No response

O-J1 commented 2 months ago

Edit your post to include the config file that you believe produces good samples vs. the config file that produces bad samples. (Just make sure to Ctrl+F and replace your username.)

djp3k05 commented 2 months ago

Maybe I was not clear enough. The first image was generated by OneTrainer at step 703 while training. At this step I saved the .safetensors file, loaded it in A1111, and generated an image using the same seed, prompt and other settings. I will edit my initial post to make it clearer.

Nerogar commented 2 months ago

Seeds are not compatible between OT and A1111. The images will always be different.
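(For illustration only, not OneTrainer's or A1111's actual code: even when both tools ultimately draw the initial latent noise with torch.randn, the same integer seed does not guarantee the same noise, because the generator device and draw order are implementation details. A minimal sketch under that assumption:)

```python
import torch

seed = 703  # same integer seed in both "frontends"

# Driving the RNG on the CPU vs. on the GPU produces different noise tensors,
# so the final images diverge even with an identical prompt, sampler and step count.
cpu_gen = torch.Generator("cpu").manual_seed(seed)
noise_cpu = torch.randn((1, 4, 128, 128), generator=cpu_gen)

if torch.cuda.is_available():
    cuda_gen = torch.Generator("cuda").manual_seed(seed)
    noise_cuda = torch.randn((1, 4, 128, 128), generator=cuda_gen, device="cuda")
    print(torch.allclose(noise_cpu, noise_cuda.cpu()))  # expected: False
```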

djp3k05 commented 2 months ago

Seeds are not compatible between OT and A1111. The images will always be different.

OK... seeds are not identical, but can you notice the difference in subject quality?
And I mentioned that in this training run I was using the same project settings I used a few weeks ago. But I noticed the samples were way worse than in my previous run, and I thought that the updates had done something to the training. But the training is going OK; only the samples do not reflect reality, the actual state of the model.

English is not my first language, so maybe I'm not explaining the situation as I should, but I think I was clear enough to point you in the right direction.

djp3k05 commented 2 months ago

Then let me tell you the long story: I started training using the same project, but I added a new set of pictures. Even at 7000 steps, the model was ugly and did not resemble my original pictures. The only change was adding that new set of pictures. So I removed it (returning to my original project, which I was training a week ago with perfect results) and started the training process again. Same result: the samples from OneTrainer were ugly (an old/disfigured person). At this point I started to believe that the latest updates were causing the bad training. I started another training run, this time set to save the model at step 703, so I could compare sample image 703 from OneTrainer with an image generated in A1111 from the .safetensors file saved at the same step. You can see the difference... which is major. Before upgrading OneTrainer (I was a few commits behind, so I do not know at which commit the sampler got affected) I was getting perfect samples, identical to the ones I was getting from the model in A1111.

So the training is going fine, same as before; only the sampling is behaving strangely, with no relation to the training status/quality.

djp3k05 commented 2 months ago

Here is a comparison directly from OT sampling. On the left is a run from an older version; on the right we have the latest version. As I said, it is the same project, nothing changed. If I load the .safetensors file in A1111, I get the same quality as in the first picture. So training is still doing great, only the sampling gives ugly/unrealistic versions. And the difference is enormous. Now I can't figure out how the training is going (from the samples); I have to load the model in A1111 to see whether it is OK or not.

OT sampling, older version vs. latest version (same training step): [image]

djp3k05 commented 2 months ago

It seems I'm getting ignored... In the last few days I made many experiments and came to some conclusions.

Even the first sample is not OK! (sample 0-0-0, so even before the training starts): [image]

Switching from BF16 to FP16 I got this, and stopped the training because I thought that the model had gotten fried: [image]

But loading this model in A1111, surprise: [image]

All tests were done using the same project, same concepts, same settings. In my previous post I also posted the samples from OT before and after, so it was working fine with the Pony models too.

Can you have a look at this issue, @Nerogar or anybody else? I really love this project! It is way better than Kohya or any other similar tool.

Nerogar commented 2 months ago

When you say "after latest updates", what exactly do you mean? Which version produced better results?

O-J1 commented 2 months ago

Please remember the following: you are not owed support. Yours is not the only issue in most repos, and there are likely many competing priorities. More often than not, people are preoccupied rather than maliciously ignoring you.

Nerogar commented 2 months ago

I've gone back to version fd53dd3 (from 2024-03-23) and compared the sample outputs at step 0. They are exactly the same as with the most recent version. From my point of view it looks like that model is just broken and unable to produce good images with the default sampler settings (DDIM specifically, but others aren't much better). Unless you can provide me with exact steps to reproduce any issues, I won't be able to help.

djp3k05 commented 2 months ago

Please remember the following: you are not owed support. Yours is not the only issue in most repos, and there are likely many competing priorities. More often than not, people are preoccupied rather than maliciously ignoring you.

Sure, I know that! I did not pay for support or for the application, so please don't get me wrong.

When you say "after latest updates", what exactly do you mean? Which version produced better results?

I'm not really sure which version I was on before. Probably somewhere after the fixes for the scrollable samples page (commit 14cb1fd57a69af0e0ce8e7ae53d2cbe8c618a8a1, Sat Jun 22 03:03:53 2024 -0400).

I tried to use an older commit (I don't know if this was the right way) by running "git checkout 839c2a66e37497b2e66bde0a60cbc361ddd32797 ." (this was certainly older than my last known-good version). Before that, I deleted almost all files and folders, including the venv folder. After the checkout, I ran install.bat. I started the tests with the same project but got the same bad results. Hope this info helps.

djp3k05 commented 2 months ago

I've gone back to version fd53dd3 (from 2024-03-23) and compared the sample outputs at step 0. They are exactly the same as with the most recent version. From my point of view it looks like that model is just broken and unable to produce good images with the default sampler settings (DDIM specifically, but others aren't much better). Unless you can provide me with exact steps to reproduce any issues, I won't be able to help.

I agree that it is really strange. As you can see in my examples above, I posted OT sampling before and after: it worked fine, then after an update the sampling went wrong. And even after returning to an even older version (if I did it right), the sampling is still just as bad. I'm using SDE++ Karras as the sampler, but I tested with all the included samplers. I will test with other Pony models and return with feedback.

bingobongo231 commented 2 months ago

I think the problem is that Pony models require clip skip to be 2. I always get messy noise instead of a picture in ComfyUI if I don't set clip skip for Pony-based models. Could you please add an option to set "clip skip" alongside seed, CFG, etc. in the Sample config dialog?

Nerogar commented 2 months ago

You can try to change the "Text Encoder 1/2 Clip Skip" settings and see if that makes a difference. But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything.
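(As a rough illustration of what "clip skip" means in general, not OneTrainer's actual implementation: the prompt embedding is taken from an earlier hidden layer of the CLIP text encoder instead of the final one. A1111-style "clip skip 2" corresponds to the second-to-last layer, which is what SDXL's text encoders use by default. A minimal sketch using transformers; the model name and the numbering convention here are assumptions:)

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative text encoder; SDXL actually uses two different CLIP encoders.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a person", return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens, output_hidden_states=True)

clip_skip = 2  # A1111-style numbering: 1 = last layer, 2 = second-to-last layer
prompt_embeds = out.hidden_states[-clip_skip]
print(prompt_embeds.shape)  # (1, sequence_length, hidden_size)
```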

bingobongo231 commented 2 months ago

You can try to change the "Text Encoder 1/2 Clip Skip" settings and see if that makes a difference. But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything.

That worked, thank you! I didn't know that training parameters have influence on sampling.

O-J1 commented 2 months ago

You can try to change the "Text Encoder 1/2 Clip Skip" settings and see if that makes a difference. But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything.

That worked, thank you! I didn't know that training parameters have influence on sampling.

Kindly reread what he said:

"But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything."

TheIrishAce commented 1 month ago

You can try to change the "Text Encoder 1/2 Clip Skip" settings and see if that makes a difference. But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything.

That worked, thank you! I didn't know that training parameters have influence on sampling.

I think I'm seeing issues similar to what's being discussed here, and I'm wondering what you changed the clip skip value to for the Text Encoders. Currently mine are 0, so what did you set yours to?

bingobongo231 commented 1 month ago

You can try to change the "Text Encoder 1/2 Clip Skip" settings and see if that makes a difference. But as far as I know, what people refer to as "clip skip 2" is the default setting of SDXL. So you shouldn't need to change anything.

That worked, thank you! I didn't know that training parameters have influence on sampling.

I think I'm seeing issues similar to what's being discussed here, and I'm wondering what you changed the clip skip value to for the Text Encoders. Currently mine are 0, so what did you set yours to?

I've set Text Encoder 1 Clip Skip to 2, Text Encoder 2 was disabled.

djp3k05 commented 1 month ago

I've done numerous tests/combinations. Setting the TE1 clip skip to 2 will NOT help improve the sampling in OT; it will make the sampling even worse. You will notice that the model will have big open eyes... and while training, the big eyes will be there, making the images a little bit monstrous. But loading the model in A1111 or any other app, the samples are fine. In those apps, I noticed that using clip skip 2 on TE1 slightly improves the training results.

mx commented 1 month ago

Clip skip is numbered differently in OneTrainer: what A1111 calls "2", we call "1".
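(So, under that convention, the two numberings differ only by an offset of one. A hypothetical helper just to spell out the mapping:)

```python
def a1111_to_onetrainer_clip_skip(a1111_value: int) -> int:
    # A1111 counts from 1 (1 = last layer); per the comment above,
    # OneTrainer's value is one lower, so A1111's "2" is OneTrainer's "1".
    return a1111_value - 1

print(a1111_to_onetrainer_clip_skip(2))  # 1
```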