GreenTeaBD closed this issue 1 year ago.
I was just trying to figure out why the hell I keep getting this as well.
I maybe took too long typing this; I see there was a commit about half an hour ago that seems, maybe, relevant? Going to go try it and see.
edit: It does not :(
I just found that it does work with the normal 512-depth-ema model. Meaning, it might be related to https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6891
These are depth models I trained myself, and they were trained with an extremely high learning rate (it's what works best for what I'm trying to do), but, yeah, like I was saying, these models worked in an earlier version of automatic1111.
I did a hard reset all the way back to https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/4af3ca5 (the version I had before, which was working) and it does work on that one. I'd go through the whole thing to figure out exactly where this breaks, but I won't have time for about another week; Lunar New Year is going on.
Let me save you some time: 9991967f40120b88a1dc925fdf7d747d5e016888
Run with --disable-nan-check, but FYI this shouldn't happen normally (the output of a model being all NaNs).
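For context, the check behind that message is conceptually just "is the model output entirely NaN". This is not the webui's actual code, just a minimal PyTorch sketch of the idea:

import torch

def output_is_all_nan(x: torch.Tensor) -> bool:
    # True when every element of the model output is NaN,
    # which is the situation the webui error message describes.
    return bool(torch.isnan(x).all())

# Hypothetical example with a fake latent tensor:
latents = torch.full((1, 4, 64, 64), float("nan"), dtype=torch.float16)
if output_is_all_nan(latents):
    raise RuntimeError("A tensor with all NaNs was produced in Unet.")

--disable-nan-check only skips that test; it doesn't make the NaNs themselves go away, which is why people still get black or junk images with it.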
I have this issue with the standard SD 2.1 model. Using --disable-nan-check removes the error, but the output is black.
Then something else in an earlier commit is breaking it; what we see in the error message is just a symptom.
Yo guys, I think this is a bug in xformers: https://github.com/facebookresearch/xformers/issues/631
I just started to get this error on the last git pull too, running a model I've been using just fine. I have a 3060 12GB card, which I think supports half precision... Using --no-half gives the error "A tensor with all NaNs was produced in Unet." Using --disable-nan-check allows it to work again, but it just produces junk (and sometimes a black image). Something in a recent commit broke it. Going to hunt it down...
This is a hard bug... I've removed xformers, and gone back to previous commits, and I'm still getting junk outputs...
Interesting. If I switch to another model, generate an image, and then switch back to the model I want, the error goes away and I get good outputs. So there is something about switching models that makes it work again... (this is on the latest b165e34 commit, with xformers on)
@Jonseed which models are you using? And by junk output, what do you mean?
The "black" output is the NaN issue from xformers, but there is yet another, even more dubious bug causing bad output as you mentioned... I suspect it's with one of the commonly used models; it would be useful to know more.
@arpowers it was Protogen Infinity. I switched to SD1.5-pruned-emaonly and then back to Protogen Infinity, and it worked again (good outputs). I haven't got a black image since. I only got black images with the junk (garbled) outputs.
This bug is driving me crazy. It's on certain models: the glitch gives you an error, and then everything afterwards is garbled junk. This bug completely bricks these models. How do I revert to a different branch to fix this?
@opy188 Install a new webui folder and switch to any older commit after cloning with
git checkout 1234567
and just stick with that version for the type of model you need
sorry
@ClashSAN I can't seem to git checkout to that different branch. Is there anything else I need to type?
the "1234567" is where you put your chosen commit.
git checkout 4af3ca5393151d61363c30eef4965e694eeac15e
Also getting this, with Protogen 3.4 only.
I went back several commits, trying half a dozen, and still had problems... Not sure which commit is ok.
@opy188 @riade3788 did you try the trick of switching to another model, and then back to your desired model? Does that fix it for you?
I wonder if the junk garbled output is related to this bug: "Someone discovered a few days ago that merging models can break the position id layer of the text encoder. It gets converted from int64 to a floating point value and then forced back to int for inference which may cause problems due to floating point errors..."
But that wouldn't explain why switching to a different model, and then back to the merged model makes it work fine again...
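If anyone wants to check whether their merged model has that broken position id layer, it's visible right in the checkpoint. A rough sketch, assuming a .ckpt file and the usual SD 1.x key name (safetensors files and SD 2.x models use different loading / key names), with a made-up filename:

import torch

ckpt = torch.load("my-merged-model.ckpt", map_location="cpu")  # hypothetical filename
sd = ckpt.get("state_dict", ckpt)

# Usual SD 1.x location of the text encoder's position ids; it should be
# an int64 tensor containing 0..76.
key = "cond_stage_model.transformer.text_model.embeddings.position_ids"
if key in sd:
    ids = sd[key]
    print("dtype:", ids.dtype)  # torch.int64 expected
    if ids.dtype.is_floating_point:
        # Any drift away from whole numbers is the corruption described above.
        print("max drift:", (ids - ids.round()).abs().max().item())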
Getting this with 2.1, but it was working fine with 2.0.
It seems the issue is with xformers; I can run without xformers on any commit from the latest (which is 12 hours old this time) to last week's.
I'm running with xformers just fine, except that I have to switch to a different model and back for Protogen Infinity to generate good outputs.
When I boot up the server, and generate "a cat", I either get the NaN error, or I get this:
Then I switch to another model, and back to Protogen Infinity, and generate "a cat" and get this:
This is with xformers turned on.
Can confirm that as of about a day and a half ago, every third gen I run gives NaN errors. Even using a batch size of 1. Very annoying. Can also confirm it happens regardless of whether or not --xformers is used
Another curious thing I noticed this morning is that I'm unable to reproduce past images. When I upload a past image with all the same gen parameters and send it to txt2img to regenerate, the image is somewhat similar but clearly not the same as what I generated just a day or two ago. This isn't just the minor xformers nondeterminism either; it's quite different. Not sure if this is related to the same bug...
Here's another interesting thing I've noticed. If I write "a cat" or "_a cat" or "'a cat" or "`a cat" I get junk output. If I write ",a cat" or "&a cat" I get the NaN error. Even if I just change a space, "~a cat" produces junk output, but "~acat" gives NaN error.
So the junk output and the NaN seem to be related somehow, and the specific characters in the prompt affect which you get. Is it this bug where in some merged models the position id layer of the text encoder is broken? And why does switching to another model, and then back to Protogen seem to fix it, and produce good outputs again? (Although I can't seem to reproduce past images...).
Note, after switching models, and then back to Protogen, I can generate with ",a cat" or "&a cat" without a NaN error, so there seems to be a bug in the way the repo is loading models when the server is initialized, which is different than when switching between models.
I get this error on my custom 2.1 models from EveryDream2Trainer, sometimes also on the 2.1 base model. Bisect revealed 0c3feb202c5714abd50d879c1db2cd9a71ce93e3 to be the cause. Seems like disabling the initialization isn't a good idea for certain models.
Last good commit is a0ef416aa769022ce9e97dcc87f88a0ae9e6cc58
@ata4 but a0ef416 is the commit AFTER 0c3feb2? If 0c3feb2 is the problem, wouldn't the last good commit be the one before that, 76a21b9?
He discusses this issue
@Jonseed yes, you're right. It was listed as the previous one in my Git tool for some reason. But still, a0ef416aa769022ce9e97dcc87f88a0ae9e6cc58 actually works fine for me consistently. There may be another bad commit, I had to skip some during bisect, since they didn't launch on start.
Edit: these were skipped: f9c2147, 27ea694, e9f8292
@arpowers that video seems to be about the Dreambooth extension...
@ata4 if a0ef416 works fine for you, then the bad commit cannot be 0c3feb2, unless a0ef416 reverted 0c3feb2, which I don't think it did.
Has anyone looked at diffusers? Not to get into it but I believe the issues are coming from changes to that library.
(I also ran into issues running pure scripts like the shivam dreambooth. Diffusers is the only common thread)
It's likely the issue is one of the unversioned models.
@AUTOMATIC1111
@Jonseed I see... it's indeed weird. It should be the other way round to make sense, yet those are the commits that I've checked out to test.
Anyway, I forced the slow initialization method in sd_models.py of the latest commit, and now the models can be used without NaN errors. So at least in my case, that commit isn't entirely unrelated to the error.
Edit: if anyone wants to test: replace those lines at line 378:
sd_model = None
try:
    with sd_disable_initialization.DisableInitialization():
        sd_model = instantiate_from_config(sd_config.model)
except Exception as e:
    pass

if sd_model is None:
    print('Failed to create model quickly; will retry using slow method.', file=sys.stderr)
    sd_model = instantiate_from_config(sd_config.model)
with this:
sd_model = instantiate_from_config(sd_config.model)
@arpowers the diffusers library isn't versioned?
@ata4 ok, I think you're onto something there! I tested changing that code in sd_models.py to force "slow" torch weight initialization, and I don't get junk output or NaN errors, I don't have to switch models to get Protogen Infinity to generate good outputs, and I'm able to regenerate past images! That seems like a win!
@AUTOMATIC1111 it looks like it might be helpful to provide a setting that lets users opt out of the "disable torch weight initialization to speed up creating SD model from config" change added in 0c3feb2 (or opt in to it), as it seems to have a significant impact on some models, almost completely breaking them. I'm not sure that would fix this bug for everyone, but it might. If the error also happens for people on the SD 2.1 base model, then it might be worth reverting the disabling of torch weight initialization entirely. Maybe the weights need to be initialized for models to work properly.
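One plausible mechanism (just a guess on my part): with initialization disabled, any parameter the checkpoint doesn't actually contain never gets sane starting values and just keeps whatever uninitialized memory torch allocated, whereas the slow path initializes everything before the checkpoint is loaded over it. A quick way to see whether a problem model is simply missing keys relative to a known-good one; a rough sketch with hypothetical filenames:

import torch

def load_sd(path):
    ckpt = torch.load(path, map_location="cpu")
    return ckpt.get("state_dict", ckpt)

# Compare the key sets of a known-good base checkpoint and a model that NaNs.
base = load_sd("v1-5-pruned-emaonly.ckpt")     # hypothetical paths
broken = load_sd("protogen-infinity.ckpt")

missing = sorted(set(base) - set(broken))
print(f"{len(missing)} keys in the base model are absent from the problem model")
for k in missing[:20]:
    print("  ", k)

If those models turn out to be missing tensors that the base model has, that would explain why they only work once the weights have been initialized first.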
@Jonseed it downloads lots of models, tokenizers, CLIP, etc. The models usually aren't versioned; basically diffusers just downloads the latest version.
I can't be sure yet (as I haven't found the exact change), but this is why I think there's a problem: there is an untracked change occurring.
I made the sd_models.py change to force slow initialization and my first 20 or so gens worked well, but now I'm back to one out of every 3 runs NaN-bombing, unfortunately.
@swalsh76 hmm, so why would it work for 20 generations, and then stop working... what changed after the 20 generations? Did you try restarting the server?
I'm beginning to believe that my issue is similar to the NaN in this thread, but not quite. I'm running a 4070 Ti, and until mid last week I was working fine with only --xformers as my command-line argument; then I started getting the NaN error, and yesterday I noticed that, tacked on to the end of the error spam, it suggested using --no-half-vae.
That does seem to fix it, but I'm at a loss to explain why I would suddenly need it with a 4070 Ti doing nothing remarkable with my txt2img gens.
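From what I understand, the reason --no-half / --no-half-vae can make these NaNs disappear is just fp16 range: half precision overflows to inf around 65504, and inf minus inf is NaN, while the same math in fp32 is fine. A tiny illustration (not webui code, just the arithmetic):

import torch

x = torch.tensor([60000.0], dtype=torch.float16)
print(x * 2)                               # overflows: tensor([inf], dtype=torch.float16)
print((x * 2) - (x * 2))                   # inf - inf -> tensor([nan], dtype=torch.float16)
print((x.float() * 2) - (x.float() * 2))   # same math in fp32 -> tensor([0.])

So a model (or VAE) whose activations happen to wander past the fp16 range will NaN in half precision even though the hardware itself supports fp16 just fine.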
Same error here... Can't train anything.
I am running an RTX 3090 and had the same issue. --no-half fixed it for me so far... but it doesn't make sense that we would need to do that with this hardware. I'm on Windows 11, Python 3.10.6. I wonder if this is a Windows 11 curse?
--no-half fixes it for me as well on a 4090, Windows 11, Python 3.10.8.
Well, I'm using a 3070, and if I add --no-half with --xformers, training an embedding gives an out-of-memory error, so... I just use a legacy version of the WebUI and it works fine.
I'm running Windows 10 22H2, so it's probably not a Windows 11 curse.
--no-half did not fix this for me.
It's broken on some models and I'm unsure why.
M1 Max, 64 GB RAM, Ventura 13.2
Python 3.9.16
Some combination of restarting the UI, redownloading the model, and restarting my browser worked, but the bug has reappeared twice for me. I tend to interrupt renders often. TBC.
Nope, I'm running a 3090 and am on W10. --no-half also fixed it for me so far.
@Stephenitis on which models is it broken for you? There might be a commonality between these models. Are they pruned models, fp16 models, merged models, ema-only models, ckpt or safetensor models, native SD models?
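If it helps narrow that down, most of those properties can be read straight out of the checkpoint file. A rough sketch, assuming a .ckpt file (safetensors files would need safetensors.torch.load_file instead) and a made-up filename:

import collections
import torch

ckpt = torch.load("suspect-model.ckpt", map_location="cpu")  # hypothetical filename
sd = ckpt.get("state_dict", ckpt)

# fp16 vs fp32: tally the tensor dtypes
print(collections.Counter(str(v.dtype) for v in sd.values() if torch.is_tensor(v)))

# EMA weights live under the "model_ema." prefix in LDM-style checkpoints,
# so their presence or absence hints at ema-only vs pruned models.
print("EMA keys:", sum(k.startswith("model_ema.") for k in sd))
print("total keys:", len(sd))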
Is there an existing issue for this?
What happened?
I made a fresh reinstall of automatic1111 today. Normal models work, depth models do not work. They all have the corresponding yaml file and were working on my older, other install of automatic1111.
So when I try to use a depth model I get the error seen in the logs; it tells me to use --no-half to fix it, which is not ideal, but I have plenty of VRAM. If I use --no-half, though, it still gives me an error, but a different one, also in the logs.
Edit: Because the logs mention my GPU may not support the half type: my GPU is a 4090.
Steps to reproduce the problem
launch webui.bat, img2img, load a depth model, feed it a source image, hit generate, crash
What should have happened?
img2img should have generated an image
Commit where the problem happens
Commit hash: 0f5dbfffd0b7202a48e404d8e74b5cc9a3e5b135
What platforms do you use to access UI ?
Windows
What browsers do you use to access the UI ?
Mozilla Firefox
Command Line Arguments
Additional information, context and logs
Without --no-half:

0%|          | 0/9 [00:00<?, ?it/s]
Error completing request
Arguments: ('task(z8s2gece94605h3)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x280023099C0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 9, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 512, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, …) {}
Traceback (most recent call last):
  File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\img2img.py", line 148, in img2img
    processed = process_images(p)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 480, in process_images
    res = process_images_inner(p)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 609, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 1016, in sample
    samples = self.sampler.sample_img2img(self, self.init_latent, x, conditioning, unconditional_conditioning, image_conditioning=self.image_conditioning)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in sample_img2img
    samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling
    return func()
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in

With --no-half:

0%|          | 0/9 [00:00<?, ?it/s]
Error completing request
Arguments: ('task(5014z0igs0omk0j)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x203CDE726B0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 7, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 910, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, …) {}
Traceback (most recent call last):
  [identical call stack to the traceback above, truncated at the same frame]