AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: A tensor with all NaNs was produced in Unet #6923

Closed GreenTeaBD closed 1 year ago

GreenTeaBD commented 1 year ago

Is there an existing issue for this?

What happened?

I did a fresh reinstall of automatic1111 today. Normal models work, but depth models do not. They all have the corresponding yaml file and were working on my older install of automatic1111.

When I try to use a depth model I get the error shown in the logs. It tells me to use --no-half to fix it, which is not ideal, but I have plenty of VRAM. If I launch with --no-half, though, I still get an error, just a different one (also in the logs).

Edit: Since the logs mention that my GPU may not support the half type: my GPU is a 4090.

Steps to reproduce the problem

Launch webui.bat, open img2img, load a depth model, feed it a source image, hit Generate, and it crashes.

What should have happened?

img2img should have generated an image

Commit where the problem happens

Commit hash: 0f5dbfffd0b7202a48e404d8e74b5cc9a3e5b135

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--xformers --disable-safe-unpickle

Installed extensions: deforum-for-automatic1111-webui, sd_save_intermediate_images, stable-diffusion-webui-Prompt_Generator, ultimate-upscale-for-automatic1111

Additional information, context and logs

Without --no-half:

0%| | 0/9 [00:00<?, ?it/s] Error completing request Arguments: ('task(z8s2gece94605h3)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x280023099C0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 9, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 512, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, '

\n', True, True, '', '', True, 50, True, 1, 0, False, 4, 1, '

Recommended settings: Sampling Steps: 80-100, Sampler: Euler a, Denoising strength: 0.8

', 128, 8, ['left', 'right', 'up', 'down'], 1, 0.05, 128, 4, 0, ['left', 'right', 'up', 'down'], False, False, False, False, '', '

Will upscale the image by the selected scale factor; use width and height sliders to set tile size

', 64, 0, 2, '', None, '720:576', False, 1, '', 0, '', True, False, False, '

Deforum v0.5-webui-beta

', '

This script is deprecated. Please use the full Deforum extension instead.
\nUpdate instructions:

', '

github.com/deforum-art/deforum-for-automatic1111-webui/blob/automatic1111-webui/README.md

', '

discord.gg/deforum

', '

Will upscale the image depending on the selected target size type

', 512, 8, 32, 64, 0.35, 32, 0, True, 0, False, 8, 0, 0, 2048, 2048, 2) {} Traceback (most recent call last): File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 56, in f res = list(func(*args, kwargs)) File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 37, in f res = func(*args, *kwargs) File "I:\stable-diffusion\stable-diffusion-webui\modules\img2img.py", line 148, in img2img processed = process_images(p) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 480, in process_images res = process_images_inner(p) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 609, in process_images_inner samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 1016, in sample samples = self.sampler.sample_img2img(self, self.init_latent, x, conditioning, unconditional_conditioning, image_conditioning=self.image_conditioning) File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in sample_img2img samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={ File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling return func() File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={ File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context return func(args, kwargs) File "I:\stable-diffusion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral denoised = model(x, sigmas[i] * s_in, *extra_args) File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl return forward_call(input, **kwargs) File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 354, in forward devices.test_for_nans(x_out, "unet") File "I:\stable-diffusion\stable-diffusion-webui\modules\devices.py", line 136, in test_for_nans raise NansException(message) modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try using --no-half commandline argument to fix this.

--no-half 0%| | 0/9 [00:00<?, ?it/s] Error completing request Arguments: ('task(5014z0igs0omk0j)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x203CDE726B0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 7, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 910, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, '

\n', True, True, '', '', True, 50, True, 1, 0, False, 4, 1, '

Recommended settings: Sampling Steps: 80-100, Sampler: Euler a, Denoising strength: 0.8

', 128, 8, ['left', 'right', 'up', 'down'], 1, 0.05, 128, 4, 0, ['left', 'right', 'up', 'down'], False, False, False, False, '', '

Will upscale the image by the selected scale factor; use width and height sliders to set tile size

', 64, 0, 2, '', None, '720:576', False, 1, '', 0, '', True, False, False, '

Deforum v0.5-webui-beta

', '

This script is deprecated. Please use the full Deforum extension instead.
\nUpdate instructions:

', '

github.com/deforum-art/deforum-for-automatic1111-webui/blob/automatic1111-webui/README.md

', '

discord.gg/deforum

', '

Will upscale the image depending on the selected target size type

', 512, 8, 32, 64, 0.35, 32, 0, True, 0, False, 8, 0, 0, 2048, 2048, 2) {} Traceback (most recent call last): File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 56, in f res = list(func(*args, kwargs)) File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 37, in f res = func(*args, *kwargs) File "I:\stable-diffusion\stable-diffusion-webui\modules\img2img.py", line 148, in img2img processed = process_images(p) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 480, in process_images res = process_images_inner(p) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 609, in process_images_inner samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts) File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 1016, in sample samples = self.sampler.sample_img2img(self, self.init_latent, x, conditioning, unconditional_conditioning, image_conditioning=self.image_conditioning) File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in sample_img2img samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={ File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling return func() File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={ File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context return func(args, kwargs) File "I:\stable-diffusion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral denoised = model(x, sigmas[i] * s_in, *extra_args) File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl return forward_call(input, **kwargs) File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 354, in forward devices.test_for_nans(x_out, "unet") File "I:\stable-diffusion\stable-diffusion-webui\modules\devices.py", line 136, in test_for_nans raise NansException(message) modules.devices.NansException: A tensor with all NaNs was produced in Unet.

Pedroman1 commented 1 year ago

I was just trying to figure out why I keep getting this as well.

GreenTeaBD commented 1 year ago

I may have taken too long typing this; I see there was a commit about half an hour ago that seems like it might be relevant? Going to go try it and see.

Edit: It does not help :(

GreenTeaBD commented 1 year ago

I just found that it does work with the normal 512-depth-ema model, which means it might be related to https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6891

These are depth models I trained myself, and they were trained with an extremely high learning rate (it's what works best for what I'm trying to do), but, as I was saying, these models worked in an earlier version of automatic1111.

I did a hard reset all the way back to https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/4af3ca5 (the version I had before, which was working) and it does work on that one. I'd bisect the whole range to figure out exactly where this breaks, but I won't have enough time for about another week; Lunar New Year is going on.

mezotaken commented 1 year ago

Let me save you some time: 9991967f40120b88a1dc925fdf7d747d5e016888. Run with --disable-nan-check, but FYI, a model's output being all NaNs shouldn't happen normally.
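
Roughly speaking, the check that flag disables boils down to something like the following. This is a simplified sketch based on the names in the traceback, not the exact code in modules/devices.py:

    import torch

    class NansException(Exception):
        pass

    def test_for_nans(x: torch.Tensor, where: str) -> None:
        # If every element of the sampler output is NaN, the fp16 forward pass
        # has overflowed or underflowed somewhere; raise instead of decoding junk.
        # (--disable-nan-check simply skips this check.)
        if torch.isnan(x).all():
            raise NansException(f"A tensor with all NaNs was produced in {where}.")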

ghost commented 1 year ago

I have this issue with the standard SD 2.1 model. Using --disable-nan-check removes the error, but the output is black.

mezotaken commented 1 year ago

Then something else in an earlier commit is breaking it; what we see in the error message is just a symptom.

arpowers commented 1 year ago

Hey guys, I think this is a bug in xformers: https://github.com/facebookresearch/xformers/issues/631

Jonseed commented 1 year ago

I just started to get this error on the last git pull too, running a model I've been using just fine. I have a 3060 12GB card, which I think supports half precision... Using --no-half gives the error "A tensor with all NaNs was produced in Unet." Using --disable-nan-check allows it to work again, but it just produces junk (and sometimes a black image). Something in a recent commit broke it. Going to hunt it down...

Jonseed commented 1 year ago

This is a hard bug... I've removed xformers, and gone back to previous commits, and I'm still getting junk outputs...

Jonseed commented 1 year ago

Interesting. If I switch to another model, generate an image, and then switch back to the model I want, the error goes away and I get good outputs. So there is something about switching models that makes it work again... (This is on the latest b165e34 commit, with xformers on.)

arpowers commented 1 year ago

@Jonseed which models are you using? And by junk output, what do you mean?

The "black" output is the NaN issue from xformers, but there is an even more dubious bug causing bad output, as you mentioned... I suspect it involves one of the commonly used models; it would be useful to know more.

Jonseed commented 1 year ago

@arpowers it was Protogen Infinity. I switched to SD1.5-pruned-emaonly and then back to Protogen Infinity, and it worked again (good outputs). I haven't got a black image since. I only got black images with the junk (garbled) outputs.

opy188 commented 1 year ago

This bug is driving me crazy. It happens on certain models: the glitch gives you an error, and then everything afterwards is garbled junk. It completely bricks these models. How do I revert to a different branch to fix this?

ClashSAN commented 1 year ago

@opy188 Install a new webui folder and switch to any older commit after cloning with

git checkout 1234567

and just stick with that version for the type of model you need

ClashSAN commented 1 year ago

sorry

opy188 commented 1 year ago

@ClashSAN I can't seem to git checkout to that different branch. Is there anything else I need to type?

ClashSAN commented 1 year ago

the "1234567" is where you put your chosen commit.

git checkout 4af3ca5393151d61363c30eef4965e694eeac15e
riade3788 commented 1 year ago

Also getting this, with Protogen 3.4 only.

Jonseed commented 1 year ago

I went back several commits, trying half a dozen, and still had problems... Not sure which commit is ok.

Jonseed commented 1 year ago

@opy188 @riade3788 did you try the trick of switching to another model, and then back to your desired model? Does that fix it for you?

Jonseed commented 1 year ago

I wonder if the junk garbled output is related to this bug: "Someone discovered a few days ago that merging models can break the position id layer of the text encoder. It gets converted from int64 to a floating point value and then forced back to int for inference which may cause problems due to floating point errors..."

But that wouldn't explain why switching to a different model, and then back to the merged model makes it work fine again...
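
For anyone who wants to check whether a particular merged checkpoint has that broken layer, here's a minimal sketch. The checkpoint path is a placeholder, and the key name assumes the usual SD 1.x layout (SD 2.x names the text encoder differently):

    import torch

    # Placeholder path; point this at the merged checkpoint you want to inspect.
    ckpt = torch.load("models/Stable-diffusion/merged-model.ckpt", map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)

    # Assumed key for SD 1.x-style checkpoints; adjust if your model differs.
    key = "cond_stage_model.transformer.text_model.embeddings.position_ids"
    if key in state_dict:
        ids = state_dict[key]
        print(key, ids.dtype)
        if ids.dtype != torch.int64:
            print("position_ids is not int64; this merge may have the broken text encoder layer")
    else:
        print("key not found; this checkpoint may use a different layout")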

saif-ellafi commented 1 year ago

Getting this with 2.1, but it was working fine with 2.0.

DearDhruv commented 1 year ago

It seems the issue is with xformers; I can run without xformers on any commit from the latest (which, at the time of writing, is 12 hours old) back to last week's.

Jonseed commented 1 year ago

I'm running with xformers just fine, except that I have to switch to a different model and back for Protogen Infinity to generate good outputs.

Jonseed commented 1 year ago

When I boot up the server, and generate "a cat", I either get the NaN error, or I get this: cat1

Then I switch to another model, and back to Protogen Infinity, and generate "a cat" and get this: cat2

This is with xformers turned on.

swalsh76 commented 1 year ago

Can confirm that as of about a day and a half ago, every third gen I run gives NaN errors, even using a batch size of 1. Very annoying. Can also confirm it happens regardless of whether or not --xformers is used.

Jonseed commented 1 year ago

Another curious thing I noticed this morning is that I'm unable to reproduce past images. When I upload a past image with all the same generation parameters and send it to txt2img to regenerate, the result is somewhat similar but clearly not the same as what I generated just a day or two ago. This isn't just the minor nondeterminism from xformers either; it's quite different. Not sure if this is related to the same bug...

Jonseed commented 1 year ago

Here's another interesting thing I've noticed. If I write "a cat" or "_a cat" or "'a cat" or "`a cat" I get junk output. If I write ",a cat" or "&a cat" I get the NaN error. Even if I just change a space, "~a cat" produces junk output, but "~acat" gives NaN error.

So the junk output and the NaN seem to be related somehow, and the specific characters in the prompt affect which you get. Is it this bug where in some merged models the position id layer of the text encoder is broken? And why does switching to another model, and then back to Protogen seem to fix it, and produce good outputs again? (Although I can't seem to reproduce past images...).

Note: after switching models and then back to Protogen, I can generate with ",a cat" or "&a cat" without a NaN error, so there seems to be a bug in the way the repo loads models when the server is initialized, which is different from how it loads them when switching between models.

ata4 commented 1 year ago

I get this error on my custom 2.1 models from EveryDream2Trainer, sometimes also on the 2.1 base model. Bisect revealed 0c3feb202c5714abd50d879c1db2cd9a71ce93e3 to be the cause. Seems like disabling the initialization isn't a good idea for certain models.

Last good commit is a0ef416aa769022ce9e97dcc87f88a0ae9e6cc58
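
For context, the general idea behind that commit's speedup is to stub out torch's per-layer weight initializers while the model skeleton is built, since the checkpoint weights overwrite them anyway. A rough, generic sketch of the technique (not the actual DisableInitialization code):

    import contextlib
    import torch

    @contextlib.contextmanager
    def skip_torch_weight_init():
        # Temporarily make reset_parameters a no-op so building the model
        # skeleton is fast; the real weights are loaded from the checkpoint later.
        originals = {
            torch.nn.Linear: torch.nn.Linear.reset_parameters,
            torch.nn.Conv2d: torch.nn.Conv2d.reset_parameters,
            torch.nn.LayerNorm: torch.nn.LayerNorm.reset_parameters,
        }
        try:
            for cls in originals:
                cls.reset_parameters = lambda self: None
            yield
        finally:
            for cls, fn in originals.items():
                cls.reset_parameters = fn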

Jonseed commented 1 year ago

@ata4 but a0ef416 is the commit AFTER 0c3feb2? If 0c3feb2 is the problem, wouldn't the last good commit be the one before that, 76a21b9?

arpowers commented 1 year ago

https://youtu.be/usgqmQ0Mq7g

He discusses this issue

ata4 commented 1 year ago

@Jonseed yes, you're right. It was listed as the previous one in my Git tool for some reason. But still, a0ef416aa769022ce9e97dcc87f88a0ae9e6cc58 actually works fine for me consistently. There may be another bad commit; I had to skip some during the bisect, since they didn't launch on start.

Edit: these were skipped: f9c2147, 27ea694, e9f8292

Jonseed commented 1 year ago

@arpowers that video seems to be about the Dreambooth extension...

Jonseed commented 1 year ago

@ata4 if a0ef416 works fine for you, then the bad commit cannot be 0c3feb2, unless a0ef416 reverted 0c3feb2, which I don't think it did.

arpowers commented 1 year ago

Has anyone looked at diffusers? Not to get into it, but I believe the issues are coming from changes to that library.

(I also ran into issues running pure scripts like the Shivam Dreambooth. Diffusers is the only common thread.)

It's likely the issue is with one of the unversioned models.

@AUTOMATIC1111

ata4 commented 1 year ago

@Jonseed I see... it's indeed weird. It should be the other way around to make sense, yet those are the commits I checked out to test.

Anyway, I forced the slow initialization method in sd_models.py on the latest commit, and now the models can be used without NaN errors. So at least in my case, that commit isn't entirely unrelated to the error.

Edit: if anyone wants to test, replace these lines at line 378:

    sd_model = None

    try:
        with sd_disable_initialization.DisableInitialization():
            sd_model = instantiate_from_config(sd_config.model)
    except Exception as e:
        pass

    if sd_model is None:
        print('Failed to create model quickly; will retry using slow method.', file=sys.stderr)
        sd_model = instantiate_from_config(sd_config.model)

with this:

    sd_model = instantiate_from_config(sd_config.model)

Jonseed commented 1 year ago

@arpowers the diffusers library isn't versioned?

Jonseed commented 1 year ago

@ata4 ok, I think you're onto something there! I tested changing that code in sd_models.py to force "slow" torch weight initialization, and I don't get junk output or NaN errors, I don't have to switch models to get Protogen Infinity to generate good outputs, and I'm able to regenerate past images! That seems like a win!

@AUTOMATIC1111 it looks like it might be helpful to give users a setting to opt in to (or out of) the "disable torch weight initialization to speed up creating the SD model from config" change added in 0c3feb2, since it seems to have a significant impact on some models, almost completely breaking them. I'm not sure that would fix this bug for everyone, but it might. If the error also happens for people on the SD 2.1 base model, you might want to revert the disabling of torch weight initialization entirely. Maybe the weights need to be initialized for models to work properly.
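
A minimal sketch of what such an opt-out could look like around ata4's snippet above, inside sd_models.py. The settings key "disable_weight_init_speedup" is hypothetical, and shared.opts and sys are assumed to already be available in that module:

    # Hypothetical settings key; not an existing webui option.
    use_fast_init = shared.opts.data.get("disable_weight_init_speedup", True)

    sd_model = None

    if use_fast_init:
        try:
            with sd_disable_initialization.DisableInitialization():
                sd_model = instantiate_from_config(sd_config.model)
        except Exception:
            pass

    if sd_model is None:
        print('Failed to create model quickly; will retry using slow method.', file=sys.stderr)
        sd_model = instantiate_from_config(sd_config.model)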

arpowers commented 1 year ago

@Jonseed it downloads lots of models, tokenizers, CLIP, etc. The models usually aren't versioned; basically, diffusers just downloads the latest version.

I can't be sure yet (as I haven't found the exact change), but this is why I think there's a problem: there is an untracked change occurring.

swalsh76 commented 1 year ago

@ata4 ok, I think you're onto something there! I tested changing that code in sd_models.py to force "slow" torch weight initialization, and I don't get junk output or NaN errors, I don't have to switch models to get Protogen Infinity to generate good outputs, and I'm able to regenerate past images! That seems like a win!

I made this change and my first 20 or so gens worked well, but now I'm back to one out of every 3 runs NaN-Bombing, unfortunately.

Jonseed commented 1 year ago

@swalsh76 hmm, so why would it work for 20 generations, and then stop working... what changed after the 20 generations? Did you try restarting the server?

swalsh76 commented 1 year ago

@swalsh76 hmm, so why would it work for 20 generations, and then stop working... what changed after the 20 generations? Did you try restarting the server?

I'm beginning to believe that my issue is similar to the NaN error in this thread, but not quite the same. I'm running a 4070 Ti, and until the middle of last week I was working fine with only --xformers as my command-line argument. Then I started getting the NaN error, and yesterday I noticed that, tacked onto the end of the error spam, it suggested using --no-half-vae.

That does seem to fix it, but I'm at a loss to explain why I would suddenly need to do that with a 4070 Ti doing nothing remarkable with my txt2img gens.

NaughtDZ commented 1 year ago

Same error here... can't train anything.

Condor83 commented 1 year ago

I am running an RTX 3090 and had the same issue. --no-half has fixed it for me so far... but it doesn't make sense that we would need to do that with this hardware. I'm on Windows 11, Python 3.10.6. I wonder if this is a Windows 11 curse?

ghost commented 1 year ago

--no-half fixes it for me as well on a 4090, Windows 11, Python 3.10.8.

NaughtDZ commented 1 year ago

I am running an RTX 3090 and had the same issue. --no-half has fixed it for me so far... but it doesn't make sense that we would need to do that with this hardware. I'm on Windows 11, Python 3.10.6. I wonder if this is a Windows 11 curse?

Well, I am using a 3070, and if I add --no-half together with --xformers, training an embedding gives an out-of-memory error, so... I just use a legacy version of the WebUI and it works fine.

swalsh76 commented 1 year ago

I am running an RTX 3090 and had the same issue. --no-half has fixed it for me so far... but it doesn't make sense that we would need to do that with this hardware. I'm on Windows 11, Python 3.10.6. I wonder if this is a Windows 11 curse?

I'm running Windows 10 22H2, so probably not.

Stephenitis commented 1 year ago

--no-half did not fix this for me.

It's broken on some models and I'm unsure why.

M1 Max, 64 GB RAM, Ventura 13.2

Python 3.9.16

Some combination of restarting the UI, redownloading the model, and restarting my browser worked, but the bug has reappeared twice for me. I tend to interrupt renders often. TBC.

TheGermanEngie commented 1 year ago

I am running an RTX 3090 and had the same issue. --no-half has fixed it for me so far... but it doesn't make sense that we would need to do that with this hardware. I'm on Windows 11, Python 3.10.6. I wonder if this is a Windows 11 curse?

Nope, I'm running a 3090 and am on W10. --no-half also fixed it for me so far.

Jonseed commented 1 year ago

@Stephenitis on which models is it broken for you? There might be a commonality between these models. Are they pruned models, fp16 models, merged models, ema-only models, ckpt or safetensor models, native SD models?