AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: NansException: A tensor with all NaNs was produced in Unet. Use --disable-nan-check commandline argument to disable this check. #12131

Closed. gurilagardnr closed this issue 1 year ago.

gurilagardnr commented 1 year ago

Is there an existing issue for this?

What happened?

--no-half --no-half-vae --precision full: nothing helps. The problems only started with torch 2.0.1. xformers doesn't matter; sdp doesn't matter; which checkpoint is used doesn't matter; with or without a VAE loaded; with extensions, or with all extensions unloaded. I've done reinstalls. I've tried main and dev. I can't run 1.5 or SDXL. I can't run automatic1111 or its forks. I've tried every Nvidia driver from 320 through the current 536. I've reinstalled Windows on a separate drive and installed there. I've done parallel installs of different versions of automatic1111. This worked fine for months until a few days ago, prior to the release of 1.5.

I've researched and fought with this all week. I have other computers running this without an issue. Are there any clues out there? No, it's not the persistent cond cache. I have painstakingly toggled EVERY setting in the UI and restarted the server. Sometimes I can generate a few images. Sometimes it goes straight to NaNs on the first attempt. Sometimes it outputs something that has nothing to do with the prompt. Sometimes it generates random fractal images, sometimes it looks like static, other times it's the beige random noise. It always ends in NaNs.
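(Editor's note, not from the report:) the usual mechanism behind UNet NaNs, and the reason flags like --no-half exist at all, is float16 overflow: the largest finite float16 value is 65504, so an activation that exceeds it becomes inf, and a later subtraction or division turns inf into NaN, which then propagates to the whole tensor. A minimal NumPy illustration of that chain (it does not explain this particular case, since forcing full precision didn't help here):

```python
import numpy as np

# float16 tops out at 65504; one multiply past that overflows to +inf.
x = np.float16(60000.0)
y = np.float16(2.0) * x                # overflows: y is +inf
z = y - y                              # inf - inf is NaN

assert np.isinf(y)
assert np.isnan(z)
assert np.isnan(z + np.float16(1.0))   # and NaN propagates through every later op
```

Running the UNet in float32 (--no-half) avoids this overflow path entirely, at the cost of speed and VRAM.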

Steps to reproduce the problem

  1. Open automatic1111
  2. Attempt to generate an image
  3. Get the NaN error

What should have happened?

An image should have been generated without error.

Version or Commit where the problem happens

6ce0161689

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Windows

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

None

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

I've tried --no-half --no-half-vae --precision full together and separately.

List of extensions

I've tried with all extensions disabled, with all extensions removed from the extensions directory, and with all built-in extensions removed and/or disabled.

Console logs

venv "F:\Stable Diffusion 3\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.5.1-55-g25004d4e
Commit hash: 25004d4eeef015d8f886c537d3a5a9f54d07898e
Launching Web UI with arguments: --autolaunch --xformers --no-half --no-half-vae
*** "Disable all extensions" option was set, will not load any extensions ***
Loading weights [6ce0161689] from F:\Stable Diffusion 3\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Creating model from config: F:\Stable Diffusion 3\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Startup time: 5.6s (prepare environment: 1.3s, launcher: 0.1s, import torch: 1.8s, import gradio: 0.5s, setup paths: 0.5s, other imports: 0.5s, list SD models: 0.1s, scripts list_optimizers: 0.3s, create ui: 0.3s, gradio launch: 0.2s).
Loading VAE weights specified in settings: F:\Stable Diffusion 3\stable-diffusion-webui\models\VAE\vae-ft-mse-840000-ema-pruned.safetensors
Applying attention optimization: xformers... done.
Model loaded in 3.2s (load weights from disk: 0.5s, create model: 0.3s, apply weights to model: 0.8s, load VAE: 0.2s, move model to device: 1.5s).
  5%|████▏                                                                              | 1/20 [00:00<00:05,  3.31it/s]
*** Error completing request                                                                    | 0/20 [00:00<?, ?it/s]
*** Arguments: ('task(sanwy5q1miu9grj)', 'a bird in a tree', '', [], 20, 0, False, False, 1, 1, 6, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, 0, '', '', [], <gradio.routes.Request object at 0x000001A63FB97370>, 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0) {}
    Traceback (most recent call last):
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\call_queue.py", line 58, in f
        res = list(func(*args, **kwargs))
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\call_queue.py", line 37, in f
        res = func(*args, **kwargs)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\txt2img.py", line 62, in txt2img
        processed = processing.process_images(p)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\processing.py", line 677, in process_images
        res = process_images_inner(p)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\processing.py", line 794, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\processing.py", line 1054, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 464, in sample
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 303, in launch_sampling
        return func()
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 464, in <lambda>
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
      File "F:\Stable Diffusion 3\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
        denoised = model(x, sigmas[i] * s_in, **extra_args)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 215, in forward
        devices.test_for_nans(x_out, "unet")
      File "F:\Stable Diffusion 3\stable-diffusion-webui\modules\devices.py", line 155, in test_for_nans
        raise NansException(message)
    modules.devices.NansException: A tensor with all NaNs was produced in Unet. Use --disable-nan-check commandline argument to disable this check.

---

Additional information

No response

dhwz commented 1 year ago

Does other software like ComfyUI still work on your device?

gurilagardnr commented 1 year ago

ComfyUI also produces NaNs.

dhwz commented 1 year ago

Hm, maybe a hardware defect?

gurilagardnr commented 1 year ago

That's the way I am leaning. But I wanted to put this out there just in case someone one day has a eureka moment.

ClashSAN commented 1 year ago

Then you may want to repost this to the torch library repository. What GPU is it?

gurilagardnr commented 1 year ago

It's a 4070 Ti 12GB, and that is a very good suggestion.

Kujoz commented 1 year ago

Hey, not sure if you're still having issues, but I encountered this same issue when working with SDXL and found a "fix".

Problem: I hit the issue when I changed my model from 1.5 to SDXL and tried to img2img. I received the same error message, and switching from the refiner to the normal model still produced the issue.

Workaround: generating a single image with the base SDXL model in txt2img let me go back to the img2img tab with all the previous parameters, which then worked completely normally.

Very unscientific, but it works for me now.

gurilagardnr commented 1 year ago

Thank you for the response. Unfortunately my issue appears to be more severe. Switching models sometimes provides relief, but the issue returns after a half-dozen image generations. The only workaround I have developed at this point is to completely shut down the PC. At this point I believe it is a bugged interaction between torch 2.0.1 and my MSI 4070 Ti GPU.

2blackbar commented 1 year ago

This is an error inside the webui itself: somewhere the math ends up dividing zeros, and that causes this exception. Telling people to use --no-half or other odd workarounds is a strange solution; I didn't buy my GPU to run it at half its speed, come on. This should be a major concern for the devs now, since it's getting worse with each release, and it looks like nobody is investigating how to prevent this kind of division.

dhwz commented 1 year ago

@2blackbar it seems you have no clue what you're talking about. The OP already said it's also happening with other software.

gurilagardnr commented 1 year ago

As to the software issue, it's weird too. I can't generate any images using SD.Next, the auto1111 fork; it produces NaNs immediately, regardless of configuration. ComfyUI is rarer, and only consistently produces NaNs when trying to use SDXL. I seem to have reached some sort of equilibrium with automatic1111 where it only consistently produces NaNs if I use too many dynamic prompts or push hires fix beyond 1.5 scale. The problem still exists, with complete shutdowns required after a few hundred generations, but I live with it. I swapped the 4070 Ti for a PNY 3060 Ti and no NaNs were produced by any software. So it is definitely an interaction between the MSI 4070 Ti and stable diffusion software, and something that all three software projects have in common, which is why I believe torch is the culprit at this point.

2blackbar commented 1 year ago

Found a fix that works 100% of the time https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/12292

gurilagardnr commented 1 year ago

I have tried similar things, i.e. switching the model and/or VAE, with random success and failure. I just finished attempting your method of using an identical, renamed model, with no success. I tried on both auto1111 and SD.Next. Thank you for the consideration, however.

nobody4t commented 1 year ago

I have the same issue on Mac M1, and the image I get is all black.

Please give me some clue or direction on how to fix this. Thanks in advance.

Can anyone help me understand: why isn't the diffusers library used here? It is so simple that I can produce an image with just a few lines of Python code.

00DB00 commented 1 year ago

I am also getting exactly this NaNs issue. The only thing I can think of is that some days ago I tried using SDXL, which didn't work properly for me, so I switched back to SD 1.5. Today when I was trying to generate images I tried changing between multiple checkpoints, since the images generated earlier were very good but today, with the same settings, they were not; the rest of the time I am getting the NaNs issue. Not sure what happened.

catboxanon commented 1 year ago

Duplicate of https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6923