NVIDIA / Stable-Diffusion-WebUI-TensorRT

TensorRT Extension for Stable Diffusion Web UI
MIT License
1.91k stars, 146 forks

Full Info Provided - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm) #25

Closed: FurkanGozukara closed this 1 year ago

FurkanGozukara commented 1 year ago

Yes, I just showed this.

On the dev branch of Automatic1111, ONNX generation fails.

So we generate the ONNX file on the master branch and use it on the dev branch, haha.

Watch the video below to learn how to compile an SDXL TensorRT engine and use it. It covers both the manual and the automatic way.

RTX Acceleration Quick Tutorial With Auto Installer V2 SDXL - Tensor RT

Here is literally everything you may need. I can also follow your instructions and test them.

SD 1.5-based models are working; I even made a tutorial:

RTX Acceleration Quick Tutorial With Auto Installer

Done on a fresh install, with Python 3.10.11 and cuDNN windows-x86_64-8.9.4.25.

First, let me show the pip freeze output:

Microsoft Windows [Version 10.0.19045.3570]
(c) Microsoft Corporation. All rights reserved.

G:\auto_quick\stable-diffusion-webui\venv\Scripts>activate

(venv) G:\auto_quick\stable-diffusion-webui\venv\Scripts>pip freeze
absl-py==2.0.0
accelerate==0.21.0
addict==2.4.0
aenum==3.1.15
aiofiles==23.2.1
aiohttp==3.8.6
aiosignal==1.3.1
altair==5.1.2
antlr4-python3-runtime==4.9.3
anyio==3.7.1
async-timeout==4.0.3
attrs==23.1.0
basicsr==1.4.2
beautifulsoup4==4.12.2
blendmodes==2022
boltons==23.0.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.3.0
clean-fid==0.1.35
click==8.1.7
clip==1.0
colorama==0.4.6
contourpy==1.1.1
cycler==0.12.1
deprecation==2.1.0
einops==0.4.1
exceptiongroup==1.1.3
facexlib==0.3.0
fastapi==0.94.0
ffmpy==0.3.1
filelock==3.12.4
filterpy==1.4.5
fonttools==4.43.1
frozenlist==1.4.0
fsspec==2023.9.2
ftfy==6.1.1
future==0.18.3
gdown==4.7.1
gfpgan==1.3.8
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.23.3
google-auth-oauthlib==1.1.0
gradio==3.41.2
gradio_client==0.5.0
grpcio==1.59.0
h11==0.12.0
httpcore==0.15.0
httpx==0.24.1
huggingface-hub==0.18.0
idna==3.4
imageio==2.31.5
importlib-metadata==6.8.0
importlib-resources==6.1.0
inflection==0.5.1
Jinja2==3.1.2
jsonmerge==1.8.0
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
kiwisolver==1.4.5
kornia==0.6.7
lark==1.1.2
lazy_loader==0.3
lightning-utilities==0.9.0
llvmlite==0.41.0
lmdb==1.4.1
lpips==0.1.4
Markdown==3.5
MarkupSafe==2.1.3
matplotlib==3.8.0
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
numba==0.58.0
numpy==1.23.5
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==2022.4.25
nvidia-cuda-runtime-cu117==11.7.60
nvidia-cudnn-cu11==8.9.4.25
oauthlib==3.2.2
omegaconf==2.2.3
onnx==1.14.1
onnx-graphsurgeon==0.3.27
open-clip-torch==2.20.0
opencv-python==4.8.1.78
orjson==3.9.9
packaging==23.2
pandas==2.1.1
piexif==1.1.3
Pillow==9.5.0
platformdirs==3.11.0
polygraphy==0.49.0
protobuf==3.20.2
psutil==5.9.5
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
pydub==0.25.1
pyparsing==3.1.1
PySocks==1.7.1
python-dateutil==2.8.2
python-multipart==0.0.6
pytorch-lightning==1.9.4
pytz==2023.3.post1
PyWavelets==1.4.1
PyYAML==6.0.1
realesrgan==0.3.0
referencing==0.30.2
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
resize-right==0.0.2
rpds-py==0.10.6
rsa==4.9
safetensors==0.3.1
scikit-image==0.21.0
scipy==1.11.3
semantic-version==2.10.0
sentencepiece==0.1.99
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
soupsieve==2.5
starlette==0.26.1
sympy==1.12
tb-nightly==2.15.0a20231017
tensorboard-data-server==0.7.1
tensorrt==9.0.1.post11.dev4
tensorrt-bindings==9.0.1.post11.dev4
tensorrt-libs==9.0.1.post11.dev4
tifffile==2023.9.26
timm==0.9.2
tokenizers==0.13.3
tomesd==0.1.3
tomli==2.0.1
toolz==0.12.0
torch==2.0.1+cu118
torchdiffeq==0.2.3
torchmetrics==1.2.0
torchsde==0.2.5
torchvision==0.15.2+cu118
tqdm==4.66.1
trampoline==0.1.2
transformers==4.30.2
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
uvicorn==0.23.2
wcwidth==0.2.8
websockets==11.0.3
Werkzeug==3.0.0
xformers==0.0.20
yapf==0.40.2
yarl==1.9.2
zipp==3.17.0

(venv) G:\auto_quick\stable-diffusion-webui\venv\Scripts>
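
Most of the pins above are general web UI dependencies. For a report like this one, the versions that usually matter are those of the GPU stack. A minimal sketch of filtering them out of `pip freeze` output follows; the helper names and the `KEY_PACKAGES` selection are illustrative assumptions, not part of the web UI or the extension.

```python
# Hypothetical helper: pick the GPU-stack pins out of `pip freeze` output.
# KEY_PACKAGES is an assumed selection; adjust it to the problem at hand.
KEY_PACKAGES = {
    "torch", "torchvision", "xformers", "tensorrt",
    "onnx", "onnx-graphsurgeon", "polygraphy", "nvidia-cudnn-cu11",
}


def parse_freeze(text):
    """Return {name: version} for every `name==version` line."""
    pins = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip prompts, blank lines and anything without a pin
            pins[name.lower()] = version
    return pins


def gpu_stack(text):
    """Keep only the packages listed in KEY_PACKAGES, sorted by name."""
    pins = parse_freeze(text)
    return {name: pins[name] for name in sorted(KEY_PACKAGES) if name in pins}


# A few lines copied from the listing above.
SAMPLE = """\
onnx==1.14.1
polygraphy==0.49.0
tensorrt==9.0.1.post11.dev4
torch==2.0.1+cu118
tqdm==4.66.1
xformers==0.0.20
"""

for name, version in gpu_stack(SAMPLE).items():
    print(f"{name}=={version}")
```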

Here is the full Automatic1111 log:

venv "G:\auto_quick\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: v1.6.0
Commit hash: 5ef669de080814067961f28357256e8fe27544f4
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: protobuf==3.20.2 in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (3.20.2)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com, https://pypi.ngc.nvidia.com
Requirement already satisfied: onnx-graphsurgeon in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (0.3.27)
Requirement already satisfied: numpy in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (from onnx-graphsurgeon) (1.23.5)
Requirement already satisfied: onnx in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (from onnx-graphsurgeon) (1.14.1)
Requirement already satisfied: typing-extensions>=3.6.2.1 in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (from onnx->onnx-graphsurgeon) (4.8.0)
Requirement already satisfied: protobuf>=3.20.2 in g:\auto_quick\stable-diffusion-webui\venv\lib\site-packages (from onnx->onnx-graphsurgeon) (3.20.2)
GS is not installed! Installing...
Installing protobuf
Installing onnx-graphsurgeon
UI Config not initialized
Launching Web UI with arguments: --xformers
Loading weights [31e35c80fc] from G:\auto_quick\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors
Creating model from config: G:\auto_quick\stable-diffusion-webui\repositories\generative-models\configs\inference\sd_xl_base.yaml
Loading VAE weights specified in settings: G:\auto_quick\stable-diffusion-webui\models\VAE\fp16_sdxl_vae.safetensors
Applying attention optimization: xformers... done.
Model loaded in 3.4s (load weights from disk: 0.7s, create model: 0.2s, apply weights to model: 2.2s, load VAE: 0.1s, calculate empty prompt: 0.1s).
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 19.0s (prepare environment: 9.3s, import torch: 1.6s, import gradio: 0.5s, setup paths: 0.4s, initialize shared: 0.2s, other imports: 0.3s, load scripts: 3.0s, create ui: 3.5s, gradio launch: 0.2s).
{'sample': [(1, 4, 96, 96), (2, 4, 128, 128), (8, 4, 128, 128)], 'timesteps': [(1,), (2,), (8,)], 'encoder_hidden_states': [(1, 77, 2048), (2, 77, 2048), (8, 154, 2048)], 'y': [(1, 2816), (2, 2816), (8, 2816)]}
Building TensorRT engine for G:\auto_quick\stable-diffusion-webui\models\Unet-onnx\sd_xl_base_1.0_be9edd61.onnx: G:\auto_quick\stable-diffusion-webui\models\Unet-trt\sd_xl_base_1.0_be9edd61_cc86_sample=1x4x96x96+2x4x128x128+8x4x128x128-timesteps=1+2+8-encoder_hidden_states=1x77x2048+2x77x2048+8x154x2048-y=1x2816+2x2816+8x2816.trt
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[I] Loading tactic timing cache from G:\auto_quick\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\timing_caches\timing_cache_win_cc86.cache
[I] Building engine with configuration:
    Flags                  | [FP16, REFIT, TF32]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 24563.50 MiB, TACTIC_DRAM: 24563.50 MiB]
    Tactic Sources         | [CUBLAS, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.LAYER_NAMES_ONLY
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
Building engine: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [07:38<00:00, 76.41s/it]
[I] Finished engine building in 462.608 seconds
[I] Saving tactic timing cache to G:\auto_quick\stable-diffusion-webui\extensions\Stable-Diffusion-WebUI-TensorRT\timing_caches\timing_cache_win_cc86.cache
[I] Saving engine to G:\auto_quick\stable-diffusion-webui\models\Unet-trt\sd_xl_base_1.0_be9edd61_cc86_sample=1x4x96x96+2x4x128x128+8x4x128x128-timesteps=1+2+8-encoder_hidden_states=1x77x2048+2x77x2048+8x154x2048-y=1x2816+2x2816+8x2816.trt
Downloading VAEApprox model to: G:\auto_quick\stable-diffusion-webui\models\VAE-approx\vaeapprox-sdxl.pt
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 209k/209k [00:00<00:00, 1.43MB/s]
Activating unet: [TRT] sd_xl_base_1.0
Loading TensorRT engine: G:\auto_quick\stable-diffusion-webui\models\Unet-trt\sd_xl_base_1.0_be9edd61_cc86_sample=1x4x96x96+2x4x128x128+8x4x128x128-timesteps=1+2+8-encoder_hidden_states=1x77x2048+2x77x2048+8x154x2048-y=1x2816+2x2816+8x2816.trt
[I] Loading bytes from G:\auto_quick\stable-diffusion-webui\models\Unet-trt\sd_xl_base_1.0_be9edd61_cc86_sample=1x4x96x96+2x4x128x128+8x4x128x128-timesteps=1+2+8-encoder_hidden_states=1x77x2048+2x77x2048+8x154x2048-y=1x2816+2x2816+8x2816.trt
Profile 0:
        sample = [(1, 4, 96, 96), (2, 4, 128, 128), (8, 4, 128, 128)]
        timesteps = [(1,), (2,), (8,)]
        encoder_hidden_states = [(1, 77, 2048), (2, 77, 2048), (8, 154, 2048)]
        y = [(1, 2816), (2, 2816), (8, 2816)]
        latent = [(-1946014208), (-1946015985), (-1946020080)]

  0%|                                                                                                                      | 0/150 [00:00<?, ?it/s]
*** Error completing request
*** Arguments: ('task(nxb62g6nrm8ro2s)', 'photo of a car ', '', [], 150, 'DPM++ 2M SDE Karras', 1, 1, 7, 1024, 1024, False, 0.7, 2, 'Latent', 0, 0, 0, 'Use same checkpoint', 'Use same sampler', '', '', [], <gradio.routes.Request object at 0x000001DBD1C626B0>, 0, False, '', 0.8, 2608786895, False, -1, 0, 0, 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, False) {}
    Traceback (most recent call last):
      File "G:\auto_quick\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
      File "G:\auto_quick\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\modules\txt2img.py", line 55, in txt2img
        processed = processing.process_images(p)
      File "G:\auto_quick\stable-diffusion-webui\modules\processing.py", line 732, in process_images
        res = process_images_inner(p)
      File "G:\auto_quick\stable-diffusion-webui\modules\processing.py", line 867, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
      File "G:\auto_quick\stable-diffusion-webui\modules\processing.py", line 1140, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 235, in sample
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_samplers_common.py", line 261, in launch_sampling
        return func()
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 235, in <lambda>
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 626, in sample_dpmpp_2m_sde
        denoised = model(x, sigmas[i] * s_in, **extra_args)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_samplers_cfg_denoiser.py", line 169, in forward
        x_out = self.inner_model(x_in, sigma_in, cond=make_condition_dict(cond_in, image_cond_in))
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
        eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
        return self.inner_model.apply_model(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_models_xl.py", line 37, in apply_model
        return self.model(x, t, cond)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_hijack_utils.py", line 17, in <lambda>
        setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
      File "G:\auto_quick\stable-diffusion-webui\modules\sd_hijack_utils.py", line 28, in __call__
        return self.__orig_func(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\repositories\generative-models\sgm\modules\diffusionmodules\wrappers.py", line 28, in forward
        return self.diffusion_model(
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\repositories\generative-models\sgm\modules\diffusionmodules\openaimodel.py", line 984, in forward
        emb = self.time_embed(t_emb)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
        input = module(input)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "G:\auto_quick\stable-diffusion-webui\extensions-builtin\Lora\networks.py", line 429, in network_Linear_forward
        return originals.Linear_forward(self, input)
      File "G:\auto_quick\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
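
The "Profile 0" lines in the log list a (min, opt, max) shape triple for each engine input. A minimal sketch of checking whether a request falls inside that range follows. It assumes 4 latent channels and 8x VAE downsampling (so 1024x1024 pixels become 128x128 latents); `fits` and `sample_shape` are hypothetical helpers, not functions of the extension.

```python
# Profile 0 as printed in the log: input name -> [min, opt, max] shapes.
PROFILE = {
    "sample": [(1, 4, 96, 96), (2, 4, 128, 128), (8, 4, 128, 128)],
    "timesteps": [(1,), (2,), (8,)],
    "encoder_hidden_states": [(1, 77, 2048), (2, 77, 2048), (8, 154, 2048)],
    "y": [(1, 2816), (2, 2816), (8, 2816)],
}


def fits(shape, bounds):
    """True if every dimension of `shape` lies within [min, max]."""
    lo, _opt, hi = bounds
    return len(shape) == len(lo) and all(
        a <= s <= b for a, s, b in zip(lo, shape, hi)
    )


def sample_shape(width, height, unet_batch):
    """Latent shape for an image size (assumption: 4 channels, 8x downsampling)."""
    return (unet_batch, 4, height // 8, width // 8)


# The failing request in the log was 1024x1024; try UNet batch sizes 1 and 2.
for batch in (1, 2):
    shape = sample_shape(1024, 1024, batch)
    print(shape, fits(shape, PROFILE["sample"]))
```

Under these assumptions the 1024x1024 request fits the profile, which suggests the device error above is not caused by a shape mismatch.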

FurkanGozukara commented 1 year ago

The Automatic1111 dev version works for SDXL.

The speed is amazing:

6.14 it/s

nailz420 commented 1 year ago

> 6.14 it / second

This means nothing without your hardware specs and test case details.

camoody1 commented 1 year ago

Your video said nothing at all about the error you have in the post title.

Thank you for wasting my time.

contentis commented 1 year ago

Thank you for your contribution! The README has now been updated to inform users that SDXL depends on a fix in the dev branch.

FurkanGozukara commented 1 year ago

> > 6.14 it / second
>
> This means nothing without your hardware specs and test case details

RTX 3090 Ti, 1024x1024.

I am preparing a big video, @camoody1.

That small video is just a preview.

I am also preparing an auto-downloader for precompiled TensorRT engines.

FurkanGozukara commented 1 year ago

Watch the video below to learn how to compile an SDXL TensorRT engine and use it. It covers both the manual and the automatic way.

RTX Acceleration Quick Tutorial With Auto Installer V2 SDXL - Tensor RT

maxious commented 1 year ago

FWIW, the resolution to the "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" error is to disable the "medvram"/"lowvram" optimisation: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations
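
The suggestion above amounts to removing the low-VRAM flags from the launch arguments. A minimal sketch of checking an argument string follows. `--medvram` and `--lowvram` are the flags named in the comment; including `--medvram-sdxl` in the set and the helper name `split_low_vram_flags` are assumptions. Note that the log in the first post shows the web UI launched with `--xformers` only.

```python
# Flags from the comment above, plus --medvram-sdxl (assumed to belong
# to the same family of optimisations).
LOW_VRAM_FLAGS = {"--medvram", "--lowvram", "--medvram-sdxl"}


def split_low_vram_flags(commandline_args):
    """Return (args without low-VRAM flags, list of flags that were removed)."""
    tokens = commandline_args.split()
    kept = [t for t in tokens if t not in LOW_VRAM_FLAGS]
    removed = [t for t in tokens if t in LOW_VRAM_FLAGS]
    return " ".join(kept), removed


# Hypothetical launch arguments, e.g. the value of COMMANDLINE_ARGS.
cleaned, removed = split_low_vram_flags("--xformers --medvram")
print("launch with:", cleaned)
print("removed:", removed)
```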

FurkanGozukara commented 1 year ago

> FWIW, the resolution to the "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" error is to disable the "medvram"/"lowvram" optimisation: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations

This is important; I will mention it in the big tutorial too.

Still, for SDXL you need the dev branch.