AbdBarho / stable-diffusion-webui-docker

Easy Docker setup for Stable Diffusion with user-friendly UI

Error building torch on clean `docker compose --profile auto up --build` #420

Closed dmarx closed 1 year ago

dmarx commented 1 year ago

Has this issue been opened before?

Describe the bug

First attempt at building. The `docker compose --profile download up --build` step worked fine; attempting to run `docker compose --profile auto up --build` resulted in the following error:

=> => extracting sha256:3fd92eeca8f54976c24de929011349e191dc349bf932629b  0.0s
 => [xformers 2/3] RUN apk add --no-cache aria2                            3.0s
 => [xformers 3/3] RUN aria2c -x 5 --dir / --out wheel.whl 'https://gith  24.0s
 => [download 2/8] COPY clone.sh /clone.sh                                 0.1s
 => [download 3/8] RUN . /clone.sh taming-transformers https://github.co  16.0s
 => [download 4/8] RUN . /clone.sh stable-diffusion-stability-ai https:/  11.4s
 => [download 5/8] RUN . /clone.sh CodeFormer https://github.com/sczhou/C  2.0s
 => [download 6/8] RUN . /clone.sh BLIP https://github.com/salesforce/BLI  1.8s
 => [download 7/8] RUN . /clone.sh k-diffusion https://github.com/crowson  0.8s
 => [download 8/8] RUN . /clone.sh clip-interrogator https://github.com/p  0.9s
 => ERROR [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip  64.7s
------
 > [stage-2  2/15] RUN --mount=type=cache,target=/root/.cache/pip   pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117:
#0 1.315 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117
#0 2.060 Collecting torch==1.13.1+cu117
#0 2.077   Downloading https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1801.8 MB)
#0 63.72      ━━━━━━━━━━━                              0.5/1.8 GB 11.7 MB/s eta 0:01:52
#0 63.72 ERROR: Exception:
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 437, in _error_catcher
#0 63.72     yield
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
#0 63.72     data = self._fp_read(amt) if not fp_closed else b""
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 526, in _fp_read
#0 63.72     return self._fp.read(amt) if amt is not None else self._fp.read()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 90, in read
#0 63.72     data = self.__fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/http/client.py", line 465, in read
#0 63.72     s = self.fp.read(amt)
#0 63.72   File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
#0 63.72     return self._sock.recv_into(b)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
#0 63.72     return self.read(nbytes, buffer)
#0 63.72   File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
#0 63.72     return self._sslobj.read(len, buffer)
#0 63.72 TimeoutError: The read operation timed out
#0 63.72 
#0 63.72 During handling of the above exception, another exception occurred:
#0 63.72 
#0 63.72 Traceback (most recent call last):
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
#0 63.72     status = run_func(*args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
#0 63.72     return func(self, options, args)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
#0 63.72     requirement_set = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
#0 63.72     result = self._result = resolver.resolve(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
#0 63.72     state = resolution.resolve(requirements, max_rounds=max_rounds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
#0 63.72     self._add_to_criteria(self.state.criteria, r, parent=None)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
#0 63.72     if not criterion.candidates:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
#0 63.72     return bool(self._sequence)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
#0 63.72     return any(self)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
#0 63.72     return (c for c in iterator if id(c) not in self._incompatible_ids)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
#0 63.72     candidate = func()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
#0 63.72     self._link_candidate_cache[link] = LinkCandidate(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
#0 63.72     super().__init__(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
#0 63.72     self.dist = self._prepare()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
#0 63.72     dist = self._prepare_distribution()
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
#0 63.72     return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
#0 63.72     return self._prepare_linked_requirement(req, parallel_builds)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 536, in _prepare_linked_requirement
#0 63.72     local_file = unpack_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 166, in unpack_url
#0 63.72     file = get_http_url(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 107, in get_http_url
#0 63.72     from_path, content_type = download(link, temp_dir.path)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/download.py", line 147, in __call__
#0 63.72     for chunk in chunks:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/cli/progress_bars.py", line 53, in _rich_progress_bar
#0 63.72     for chunk in iterable:
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 63, in response_chunks
#0 63.72     for chunk in response.raw.stream(
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 621, in stream
#0 63.72     data = self.read(amt=amt, decode_content=decode_content)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 559, in read
#0 63.72     with self._error_catcher():
#0 63.72   File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
#0 63.72     self.gen.throw(typ, value, traceback)
#0 63.72   File "/usr/local/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 442, in _error_catcher
#0 63.72     raise ReadTimeoutError(self._pool, None, "Read timed out.")
#0 63.72 pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='download.pytorch.org', port=443): Read timed out.
#0 64.51 
#0 64.51 [notice] A new release of pip available: 22.3.1 -> 23.1.1
#0 64.51 [notice] To update, run: pip install --upgrade pip
------
failed to solve: executor failed running [/bin/sh -c pip install torch==1.13.1+cu117 torchvision --extra-index-url https://download.pytorch.org/whl/cu117]: exit code: 2

Which UI

auto

Hardware / Software

- Server: Docker Desktop 4.18.0 (104112)
- Engine: 20.10.24 (API version: 1.41, minimum 1.12), Go go1.19.7, Git commit 5d6db84, built Tue Apr 4 18:18:42 2023, linux/amd64, experimental: false
- containerd: 1.6.18 (GitCommit 2456e983eb9e37e47538f59ea18f2043c9a73640)
- runc: 1.1.4 (GitCommit v1.1.4-0-g5fd4c4d)
- docker-init: 0.19.0 (GitCommit de40ad0)
- Docker compose version: v2.17.2
- Repo version: 2a0de025e2aece3d25178d2ade55065b82b54511
- RAM: plenty
- GPU/VRAM: 3090
dmarx commented 1 year ago

Let's see if this works.

EDIT: yeah... don't do this. That torch version is pinned for a reason.

dmarx commented 1 year ago

New error now after unpinning torch:

 => [stage-2 15/15] WORKDIR /stable-diffusion-webui                                            0.0s 
 => exporting to image                                                                        21.1s 
 => => exporting layers                                                                       21.1s 
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742   0.0s 
 => => naming to docker.io/library/sd-auto:51                                                  0.0s
[+] Running 1/1
 ✔ Container webui-docker-auto-1  Created                                                      0.2s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

I think this means I'm missing my CUDA drivers?

dmarx commented 1 year ago

Confirmed... I didn't have my CUDA stack configured :/

For posterity:

nvidia-smi works properly, and so does the hello-world NVIDIA Docker container, but I'm still getting the same error :(

dmarx commented 1 year ago

Deleted and rebuilt the containers and images; still no luck.

=> => exporting layers                                                                                                        0.0s
 => => writing image sha256:500eb74eac4bb4c9d06516f9f971fdbee75013b509c002666788f73fbe08b742                                   0.0s
 => => naming to docker.io/library/sd-auto:51                                                                                  0.0s
[+] Running 1/0
 ✔ Container webui-docker-auto-1  Created                                                                                      0.0s 
Attaching to webui-docker-auto-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
dmarx commented 1 year ago

Tried sudo-ing the command, which seems to have at least gotten past the previous error. I believe the root of the problem is discussed here: https://github.com/NVIDIA/nvidia-container-toolkit/issues/154
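For anyone hitting the same sudo-only behavior, a couple of checks are worth sketching. This is a hedged set of diagnostics under assumptions (the config path and `no-cgroups` key reflect a typical nvidia-container-toolkit install), not necessarily what the linked issue concludes:

```shell
# Diagnostics for "GPU only visible under sudo" (assumes default install paths).
# 1. Check which Docker daemon the non-root client talks to -- Docker Desktop
#    and the system engine are separate daemons with separate runtime configs,
#    so the toolkit may be wired into one but not the other.
docker context ls

# 2. Inspect the container runtime config; with a rootful daemon,
#    no-cgroups should be false (true is intended for rootless setups).
grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml
```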

dmarx commented 1 year ago

The services build now, but I'm getting an error when trying to run a test prompt with everything else set to defaults...

webui-docker-auto-1  | Running on local URL:  http://0.0.0.0:7860
webui-docker-auto-1  | 
webui-docker-auto-1  | To create a public link, set `share=True` in `launch()`.
webui-docker-auto-1  | Startup time: 13.9s (import gradio: 0.8s, import ldm: 0.4s, other imports: 1.2s, load scripts: 0.2s, load SD checkpoint: 10.9s, create ui: 0.1s).
webui-docker-auto-1  | Error completing request
webui-docker-auto-1  | Arguments: ('task(td9v3amy7jrkdya)', 'a delicious cheeseburger', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, '', False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0) {}
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 56, in f
webui-docker-auto-1  |     res = list(func(*args, **kwargs))
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/call_queue.py", line 37, in f
webui-docker-auto-1  |     res = func(*args, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/txt2img.py", line 56, in txt2img
webui-docker-auto-1  |     processed = process_images(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 486, in process_images
webui-docker-auto-1  |     res = process_images_inner(p)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 625, in process_images_inner
webui-docker-auto-1  |     uc = get_conds_with_caching(prompt_parser.get_learned_conditioning, negative_prompts, p.steps, cached_uc)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/processing.py", line 570, in get_conds_with_caching
webui-docker-auto-1  |     cache[1] = function(shared.sd_model, required_prompts, steps)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/prompt_parser.py", line 140, in get_learned_conditioning
webui-docker-auto-1  |     conds = model.get_learned_conditioning(texts)
webui-docker-auto-1  |   File "/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 669, in get_learned_conditioning
webui-docker-auto-1  |     c = self.cond_stage_model(c)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
webui-docker-auto-1  |     return forward_call(*input, **kwargs)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 229, in forward
webui-docker-auto-1  |     z = self.process_tokens(tokens, multipliers)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 254, in process_tokens
webui-docker-auto-1  |     z = self.encode_with_transformers(tokens)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/sd_hijack_clip.py", line 302, in encode_with_transformers
webui-docker-auto-1  |     outputs = self.wrapped.transformer(input_ids=tokens, output_hidden_states=-opts.CLIP_stop_at_last_layers)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
webui-docker-auto-1  |     result = hook(self, input)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/lowvram.py", line 35, in send_me_to_gpu
webui-docker-auto-1  |     module.to(devices.device)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
webui-docker-auto-1  |     return self._apply(convert)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
webui-docker-auto-1  |     module._apply(fn)
webui-docker-auto-1  |   [Previous line repeated 2 more times]
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
webui-docker-auto-1  |     param_applied = fn(param)
webui-docker-auto-1  |   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
webui-docker-auto-1  |     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
webui-docker-auto-1  | RuntimeError: CUDA error: unspecified launch failure
webui-docker-auto-1  | CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
webui-docker-auto-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
AbdBarho commented 1 year ago

The first error you got was just a timeout caused by a flaky internet connection; if you try building again, it should be fixed (hopefully). Please keep PyTorch pinned, otherwise you will get a lot of unexpected errors.
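If the retry keeps timing out, one hedged workaround (a sketch, not a change the repo itself makes) is to raise pip's read timeout and retry count in the failing Dockerfile `RUN` step while keeping the version pin intact; `--timeout` and `--retries` are standard pip options:

```shell
# Same install as the Dockerfile's failing step, but with a longer read
# timeout and more retries so a slow mirror doesn't abort the ~1.8 GB
# torch wheel download partway through. The version pin is unchanged.
pip install --timeout 120 --retries 10 \
    torch==1.13.1+cu117 torchvision \
    --extra-index-url https://download.pytorch.org/whl/cu117
```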

The second error seems weird. What is the output of this command?

docker run --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

If you get the same error, then it is probably a problem with Docker not being able to see your GPU.

Make sure you have the NVIDIA Container Toolkit installed and working.
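The toolkit install and verification can be sketched as follows; this assumes Ubuntu/Debian with NVIDIA's apt repository already configured (repository setup not shown -- see NVIDIA's install guide), and uses the same test image as the command above:

```shell
# 1. Install the NVIDIA Container Toolkit (assumes NVIDIA's apt repo is set up).
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# 2. Register the NVIDIA runtime with the Docker daemon and restart it.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Smoke test: the container should print the same table as host nvidia-smi.
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```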

dmarx commented 1 year ago

I think the issue might've been that I had nvidia-container-toolkit-base installed as well. I uninstalled both, reinstalled nvidia-container-toolkit, restarted, and I've got the test image generating successfully now. Not sure if the issue was that package or whether I just needed to restart. I'm only able to get Docker to see my GPU when I run with sudo, though, which I'm not a huge fan of... Anyway, it looks like the issue was me not realizing I'd skipped the prerequisites on a too-fresh Ubuntu reinstall.
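For the sudo-only part: the usual fix (a sketch, assuming a standard rootful Docker engine rather than Docker Desktop) is to add your user to the `docker` group, which Docker's post-install docs describe:

```shell
# Add the current user to the docker group so the CLI can reach the daemon
# socket without sudo. Takes effect after logging out and back in,
# or immediately in the current shell via newgrp.
sudo usermod -aG docker "$USER"
newgrp docker

# Re-run the GPU smoke test without sudo to confirm.
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```

Note that membership in the `docker` group is root-equivalent, so this is a convenience/security trade-off rather than a pure fix.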

AviVarma commented 1 year ago

I've just had this issue too on Ubuntu 23.04. I fixed it by re-installing nvidia-container-toolkit!