Jack000 / glid-3-xl

1.4B latent diffusion model fine tuning
MIT License

RuntimeError: CUDA error: unknown error #9

Open Moltennn opened 2 years ago

Moltennn commented 2 years ago

I can't figure out why I'm getting this error:

python sample.py --model_path finetune.pt --batch_size 1 --num_batches 1 --text "a cyberpunk girl with a scifi neuralink device on her head"

Using device: cuda:0
Traceback (most recent call last):
  File "sample.py", line 284, in <module>
    ldm.to(device)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 121, in to
    return super().to(*args, **kwargs)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Trying to run with CUDA_LAUNCH_BLOCKING enabled

CUDA_LAUNCH_BLOCKING=1 python sample.py --model_path finetune.pt --batch_size 1 --num_batches 1 --text "a cyberpunk girl with a scifi neuralink device on her head"

Using device: cuda:0
Traceback (most recent call last):
  File "sample.py", line 284, in <module>
    ldm.to(device)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 121, in to
    return super().to(*args, **kwargs)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/moltenn/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: unknown error
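
The run above sets CUDA_LAUNCH_BLOCKING on the shell command line. As a minimal sketch, the same setting can be applied from inside Python, assuming it runs before CUDA is initialized (safest is the very top of the script, before `import torch`). Where exactly this would go in sample.py is not shown in the thread, so the placement is an assumption.

```python
import os

# Assumption: this executes before the first `import torch` / CUDA call in the
# process, so the setting is already in place when CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Confirm what the rest of the process will see.
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

As the second traceback shows, synchronous launches do not necessarily add detail when the failure is an "unknown error" raised during a plain host-to-device copy.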

pip freeze

absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
albumentations==0.4.3
altair==4.2.0
antlr4-python3-runtime==4.8
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
async-timeout==4.0.2
attrs==21.4.0
axial-positional-embedding==0.2.1
backcall==0.2.0
backports.zoneinfo==0.2.1
beautifulsoup4==4.11.1
bleach==5.0.1
blinker==1.4
blobfile==1.3.1
braceexpand==0.1.7
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1648854175163/work
cachetools==5.2.0
certifi==2022.6.15
cffi==1.15.0
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1655906222726/work
click==8.1.3
-e git+https://github.com/openai/CLIP.git@b46f5ac7587d2e1862f8b7b1573179d80dcdd620#egg=clip
commonmark==0.9.1
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1652967113783/work
DALL-E==0.1
dalle-pytorch==1.6.4
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
einops==0.4.1
entrypoints==0.4
executing==0.8.3
fastjsonschema==2.15.3
filelock==3.7.1
frozenlist==1.3.0
fsspec==2022.5.0
ftfy==6.1.1
future==0.18.2
gitdb==4.0.9
GitPython==3.1.27
google-auth==2.9.0
google-auth-oauthlib==0.4.6
grpcio==1.47.0
-e git+https://github.com/Jack000/glid-3-xl@a0b5be4b04378d4d4779240d3e0a599360c1a133#egg=guided_diffusion
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1642433548627/work
imageio==2.9.0
imageio-ffmpeg==0.4.2
imgaug==0.2.6
importlib-metadata==4.12.0
importlib-resources==5.8.0
iniconfig==1.1.1
ipykernel==6.15.0
ipython==8.4.0
ipython-genutils==0.2.0
ipywidgets==7.7.1
jedi==0.18.1
Jinja2==3.1.2
joblib==1.1.0
jsonschema==4.6.1
jupyter-client==7.3.4
jupyter-core==4.10.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.1
-e git+https://github.com/CompVis/latent-diffusion.git@5a6571e384f9a9b492bbfaca594a2b00cad55279#egg=latent_diffusion
Markdown==3.3.7
MarkupSafe==2.1.1
matplotlib-inline==0.1.3
mistune==0.8.4
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186064646/work
mkl-service==2.4.0
multidict==6.0.2
mypy==0.961
mypy-extensions==0.4.3
nbclient==0.6.5
nbconvert==6.5.0
nbformat==5.4.0
nest-asyncio==1.5.5
networkx==2.8.4
notebook==6.4.12
numpy @ file:///opt/conda/conda-bld/numpy_and_numpy_base_1654872176621/work
oauthlib==3.2.0
omegaconf==2.1.1
opencv-python==4.1.2.30
opencv-python-headless==4.6.0.66
packaging==21.3
pandas==1.4.3
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.0.1
pluggy==1.0.0
prometheus-client==0.14.1
prompt-toolkit==3.0.30
protobuf==3.19.4
psutil==5.9.1
ptyprocess==0.7.0
pudb==2019.2
pure-eval==0.2.2
py==1.11.0
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pycryptodomex==3.15.0
pydeck==0.7.1
pyDeprecate==0.3.2
Pygments==2.12.0
Pympler==1.0.1
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1643496850550/work
pyparsing==3.0.9
pyrsistent==0.18.1
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1648857275402/work
pytest==7.1.2
python-dateutil==2.8.2
pytorch-lightning==1.6.4
pytz==2022.1
pytz-deprecation-shim==0.1.0.post0
PyWavelets==1.3.0
PyYAML==6.0
pyzmq==23.2.0
regex==2022.6.2
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1656534056640/work
requests-oauthlib==1.3.1
rich==12.4.4
rotary-embedding-torch==0.1.5
rsa==4.8
sacremoses==0.0.53
scikit-image==0.19.3
scipy==1.8.1
semver==2.13.0
Send2Trash==1.8.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.0
soupsieve==2.3.2.post1
stack-data==0.3.0
streamlit==1.10.0
-e git+https://github.com/CompVis/taming-transformers.git@24268930bf1dce879235a7fddd0b2355b84d7ea6#egg=taming_transformers
taming-transformers-rom1504==0.0.6
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
terminado==0.15.0
test-tube==0.7.5
tifffile==2022.5.4
tinycss2==1.1.1
tokenizers==0.10.3
toml==0.10.2
tomli==2.0.1
toolz==0.11.2
torch==1.12.0
torch-fidelity==0.3.0
torchaudio==0.12.0
torchmetrics==0.9.2
torchvision==0.13.0
tornado==6.1
tqdm==4.64.0
traitlets==5.3.0
transformers==4.3.1
typing-extensions @ file:///opt/conda/conda-bld/typing_extensions_1647553014482/work
tzdata==2022.1
tzlocal==4.2
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1647489083693/work
urwid==2.1.2
validators==0.20.0
watchdog==2.1.9
wcwidth==0.2.5
webdataset==0.2.5
webencodings==0.5.1
Werkzeug==2.1.2
widgetsnbextension==3.6.1
xmltodict==0.12.0
yarl==1.7.2
youtokentome==1.0.6
zipp==3.8.0

limiteinductive commented 2 years ago

Your ldm model should not be using pytorch-lightning to load... maybe try uninstalling pytorch-lightning.

Moltennn commented 2 years ago

That didn't work. It just said something like "missing module pytorch-lightning". Anyway, I tried purging the whole container (or whatever those are called) and reinstalling. No success there either. All of this was done on WSL Ubuntu.

So I decided to install everything on Windows, and got this error. Guess my poor old GTX 970 isn't fit for this :D

python sample.py --model_path finetune.pt --batch_size 1 --num_batches 1 --text "a cyberpunk girl with a scifi neuralink device on her head"

Using device: cuda:0
Traceback (most recent call last):
  File "sample.py", line 284, in <module>
    ldm.to(device)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\pytorch_lightning\core\mixins\device_dtype_mixin.py", line 111, in to
    return super().to(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.47 GiB already allocated; 0 bytes free; 3.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
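
A rough back-of-the-envelope check of why a 4 GiB card runs out of memory here. This is only a sketch: it assumes the "1.4B" in the repository description is the parameter count, and it counts weights only (no activations, no CUDA context, no other models the script loads).

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 1.4e9    # assumption: "1.4B" is the number of parameters
GPU_TOTAL_GIB = 4.0   # from the error message: "4.00 GiB total capacity"

fp32 = weights_gib(NUM_PARAMS, 4)  # 4 bytes per float32 weight
fp16 = weights_gib(NUM_PARAMS, 2)  # 2 bytes per float16 weight

print(f"fp32 weights: {fp32:.2f} GiB")
print(f"fp16 weights: {fp16:.2f} GiB")
print(f"card total:   {GPU_TOTAL_GIB:.2f} GiB")
```

Under these assumptions, full-precision weights alone (about 5.2 GiB) exceed the card, and half-precision weights (about 2.6 GiB) leave little headroom for everything else.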

Then I tried the --cpu parameter to see how it would go:

python sample.py --cpu --model_path finetune.pt --batch_size 1 --num_batches 1 --text "a cyberpunk girl with a scifi neuralink device on her head"

Using device: cpu
Traceback (most recent call last):
  File "sample.py", line 522, in <module>
    do_run()
  File "sample.py", line 307, in do_run
    text_emb = bert.encode([args.text]*args.batch_size).to(device).float()
  File "C:\Users\Administrator\txt2img\glid-3-xl\encoders\modules.py", line 99, in encode
    return self(text)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\txt2img\glid-3-xl\encoders\modules.py", line 94, in forward
    z = self.transformer(tokens, return_embeddings=True)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\txt2img\glid-3-xl\encoders\x_transformer.py", line 609, in forward
    x = self.token_emb(x)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\modules\sparse.py", line 158, in forward
    return F.embedding(
  File "C:\ProgramData\Anaconda3\envs\ldm\lib\site-packages\torch\nn\functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
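
The last error says an embedding lookup received tensors on two devices. Judging only from the message (this is a guess, not something confirmed in the thread), the token indices may still be sent to cuda:0 while the weights stay on the CPU. A minimal sketch of the usual remedy, using a small stand-in `nn.Embedding` rather than the repository's actual encoder, is to move the inputs to whatever device the module's weights are on:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the text encoder's token embedding (sizes are arbitrary).
token_emb = nn.Embedding(num_embeddings=100, embedding_dim=8)
tokens = torch.tensor([[5, 17, 42]])

# Look up where the weights live and send the indices there before the call,
# instead of hard-coding "cuda" or "cpu".
target = next(token_emb.parameters()).device
out = token_emb(tokens.to(target))

print(out.shape, out.device)
```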

Sylvainsbrr commented 2 years ago

It's not your GPU; I have the same issue with a 3090. These versions resolved the issue: pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
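
To check whether an environment already matches the versions quoted above, a small sketch using only the standard library (Python 3.8+) could look like this. The helper names `installed_version` and `report` are made up for illustration; the version strings are the ones from the comment.

```python
from importlib import metadata

# Versions quoted in the comment above.
SUGGESTED = {"torch": "1.11.0+cu113", "torchvision": "0.12.0+cu113"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(suggested):
    """Build one status line per package comparing installed vs. suggested."""
    lines = []
    for package, wanted in suggested.items():
        found = installed_version(package)
        status = "ok" if found == wanted else "differs"
        lines.append(f"{package}: installed={found} suggested={wanted} ({status})")
    return lines


for line in report(SUGGESTED):
    print(line)
```

This reads package metadata without importing torch, so it also works in an environment where importing torch itself fails.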