HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

stable-diffusion-v-2-1 txt2img example fails with RuntimeError: Graph compile failed. #35

Open · ctodd opened this issue 12 months ago

ctodd commented 12 months ago

Environment: AWS DL1, Ubuntu 22.04 (bare metal driver install), Python 3.10.12, SynapseAI 1.12.1

Running in habanalabs-venv on the host OS (no container)

Followed the instructions from the README.md.

$ python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --n_samples 1 --n_iter 3 --use_hpu_graph

Seed set to 42
Loading model from v2-1_768-ema-pruned.ckpt
Global Step: 110000
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 784282744 KB

Data shape for DDIM sampling is (1, 4, 96, 96), eta 0.0
Compiling HPU graph encode_with_transformer
Traceback (most recent call last):
  File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 360, in <module>
    main(opt)
  File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 300, in main
    c_in = runner.run(model.cond_stage_model.encode_with_transformer, tokens)
  File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 222, in run
    graph.capture_begin()
  File "/home/ubuntu/habanalabs-venv/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 34, in capture_begin
    _hpu_C.capture_begin(self.hpu_graph, dry_run)
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].

$ pip list

Package  Version  Editable project location
absl-py  2.0.0
accelerate  0.24.1
aiohttp  3.9.0
aiosignal  1.3.1
altair  5.1.2
annotated-types  0.6.0
antlr4-python3-runtime  4.8
anyio  3.7.1
appdirs  1.4.4
arrow  1.3.0
async-timeout  4.0.3
attrs  23.1.0
av  9.2.0
backoff  2.2.1
beautifulsoup4  4.12.2
blessed  1.20.0
boto3  1.29.3
botocore  1.32.3
cachetools  5.3.2
certifi  2023.11.17
cffi  1.15.1
cfgv  3.4.0
charset-normalizer  3.3.2
clean-fid  0.1.35
click  8.1.7
clip-anytorch  2.5.2
cmake  3.27.7
contourpy  1.2.0
croniter  1.4.1
cycler  0.12.1
dateutils  0.6.12
deepdiff  6.7.1
distlib  0.3.7
docker-pycreds  0.4.0
einops  0.3.0
exceptiongroup  1.1.3
expecttest  0.1.6
fastapi  0.104.1
ffmpy  0.3.1
filelock  3.13.1
fonttools  4.44.3
frozenlist  1.4.0
fsspec  2023.10.0
ftfy  6.1.1
gitdb  4.0.11
GitPython  3.1.40
google-auth  2.23.4
google-auth-oauthlib  0.4.6
gradio  3.13.1
grpcio  1.59.3
h11  0.12.0
habana-gpu-migration  1.12.1.10
habana-media-loader  1.12.1.10
habana-pyhlml  1.12.1.10
habana-torch-dataloader  1.12.1.10
habana-torch-plugin  1.12.1.10
httpcore  0.15.0
httpx  0.25.1
huggingface-hub  0.19.4
identify  2.5.31
idna  3.4
imageio  2.32.0
inquirer  3.1.3
intel-openmp  2023.2.0
itsdangerous  2.1.2
Jinja2  3.1.2
jmespath  1.0.1
jsonmerge  1.9.2
jsonschema  4.20.0
jsonschema-specifications  2023.11.1
k-diffusion  0.0.14
kiwisolver  1.4.5
kornia  0.7.0
lazy_loader  0.3
lightning  2.0.6
lightning-cloud  0.5.54
lightning-habana  1.0.1
lightning-utilities  0.10.0
linkify-it-py  2.0.2
Markdown  3.5.1
markdown-it-py  3.0.0
MarkupSafe  2.1.3
matplotlib  3.8.2
mdit-py-plugins  0.4.0
mdurl  0.1.2
mkl  2023.1.0
mkl-include  2023.1.0
mpi4py  3.1.4
mpmath  1.3.0
multidict  6.0.4
networkx  3.2.1
ninja  1.11.1.1
nodeenv  1.8.0
numpy  1.23.5
oauthlib  3.2.2
omegaconf  2.1.1
open-clip-torch  2.7.0
ordered-set  4.1.0
orjson  3.9.10
packaging  23.2
pandas  2.0.1
pathspec  0.11.2
perfetto  0.7.0
Pillow  10.0.1
Pillow-SIMD  7.0.0.post3
pip  22.3.1
platformdirs  3.11.0
pre-commit  3.3.3
protobuf  3.20.3
psutil  5.9.6
pyasn1  0.5.0
pyasn1-modules  0.3.0
pybind11  2.10.4
pycparser  2.21
pycryptodome  3.19.0
pydantic  2.0.3
pydantic_core  2.3.0
pydub  0.25.1
Pygments  2.17.0
PyJWT  2.8.0
pynvml  8.0.4
pyparsing  3.1.1
python-dateutil  2.8.2
python-editor  1.0.4
python-multipart  0.0.6
pytorch-lightning  2.1.2
pytz  2023.3.post1
PyYAML  6.0
readchar  4.0.5
referencing  0.31.0
regex  2023.5.5
requests  2.31.0
requests-oauthlib  1.3.1
resize-right  0.0.2
rich  13.7.0
rpds-py  0.13.0
rsa  4.9
s3transfer  0.7.0
scikit-image  0.22.0
scipy  1.11.3
sentry-sdk  1.35.0
setproctitle  1.3.3
setuptools  68.2.2
six  1.16.0
smmap  5.0.1
sniffio  1.3.0
soupsieve  2.5
stable-diffusion  0.0.1  /home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1
starlette  0.27.0
starsessions  1.3.0
sympy  1.12
tbb  2021.11.0
tdqm  0.0.1
tensorboard  2.11.2
tensorboard-data-server  0.6.1
tensorboard-plugin-wit  1.8.1
tifffile  2023.9.26
tokenizers  0.12.1
toolz  0.12.0
torch  2.0.1a0+gitf520939
torch-tb-profiler  0.4.0
torchaudio  2.0.1+3b40834
torchdata  0.6.1+e1feeb2
torchdiffeq  0.2.3
torchmetrics  1.2.0
torchsde  0.2.6
torchtext  0.15.2a0+4571036
torchvision  0.15.1a0+42759b1
tqdm  4.66.1
traitlets  5.13.0
trampoline  0.1.2
transformers  4.19.2
types-python-dateutil  2.8.19.14
typing_extensions  4.8.0
tzdata  2023.3
uc-micro-py  1.0.2
urllib3  1.26.18
uvicorn  0.24.0.post1
virtualenv  20.24.6
wandb  0.16.0
wcwidth  0.2.10
websocket-client  1.6.4
websockets  12.0
Werkzeug  3.0.1
wheel  0.41.3
yamllint  1.33.0
yarl  1.9.2

$ hl-smi

+-----------------------------------------------------------------------------+
| HL-SMI Version:   hl-1.12.1-fw-46.0.5.0                                      |
| Driver Version:   1.12.1-cb7a7bc                                             |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-205             N/A   | 0000:90:1d.0     N/A |                    0 |
| N/A  49C   N/A   104W / 350W  |   512MiB / 32768MiB  |     2%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-205             N/A   | 0000:20:1d.0     N/A |                    0 |
| N/A  50C   N/A   105W / 350W  |   512MiB / 32768MiB  |     3%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-205             N/A   | 0000:20:1e.0     N/A |                    0 |
| N/A  50C   N/A   102W / 350W  |   512MiB / 32768MiB  |     2%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-205             N/A   | 0000:10:1e.0     N/A |                    0 |
| N/A  48C   N/A    97W / 350W  |   512MiB / 32768MiB  |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-205             N/A   | 0000:10:1d.0     N/A |                    0 |
| N/A  49C   N/A   101W / 350W  |   512MiB / 32768MiB  |     1%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-205             N/A   | 0000:90:1e.0     N/A |                    0 |
| N/A  48C   N/A   108W / 350W  |   512MiB / 32768MiB  |     4%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-205             N/A   | 0000:a0:1d.0     N/A |                    0 |
| N/A  46C   N/A   104W / 350W  |   512MiB / 32768MiB  |     2%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-205             N/A   | 0000:a0:1e.0     N/A |                    0 |
| N/A  40C   N/A   108W / 350W  |   512MiB / 32768MiB  |     4%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

rofinn commented 3 months ago

I realize this is an old issue, but if you enable debug logging:

export LOG_LEVEL_ALL_PT=1
export ENABLE_CONSOLE=true
export LOG_LEVEL_ALL=3
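
For example, with those variables exported in the current shell, you can rerun the failing command and keep a copy of the console output for inspection (the tee redirect and the log file name below are illustrative additions, not part of the original report):

# rerun the same txt2img invocation from above and save the console log
python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" \
  --ckpt v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml \
  --H 768 --W 768 --n_samples 1 --n_iter 3 --use_hpu_graph 2>&1 | tee txt2img_debug.log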

Do you see output that looks like:

[18:25:56.211607][KERNEL_DB             ][error][tid:12636] Environment variable GC_KERNEL_PATH is undefined
[18:25:56.215119][KERNEL_DB             ][error][tid:12636] Environment variable GC_KERNEL_PATH is undefined
[18:25:56.215829][KERNEL_DB             ][error][tid:12636] Environment variable GC_KERNEL_PATH is undefined
[18:25:56.935917][EAGER                 ][error][tid:12636] Failed to init NOP kernel for gaudi2
[18:25:56.935975][EAGER                 ][error][tid:12636] Failed to init NOP kernel for gaudi3

We ran into a similar issue and needed to set:

export GC_KERNEL_PATH="/usr/lib/habanalabs/libtpc_kernels.so"
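
As a quick sanity check before exporting, it may be worth confirming that the TPC kernel library actually exists at that location, since the path can differ between driver installs; the path below is simply the one from the setup above:

# confirm the TPC kernel library is present at the expected path
ls -l /usr/lib/habanalabs/libtpc_kernels.so
# if it is there, point the graph compiler at it (add to ~/.bashrc to persist across sessions)
export GC_KERNEL_PATH="/usr/lib/habanalabs/libtpc_kernels.so"
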
victordion commented 3 months ago

Thanks @rofinn, that was the cause in my case as well.