Closed nktice closed 1 month ago
Can confirm, same GPU and same issue
I have the same issue, only the last line numbers are different.
I installed ROCm in my Ubuntu 24.04 distrobox: amdgpu-install_6.2.60202-1_all.deb
Followed the ComfyUI guide with the nighly verion.
OS posix Python Version 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0] Embedded Python false Pytorch Version 2.6.0.dev20241020+rocm6.2 Arguments main.py RAM Total 31.26 GB RAM Free 18.61 GB Devices Name cuda:0 AMD Radeon RX 6950 XT : native Type cuda VRAM Total 15.98 GB VRAM Free 14.24 GB Torch VRAM Total 1.65 GB Torch VRAM Free 124 MB
Traceback (most recent call last):
File "/home/bp/Homebox/AI/ComfyUI/execution.py", line 323, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/execution.py", line 198, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
process_inputs(input_dict, i)
File "/home/bp/Homebox/AI/ComfyUI/execution.py", line 158, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/nodes.py", line 65, in encode
output = clip.encode_from_tokens(tokens, return_pooled=True, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/sd.py", line 125, in encode_from_tokens
o = self.cond_stage_model.encode_token_weights(tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/sdxl_clip.py", line 60, in encode_token_weights
g_out, g_pooled = self.clip_g.encode_token_weights(token_weight_pairs_g)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/sd1_clip.py", line 41, in encode_token_weights
o = self.encode(to_encode)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/sd1_clip.py", line 238, in encode
return self(tokens)
^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/sd1_clip.py", line 210, in forward
outputs = self.transformer(tokens, attention_mask_model, intermediate_output=self.layer_idx, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/clip_model.py", line 136, in forward
x = self.text_model(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/clip_model.py", line 112, in forward
x, i = self.encoder(x, mask=mask, intermediate_output=intermediate_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/clip_model.py", line 69, in forward
x = l(x, mask, optimized_attention)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/clip_model.py", line 50, in forward
x += self.self_attn(self.layer_norm1(x), mask, optimized_attention)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/clip_model.py", line 17, in forward
q = self.q_proj(x)
^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/ops.py", line 68, in forward
return self.forward_comfy_cast_weights(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bp/Homebox/AI/ComfyUI/comfy/ops.py", line 64, in forward_comfy_cast_weights
return torch.nn.functional.linear(input, weight, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Attempting to use hipBLASLt on a unsupported architecture!```
Hi @nktice. Internal ticket has been created to investigate your issue. Thanks!
Hi @adamrokickirns, @bpjoestar, I believe this is a known issue is related to https://github.com/pytorch/pytorch/issues/138067. It should be fixed in the latest PR https://github.com/pytorch/pytorch/pull/138267 which was only merged 2 days ago. What happened is that ROCm is supposed to not use hipBlASLt on unsupported hardware, but that behavior was accidentally removed in PR https://github.com/pytorch/pytorch/pull/137604. It should work with 20241021's nightly build.
By the way @nktice seems like you were able to confirm it works in the other thread, so I will be closing this issue. Please feel free to re-open if the issue persists. Thanks!
By the way @nktice seems like you were able to confirm it works in the other thread, so I will be closing this issue. Please feel free to re-open if the issue persists. Thanks!
That nightly did work... but @naromero77amd said that it wasn't right, and issue regressed. It does however look like they're aware of it, and working on it.
@nktice Thank you for the follow up! I think for the purpose of this particular issue, we can consider the issue is patched for now despite the regression. However, I will link the proper fix here to help keeping track of the progress. https://github.com/pytorch/pytorch/pull/138267
Thanks!
Problem Description
I maintain this guide for using some AI tools on AMD cards : http://github.com/nktice/AMD-AI There was an option to stop HipBLASLt from loading : "TORCH_BLAS_PREFER_HIPBLASLT=0" That had been working ( ROCm 6.2.1 / PyTorch 2.4.x ) ... alas this appears to not be functioning with PyTorch 2.6.x.
Both Stable Diffusion and ComfyUI that were previously working are now crashing with the following error - "RuntimeError: Attempting to use hipBLASLt on a unsupported architecture!" What had worked before, is no longer functional as a work around.
This is with torch==2.6.0.dev20241015+rocm6.2 torch==2.6.0.dev20241014+rocm6.2
[ and as default install for "torch" from https://download.pytorch.org/whl/nightly/torch/ at time of writing... ] I have reverted and tried with torch==2.5.0.dev20240912 torchvision==0.20.0.dev20240912+rocm6.2 torch==2.6.0.dev20240913 torchvision==0.20.0.dev20240913+rocm6.2 torch==2.6.0.dev20241011 torchvision==0.20.0.dev20241011+rocm6.2 ... torch==2.6.0.dev20241013 torchvision==0.20.0.dev20241013+rocm6.2 And do not get this error, and things work 'normally'
( Yes, I had to go and download pytorch's files to try each of those... )
Operating System
Ubuntu 24.04.1
CPU
AMD Ryzen 9 5950x
GPU
AMD Radeon RX 7900 XTX
Other
No response
ROCm Version
ROCm 5.7.1
ROCm Component
No response
Steps to Reproduce
Here is a link to the guide that uses nightlies and/or dev versions - https://github.com/nktice/AMD-AI/blob/main/dev.md
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.8.5 is loaded =====================
HSA System Attributes
=====================
Runtime Version: 1.14 Runtime Ext Version: 1.6 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED DMAbuf Support: YES
==========
HSA Agents
==========
Agent 1
Name: AMD Ryzen 9 5950X 16-Core Processor Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5950X 16-Core Processor Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65747224(0x3eb3918) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 65747224(0x3eb3918) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65747224(0x3eb3918) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1100
Uuid: GPU-[redacted]
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2431
BDFID: 3072
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Agent 3
Name: gfx1100
Uuid: GPU-[redacted] Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2431
BDFID: 3840
Internal Node ID: 2
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done
Additional Information
Here's the full output from ComfyUI