evshiron / rocm_lab

DEPRECATED!
https://are-we-gfx1100-yet.github.io

Copy to VRAM hanging #4

Closed: mergmann closed this issue 1 year ago

mergmann commented 1 year ago

I first tried using Stable Diffusion with pytorch/rocm5.4.2. That didn't work (it hangs indefinitely when copying data to VRAM), since my RX 7900 XT is not officially supported by ROCm 5.4. Then I tried compiling PyTorch with rocm5.5 myself in a Docker container. Two hours later, I got the same problem. Then I tried the prebuilt wheels from this repo (with automatic and deepfloyd) and the Docker containers (a1111 and automatic), still no success. Even a simple script like this hangs:

import torch

print('Ok  :', torch.cuda.is_available())
print('CUDA:', torch.version.cuda)
print('HIP :', torch.version.hip)

print('Num :', torch.cuda.device_count())
device = torch.device(0)
print('Dev :', device)

print('Creating tensor')
tensor = torch.Tensor([1., 2., 3., 4.])
print('Copy tensor')
tensor.to(device) # hangs
print('Tensor copied')


dmesg doesn't show anything, and rocminfo and rocm-smi aren't helpful either. radeontop doesn't show any difference from normal usage. With htop I can see that torch maxes out a single core, so it is probably stuck in an infinite loop. GDB backtrace:

(gdb) bt
#0  0x00007f1d28a5aed9 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#1  0x00007f1d28a5ad5e in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007f1d28a4f8a1 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007f1d28a29e01 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007f1d28a43f70 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#5  0x00007f1d28a7d1c2 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#6  0x00007f1d28a7c8ab in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#7  0x00007f1d28a521fc in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#8  0x00007f1d383b2d03 in ?? () from /opt/rocm/lib/libroctracer64.so.4
#9  0x00007f1d383bbb83 in ?? () from /opt/rocm/lib/libroctracer64.so.4
#10 0x00007f1d290bebe3 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#11 0x00007f1d2910875d in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#12 0x00007f1d290f6adb in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#13 0x00007f1d290b42f1 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#14 0x00007f1d29108d70 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#15 0x00007f1d290107e7 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#16 0x00007f1d28ea8b69 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#17 0x00007f1d28f54e14 in hipMemcpyWithStream () from /opt/rocm/hip/lib/libamdhip64.so.5
#18 0x00007f1d0a83402a in at::native::copy_kernel_cuda(at::TensorIterator&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#19 0x00007f1d1629edbe in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0] () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007f1d1629ffc0 in at::native::copy_(at::Tensor&, at::Tensor const&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007f1d16f6250c in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f1d1657289b in at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
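(For anyone trying to reproduce this: a backtrace like the one above can be captured by attaching gdb to the hung process. Illustrative commands only; the process name and PID will differ.)

# find the PID of the hung python process
pgrep -af python

# attach, dump the native backtrace in batch mode, then detach
gdb -p <pid> -batch -ex bt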

Is this a common bug? I think it might actually be a broken amdgpu/ROCm installation.

evshiron commented 1 year ago

I have not encountered this situation before. dmesg should usually say something when things go this wrong.

On my end I have only tested on Ubuntu 22.04, where installing the AMDGPU driver is straightforward.

After restarting, information about the GPU should be visible in rocminfo. At that point, running the docker run command here should let you run the relevant applications.
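For reference, ROCm containers generally need the GPU devices passed through. A typical invocation looks roughly like this (an illustrative sketch only; the exact image name and flags come from the linked instructions):

docker run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  <rocm-enabled-image>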

That is all based on my own experience. If you are interested, you could install Ubuntu and follow the steps above to see if it works.

Afaik, the official images/wheels don't have gfx1100 support, and should fail quickly. If the GPU becomes unrecoverable, a reboot will help.

mergmann commented 1 year ago

I'm using EndeavourOS (Arch); I might try Ubuntu from a USB drive if I still can't get it to work. Oh well, even Blender hangs when using GPU compute.

mergmann commented 1 year ago

I just realized that I have ROCm 5.4.3 installed; 5.5 for Arch is still in development. That means I have to wait a few days until it is released. I hope it'll work then. I thought I had the latest version of the runtime installed, so that's my mistake.

evshiron commented 1 year ago

Hmmm. I can find things like https://archlinux.org/packages/extra-staging/x86_64/rocm-ml-sdk/, but I don't know whether it can be installed yet. Anyway, good luck!

mergmann commented 1 year ago

Nah, installing from a staging repo is not a good idea.

ewof commented 1 year ago

Tried with rocm-ml-sdk on Artix, still bugged.

mergmann commented 1 year ago

I somehow got the full backtrace from gdb; this time it resolves the functions where it got stuck. Probably not much help, but it is interesting for tracking down the bug. I guess it is a busy loop? I might try looking into the code. It could be this one: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/6fdf759273a098829dfd642fb730ea410f33b152/src/core/runtime/interrupt_signal.cpp#L139

#0  0x00007f731904d5cf in rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#1  0x00007f731904d48a in rocr::core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#2  0x00007f7319041979 in rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#3  0x00007f731901dc70 in rocr::AMD::BlitKernel::SubmitLinearCopyCommand(void*, void const*, unsigned long) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#4  0x00007f7319036525 in rocr::(anonymous namespace)::RegionMemory::Freeze() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#5  0x00007f731906eb44 in rocr::amd::hsa::loader::Segment::Freeze() [clone .part.29] () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#6  0x00007f731906ebbf in rocr::amd::hsa::loader::ExecutableImpl::Freeze(char const*) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#7  0x00007f731906e2a8 in rocr::amd::hsa::loader::AmdHsaCodeLoader::FreezeExecutable(rocr::amd::hsa::loader::Executable*, char const*) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#8  0x00007f7319045de7 in rocr::HSA::hsa_executable_freeze(hsa_executable_s, char const*) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#9  0x00007f733ae07ccf in roctracer::hsa_support::(anonymous namespace)::ExecutableFreezeIntercept(hsa_executable_s, char const*) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libroctracer64.so
#10 0x00007f733ae108fc in roctracer::hsa_support::detail::hsa_executable_freeze_callback(hsa_executable_s, char const*) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libroctracer64.so
#11 0x00007f73654c3f9f in roc::LightningProgram::setKernels(void*, unsigned long, int, unsigned long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#12 0x00007f7365481a66 in device::Program::loadLC() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#13 0x00007f7365481b1f in device::Program::load() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#14 0x00007f73654ac334 in amd::Program::load(std::vector<amd::Device*, std::allocator<amd::Device*> > const&) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#15 0x00007f736547effc in amd::Device::BlitProgram::create(amd::Device*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#16 0x00007f73654baff1 in roc::Device::createBlitProgram() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#17 0x00007f73654fe260 in roc::KernelBlitManager::createProgram(roc::Device&) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#18 0x00007f73654d03fd in roc::VirtualGPU::create() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#19 0x00007f73654b6353 in roc::Device::createVirtualDevice(amd::CommandQueue*) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#20 0x00007f73654a54d0 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) ()
   from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#21 0x00007f736540028e in hip::Stream::Create() () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#22 0x00007f7365400580 in hip::Stream::asHostQueue(bool) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#23 0x00007f736529ae2e in hip::Device::NullStream(bool) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#24 0x00007f736537e9cd in hipMemcpyWithStream () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#25 0x00007f73672d24f8 in at::native::copy_kernel_cuda(at::TensorIterator&, bool) () from /home/mattisb/Programming/AI/deepfloyd-if/.venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
mergmann commented 1 year ago

Weird. I tried running Stable Diffusion with the latest ROCm 5.6 on Ubuntu and EndeavourOS. It doesn't hang anymore, but AUTOMATIC1111 errors with "A tensor with all NaNs was produced in VAE." or "A tensor with all NaNs was produced in Unet." I tried using "--no-half" and "--precision full", but that didn't help.

I debugged it a bit further with ComfyUI and found that all the models involved tend to return values like -e+38, e+38, -inf, inf, nan, and those values propagate through to the other networks. For example, CLIP might return e+38, and then the ksampler likely returns inf or nan. If CLIP returns "normal" values, the ksampler returns e+38 or inf, so the VAE produces NaNs. I don't know why this happens on both Arch and Ubuntu 22.04. My GPU is connected to a PCIe 4 slot that is also set to PCIe 4 mode. When running on the CPU (e.g. with --cpu in ComfyUI), everything is fine.

Example:

Code: print('Values after CLIP', np.unique(cond, return_counts=True))
Output: Values after CLIP (array([nan], dtype=float32), array([59136]))
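A quick way to narrow down where the bad values first appear is to check each intermediate tensor (an illustrative sketch, not the exact code used above; torch.isnan and torch.isinf are standard PyTorch calls):

import torch

def check(name, t):
    # t is assumed to be a torch.Tensor; report NaN/inf counts and the value range
    t = t.detach().float()
    print(f'{name}: nan={torch.isnan(t).sum().item()}, '
          f'inf={torch.isinf(t).sum().item()}, '
          f'min={t.min().item()}, max={t.max().item()}')

# e.g. right after the CLIP/text-encoder step:
# check('cond after CLIP', cond)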

evshiron commented 1 year ago

@MattisBergmann

Weird. Does torch work for simpler code?

Btw, did you try setting HSA_OVERRIDE_GFX_VERSION=11.0.0 and HIP_VISIBLE_DEVICES=0 for your Navi 31 GPU before launching WebUI? This should be set for every application to ensure best compatibility.

The device index in HIP_VISIBLE_DEVICES=0 may vary. You can find the ordering via rocminfo; it starts from 0.
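For example (illustrative; the agent order comes from your rocminfo output, and the launch command depends on which WebUI you use):

$ rocminfo | grep gfx                      # agents are listed in order; the first GPU is index 0
$ export HSA_OVERRIDE_GFX_VERSION=11.0.0
$ export HIP_VISIBLE_DEVICES=0
$ ./webui.sh --debug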

mergmann commented 1 year ago

Weird. Does torch work for simpler code?

Yes, running simple operations like adding, multiplying, etc. on tensors works.
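For example, elementwise GPU tensor math along these lines runs fine (an illustrative sketch of the kind of "simple operations" meant here):

import torch

device = torch.device('cuda')   # ROCm builds of PyTorch expose the GPU through the CUDA API
a = torch.randn(4, device=device)
b = torch.randn(4, device=device)
print(a + b)   # elementwise add works
print(a * b)   # elementwise multiply works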

Btw, did you try setting HSA_OVERRIDE_GFX_VERSION=11.0.0 and HIP_VISIBLE_DEVICES=0 for your Navi 31 GPU before launching WebUI? This should be set for every application to ensure best compatibility.

I tried it both with and without those options; nothing changed.

I'm on vacation right now, so I hope that by the time I'm back the drivers will be in stable and PyTorch will have a stable 5.5 build as well.

evshiron commented 1 year ago

I am closing this issue. If you still have issues running Stable Diffusion, you can try:

git clone https://github.com/vladmandic/automatic
cd automatic
./webui.sh --debug

which should now provide a smooth out-of-the-box experience.

ewof commented 1 year ago

did something change to make it work?

evshiron commented 1 year ago

Arch Linux should now have modern ROCm packages. I am not sure what doesn't work.

ewof commented 1 year ago

I forgot about my own duplicate issue, where I fixed it by switching to Ubuntu, lol. My bad.

mergmann commented 1 year ago

Even with rocm5.6 it won't work; on Arch I just get RuntimeError: HIP error: the operation cannot be performed in the present state, and on Ubuntu it fails to install ROCm. I might reinstall Ubuntu and try it again. It is just very annoying to work with two bootloaders :/ (I use EndeavourOS with systemd-boot and Ubuntu with GRUB). I guess the problem lies somewhere else, neither in PyTorch nor in ROCm.

evshiron commented 1 year ago

@MattisBergmann

mergmann commented 1 year ago

Is your current user in both video and render groups?

Yes, but I added that after installing ROCm.

Can you run rm -r venv sdnext.log && ./webui.sh --debug, and then post sdnext.log here?

I ran that command both with and without TORCH_COMMAND="--pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6". With rocm5.4.2 it segfaults.

sdnext_rocm542.log sdnext_rocm56.log

evshiron commented 1 year ago

@MattisBergmann

You don't need to specify TORCH_COMMAND="--pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6".

./webui.sh will now take care of all of it.

Could you unset that and try again?

mergmann commented 1 year ago

I git pulled the latest version and it installed rocm 5.4.2.

Could you unset that and try again?

I already did that. That is the log "sdnext_rocm542.log"

evshiron commented 1 year ago

@MattisBergmann

Ouch. I am sorry. Can you post the outputs of rocm_agent_enumerator and rocminfo?

And hipconfig --version as well.

If you are using Arch Linux variants, installing rocm-ml-sdk should have you covered.

mergmann commented 1 year ago

rocm_agent_enumerator

gfx000
gfx1100

rocminfo rocminfo.txt

hipconfig --version

5.6.31061-

evshiron commented 1 year ago

@MattisBergmann

Can you run rm -r venv sdnext.log && ./webui.sh --debug again?

I want to check the very beginning of the log, which says "AMD ROCm toolkit detected"; there should be some info about which devices are detected and which one is used, and that depends on the commands from the previous reply.

mergmann commented 1 year ago

With rocm-ml-sdk it is still not working; I'm getting the same issue.

evshiron commented 1 year ago
$ ./webui.sh --debug
Create and activate python venv
Launching launch.py...
02:11:57-693728 INFO     Starting SD.Next                                                                                                                                
02:11:57-695474 INFO     Python 3.10.12 on Linux                                                                                                                         
02:11:57-697494 INFO     Version: 22dc42fd Tue Aug 15 01:07:00 2023 +0800                                                                                                
02:11:57-927786 INFO     Latest published version: fce48be440b888ce4ceb27f4d081454d6cc8fd2b 2023-08-14T07:58:42Z                                                         
02:11:57-928447 DEBUG    Setting environment tuning                                                                                                                      
02:11:57-928928 DEBUG    Torch overrides: cuda=False rocm=False ipex=False diml=False                                                                                    
02:11:57-929432 DEBUG    Torch allowed: cuda=True rocm=True ipex=True diml=True                                                                                          
02:11:57-929996 INFO     AMD ROCm toolkit detected                                                                                                                       
02:11:57-945125 DEBUG    ROCm agents detected: ['gfx1100', 'gfx1036']                                                                                                    
02:11:57-945751 DEBUG    ROCm agent used by default: idx=0 gpu=gfx1100 arch=navi3x                                                                                       
02:11:57-971205 DEBUG    ROCm version detected: 5.6

What's your output after installing rocm-ml-sdk, once all of those commands are available?

mergmann commented 1 year ago

after unsetting TORCH_COMMAND, it shows that. sdnext.log

evshiron commented 1 year ago

@MattisBergmann

It doesn't actually seem to be unset: if TORCH_COMMAND is set, "AMD ROCm toolkit detected" will not be printed in the log, according to https://github.com/vladmandic/automatic/blob/master/installer.py#L309, and that is exactly what your log shows.

Would you mind starting a new terminal window, or double-checking that it is really unset with the export command, and trying again?

mergmann commented 1 year ago

Oh, I guess the problem was that I forgot to remove the log, so it contained all the previous runs. Here is the new log: sdnext.log

mergmann commented 1 year ago

I'll make a fresh install of ubuntu tomorrow and I'll try to get it to run there.

evshiron commented 1 year ago

@MattisBergmann

I'm out of ideas.

One last check: did dmesg say anything about the failure?

mergmann commented 1 year ago

No, the newest entries in dmesg are from the last boot.

evshiron commented 1 year ago

@MattisBergmann

I am curious: do /dev/kfd and /dev/dri exist?

If you are going to try it on a fresh Ubuntu, would you mind doing this:

# install dependencies
sudo apt update && sudo apt install -y git python3-pip python3-venv python3-dev libstdc++-12-dev

# install the amdgpu driver with rocm support
curl -O https://repo.radeon.com/amdgpu-install/5.6/ubuntu/jammy/amdgpu-install_5.6.50600-1_all.deb
sudo dpkg -i amdgpu-install_5.6.50600-1_all.deb

# opencl might cause issues later, so skip it unless you need it
sudo amdgpu-install --usecase=graphics,rocm

# grant the current user access to the gpu devices
sudo usermod -aG video $USER
sudo usermod -aG render $USER

# a reboot is needed for both the driver and the group changes to take effect
sudo reboot

# after rebooting, set up and launch SD.Next
git clone https://github.com/vladmandic/automatic
cd automatic
./webui.sh --debug

Sources:

Sorry for taking up so much of your time here.

mergmann commented 1 year ago

ls -l /dev/kfd /dev/dri

crw-rw-rw- 1 root render 234, 0 14. Aug 20:00 /dev/kfd

/dev/dri:
total 0
drwxr-xr-x  2 root root         80 14. Aug 20:00 by-path
crw-rw----+ 1 root video  226,   1 14. Aug 20:00 card1
crw-rw-rw-  1 root render 226, 128 14. Aug 20:00 renderD128
mergmann commented 1 year ago

Sorry for taking up so much of your time here.

No problem, I also want to get it fixed. Thanks for your help!

mergmann commented 1 year ago

Even on a fresh Ubuntu install, it is still the same error. However, when installing ROCm with amdgpu-install, it showed some warnings: W: Possible missing firmware /lib/firmware/amdgpu/<file>.bin for module amdgpu. I don't have the exact warnings, but they looked similar to https://askubuntu.com/questions/1124253/missing-firmware-for-amdgpu. I downloaded the newest firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git and reinstalled amdgpu, yet that didn't resolve all of those warnings. After rebooting, I still get RuntimeError: HIP error: the operation cannot be performed in the present state.

mergmann commented 1 year ago

I hope I don't have a faulty card or mainboard. What I could try is putting the GPU in the PCIe 3.0 slot, but that is some work, as it doesn't fit very well with the mainboard and case.

evshiron commented 1 year ago

@MattisBergmann

Those warnings should be safe to ignore, I guess. But it's weird that RuntimeError: HIP error: the operation cannot be performed in the present state still happens on Ubuntu.

What's the vendor of your RX 7900 XT?

mergmann commented 1 year ago

The card is an XFX SPEEDSTER MERC 310 AMD Radeon™ RX 7900 XT (So the vendor is XFX)

evshiron commented 1 year ago

@MattisBergmann

The following code snippet is from:

Would you mind trying this and seeing at which line it starts to fail?

$ export HIP_VISIBLE_DEVICES=0
$ export HSA_OVERRIDE_GFX_VERSION=11.0.0
$ source venv/bin/activate
$ python3.10
import torch
device='cuda' # None?

rnd = torch.sum(torch.randn(2, 2)).to(device)
print(rnd)

x = torch.tensor([[1.5,.0,.0,.0]]).to(device).half()
layerNorm = torch.nn.LayerNorm(4, eps=0.00001, elementwise_affine=True, dtype=torch.float16, device=device)
y = layerNorm(x)
print(y)
evshiron commented 1 year ago

I made a fresh Arch Linux installation just now, following the Arch Wiki.

On my end it works just fine.

I can't believe it's a hardware issue, but I don't have any other ideas right now.

mergmann commented 1 year ago

I have found out that I can enable logging with AMD_LOG_LEVEL=<log level>. Setting it to 1 (error) reveals the error:

:1:rocvirtual.cpp           :2902: 4030434843 us: 14659: [tid:0x7f3dd48fe6c0] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp           :3235: 4030434846 us: 14659: [tid:0x7f3dd48fe6c0] AQL dispatch failed!

I have actually never heard of PCIe atomics before, but it seems that ROCm requires them. I also can't find much information about which mainboards and CPUs actually support them. I guess my mainboard/CPU doesn't.

Mainboard: ASUS PRIME B560 PLUS
CPU: Intel i5-11400f
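One way to check this from Linux is to look at the AtomicOps fields in the card's extended PCIe capabilities via lspci (illustrative command; 1002 is AMD's PCI vendor ID, and root is needed for the full -vvv capability dump):

sudo lspci -vvv -d 1002: | grep -E 'AtomicOpsCap|AtomicOpsCtl'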

evshiron commented 1 year ago

I have actually never heard of PCIe atomics before, but it seems that ROCm requires them.

This is true. Glad you found them. Your hardware doesn't look outdated.

According to this, PCIe Atomics were introduced in PCIe 3.0.

Maybe some BIOS tweaking?

See also:

ghost commented 1 year ago

@MattisBergmann @evshiron I have encountered the same issue (though I'm not trying to run Stable Diffusion or anything in this example, just basic PyTorch). I'll share my setup:

  1. Asus ROG Strix B550-F motherboard (2 PCIe x16 lanes)
  2. AMD Ryzen 5 5600X 6-Core Processor
  3. 1300W evga supernova PSU
  4. I have two Radeon 7900 XTX's. My goal has been to try to chain them together to run LLM's and other models that require more VRAM. (So far unsuccessful)
  5. I have tried on Pop!_OS 22.04 LTS, Fedora 38 (with rocm 5.5.x packages) and Fedora Rawhide (with 5.6.x packages). Same results on all platforms.

When running with AMD_LOG_LEVEL=3 and HIP_VISIBLE_DEVICES=0, I get the following as part of the verbose stack trace/debug output:

:1:rocvirtual.cpp           :2902: 1256580953 us: 19667: [tid:0x7f1b8eb82740] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp           :3235: 1256580956 us: 19667: [tid:0x7f1b8eb82740] AQL dispatch failed!                                                                                                                                            
:3:hip_module.cpp           :663 : 1256580959 us: 19667: [tid:0x7f1b8eb82740] hipLaunchKernel: Returned hipErrorIllegalState :                                                               
:3:hip_error.cpp            :27  : 1256580963 us: 19667: [tid:0x7f1b8eb82740]  hipGetLastError (  )                                                                                                                                           
:3:hip_error.cpp            :27  : 1256580966 us: 19667: [tid:0x7f1b8eb82740]  hipGetLastError (  )                                                                                                                                           
:3:hip_device_runtime.cpp   :561 : 1256583779 us: 19667: [tid:0x7f1b8eb82740]  hipSetDevice ( 0 )                                                                                                                                             
:3:hip_device_runtime.cpp   :565 : 1256583783 us: 19667: [tid:0x7f1b8eb82740] hipSetDevice: Returned hipSuccess :

However, setting HIP_VISIBLE_DEVICES=1 (to the first PCIe x16 card) works fine, so maybe the motherboard only enables atomics for one of the two PCIe x16 slots?

Seems odd. I am out of my depth here. I contacted AMD support last week but haven't heard back from them since it got escalated to a supervisor. They were helpful throughout our conversation - the impression I got was that they don't really get many questions about rocm so it was something they needed to escalate in order to get any answers on.

I'll see if I can explore my options for PCIe atomics as well, this is the first thread I've come across that mentions this since I started trying to work with these 7900 XTX cards. If I find anything, I'll follow up.

If anyone is more knowledgeable than I am on the feasibility of this kind of consumer-grade dual GPU setup, I would love to hear suggestions.

P.S. @evshiron thanks for your blog posts about Are We GFX1100 Yet? - they have been very helpful in debugging some issues with pytorch.

evshiron commented 1 year ago

@codinglife9531

Thanks for reaching out! I am glad it has been helpful.

According to these links:

It seems that only the PCI-E lanes from the CPU support PCIe Atomics, which might be why your configuration is not working.

High-end consumer motherboards like the ROG STRIX X670E-E GAMING WIFI (see "Expansion Slots"; not a recommendation) allow splitting and running in x8/x8 mode when two slots are used. As both slots come off the CPU, I guess both of them support PCIe Atomics.

I am not sure if we can split from x16 to x8/x8 via an external PCI-E splitter while preserving support for PCIe Atomics.

ghost commented 1 year ago

@evshiron Thanks for your response. I'll dig into it further.

Still learning here, but in your experience/opinion, do you think that splitting into x8/x8 (edit: not via an external splitter btw) will impact something like LLM inferencing substantially? I'd assume that x16 would be preferable.

evshiron commented 1 year ago

@codinglife9531

My RX 7900 XTX used to work on a B450M motherboard, which is PCI-E 3.0 x16. Now it's running on PCI-E 4.0 x16. After the upgrade, I noticed that the inference performance of GPTQ doubled, so I believe that bandwidth does have a significant impact in LLM scenarios, but I didn't observe a significant difference in Stable Diffusion scenarios.

In my country, there are online platforms that sell outdated used server motherboards and CPUs. They are much cheaper compared to brand new ones, making them perfect for tinkering. Server configurations usually support a much larger number of PCI-E slots, so perhaps you can consider exploring in that direction.

mergmann commented 1 year ago

I already have it in the slot that comes from the CPU; it is the PCIe 4.0 slot. It might have to do with the NVMe SSD, but the data sheet states that my CPU has 20 lanes available, of which 16 go to the GPU and 4 to the SSD.

evshiron commented 1 year ago

@MattisBergmann

Yes. The situation you are currently facing is quite weird, but I have exhausted all my ideas now.

evshiron commented 1 year ago

Some other info:

lspci -vvv info for my RX 7900 XTX:

                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
mergmann commented 1 year ago

Well, for me it shows AtomicOpsCtl: ReqEn-. I don't know exactly what that means, but I would interpret it as atomics not being enabled, even though the card is connected to an x16 slot wired directly to the CPU.

                LnkSta: Speed 16GT/s, Width x16
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Form Factor Dev Specific, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
evshiron commented 1 year ago

@MattisBergmann

Maybe the setpci command here will help you:

But it looks magical, and I don't know whether rebooting will revert the change.