Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

KGPE-D16 IOMMU Corruption #436

Open xloem opened 1 year ago

xloem commented 1 year ago

Dasharo version v0.4.0

Dasharo variant Asus KGPE-D16

Affected component(s) or functionality iommu?

Brief summary I am experiencing data corruption in transfers between PCI cards that resolves only when the kernel is booted with iommu=soft or iommu=off. The issue limits the usefulness of the system for AI/ML.
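For reference, a minimal sketch of how the workaround parameter can be applied on Ubuntu, assuming the default GRUB setup (adjust for your bootloader):

    # /etc/default/grub: append the parameter to the kernel command line
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
    # regenerate the GRUB configuration and reboot
    sudo update-grub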

How reproducible 100%

How to reproduce I am still working on narrowing the issue down to reduce the hardware needed to reproduce it, but at the moment I only have my two Nvidia K80 cards to test the IOMMU with.

Steps to reproduce the behavior:

  1. Boot the system with an Nvidia K80 PCI card (I am presently testing with two installed on Ubuntu 20)
  2. Install cuda-drivers-470 and cuda-toolkit-11-4 from Nvidia
  3. Install python3-pip
  4. Install PyTorch with python3 -m pip install torch
  5. Create test.py as follows:
    import torch
    t = torch.tensor([1,2])
    t0 = t.to(0) # send t to device 0
    t01 = t0.to(1) # send t0 to device 1
    print(t)
    print(t0)
    print(t01)
  6. Run python3 test.py (a quick peer-access check is sketched below)
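For reference, a quick way to check beforehand whether the driver reports peer-to-peer access between the two devices (a sketch, assuming the driver and torch install from the steps above):

    # show the PCIe topology and which GPU pairs the driver considers P2P-capable
    nvidia-smi topo -m
    # ask PyTorch whether device 0 can directly access device 1
    python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"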

Expected behavior

The array should hold the same data on every device.

Actual behavior

The tensor is zeroed out after being transferred to the second device.

Screenshots

tensor([1, 2])
tensor([1, 2], device='cuda:0')
tensor([0, 0], device='cuda:1')

Additional context

I can mail 1-2 K80 cards to somebody to help work on this issue, since the cards are not as useful to me while it persists. Only one card is needed to reproduce the problem. Together the two can run machine learning models with up to 80 billion parameters when 4-bit quantization is used.

Here is a long thread on issues like this: https://github.com/pytorch/pytorch/issues/1637. It links to a test script which hangs with 100% GPU usage instead of showing corruption: https://gist.github.com/zou3519/f13145cafbb873a855ef524d6607125a

Here is a thread on this issue: https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748

Here is a short thread on this issue: https://github.com/pytorch/pytorch/issues/84803. Differences with that short thread:

I meant to add more tools than those below, but the following is what ended up here:

Solutions you've tried

Nvidia recommends disabling a feature called peer-to-peer access. When I tried to disable it I did not see any change. Disabling the IOMMU by passing a kernel parameter resolves the corruption, but unfortunately it makes transferring data between the cards prohibitively slow. Disabling ACS with setpci -s xx:yy.z ecap_acs+6.w=0 helps with the slowness.
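To make the ACS step concrete, here is a sketch; the bus address 40:01.0 is only a placeholder, and the usual targets are the bridges between the GPUs, typically the K80's on-board PLX switch ports:

    # show which devices expose ACS and their current ACSCtl settings (needs root)
    sudo lspci -vvv | grep -E "^[0-9a-f]{2}:|ACSCtl"
    # clear the ACS Control register (ecap_acs offset +6) on one bridge, e.g. 40:01.0
    sudo setpci -s 40:01.0 ecap_acs+6.w=0
    # verify the change
    sudo lspci -s 40:01.0 -vvv | grep ACSCtl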

zirblazer commented 1 year ago

Somehow I actually think that the IOMMU issues are related to PCI MMIO allocation. There shouldn't be Above 4G MMIO support in coreboot, and it seems to be required. These are not "Resizable BAR" cards that can work with either a 256 MiB MMIO window or a bigger one, but "Large BAR" cards that need a single, very big window.

https://forums.developer.nvidia.com/t/plugging-tesla-k80-results-in-pci-resource-allocation-error/37007/10
https://sjgf.medium.com/a-tesla-k80-and-ubuntu-in-a-consumer-motherboard-ab0edbf0e0d1
https://www.reddit.com/r/nvidia/comments/mkyozp/comment/gtjqngi/
https://www.reddit.com/r/homelab/comments/g3zo9z/nvidia_tesla_k80_not_working_in_one_of_my_servers/
https://www.reddit.com/r/homelab/comments/12g3hkp/recently_purchased_a_tesla_k80/
https://www.reddit.com/r/homelab/comments/q88qa9/motherboard_large_bar_support/

xloem commented 1 year ago

The Dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware. EDIT: But I am not familiar with IOMMUs and MMIO allocation, and I imagine the hardware IOMMU on the board may not have been designed for cards like these. Here is the tail of a conversation about large PCI mappings from 2016: https://coreboot.coreboot.narkive.com/9o8wc1ym/discussion-about-dynamic-pci-mmio-size-on-x86#post16

zirblazer commented 1 year ago

> The Dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware.

How can you confirm this?

I had issues with 2 x Radeon 5600 XT on the MSI board, which is more than a full decade newer than yours. The problems manifested differently depending on the OS (Windows vs. Linux) and the Linux kernel version: https://github.com/Dasharo/dasharo-issues/issues/245

Linux seems to be able to pretty much fully reallocate PCI MMIO resources AFTER boot, so you can get scenarios where the firmware does NOT support Above 4G / ReBAR out of the box but Linux takes care of enabling it on its side. I suppose there should be kernel parameters to make it honor the firmware-allocated resources. My theory is that Linux is reallocating resources but fails to reconfigure the IOMMU with the new values, so there is a mismatch.
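(If that theory applies here, one way to test it might be to stop Linux from reassigning PCI resources for a boot using the mainline pci=realloc=off parameter and then compare the kernel log; a sketch with example GRUB contents:)

    # /etc/default/grub: keep the firmware's PCI resource assignments for this test boot
    GRUB_CMDLINE_LINUX_DEFAULT="pci=realloc=off"
    sudo update-grub
    # after rebooting, look for BAR assignment/reallocation messages in the kernel log
    sudo dmesg | grep -iE "BAR|assigned|no space for"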

xloem commented 1 year ago

I received the BAR errors you linked when running the Raptor firmware, and I do not receive them on Dasharo. I loaded these devices with 40 GB of data and stored and accessed it reliably for days, but only with iommu=soft so far. I hope to learn how to verify these things better.
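(For reference, one way to see where the K80 BARs actually landed, assuming stock pciutils; NVIDIA's PCI vendor ID is 10de, and any "Memory at" address above 0xffffffff means the BAR was placed above 4G:)

    # list the NVIDIA devices and their memory BAR assignments
    sudo lspci -d 10de: -vv | grep -E "^[0-9a-f]|Memory at"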

miczyg1 commented 1 year ago

Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.

For the coreboot log, you can gather it with another machine by connecting to the KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16, as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility
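A sketch of the cbmem route, assuming the utility is built from the coreboot sources (the linked docs describe this in more detail):

    # build the cbmem utility from the coreboot tree
    git clone https://github.com/coreboot/coreboot.git
    make -C coreboot/util/cbmem
    # dump the in-memory coreboot console log and the kernel log
    sudo ./coreboot/util/cbmem/cbmem -c > coreboot.log
    sudo dmesg > dmesg.log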

xloem commented 1 year ago

> Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.
>
> For the coreboot log, you can gather it with another machine by connecting to the KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16, as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility

Attachments: coreboot.log, 2023-05-09T13:25:06+00:00.dmesg.log

I've paused updating this issue, as I've noticed the symptoms change depending on the kernel version and the ACSCtl flags on the PCI hubs. The transfers can be corrupt or intact, and slow, fast, or faster, and I haven't figured out yet what depends on what. I believe the files above produce the output described in the original post.

For example, I think I found that if I keep data within the PLX switch inside a single card, I lose the corruption and gain speed by disabling its ACSCtl on Linux 5.4, but after upgrading to Linux 6.1 this strangely no longer works.

xloem commented 1 year ago

Workaround for running K80s in parallel on the KGPE-D16: