xloem opened 1 year ago
Somehow I actually think that the IOMMU issues are related to PCI MMIO allocation. Coreboot does not appear to support Above 4G MMIO decoding, and it seems to be required here. These are not "Resizable BAR" cards that can work with a 256 MiB MMIO window or a larger one, but "Large BAR" cards, which expose a single, very big BAR:
https://forums.developer.nvidia.com/t/plugging-tesla-k80-results-in-pci-resource-allocation-error/37007/10
https://sjgf.medium.com/a-tesla-k80-and-ubuntu-in-a-consumer-motherboard-ab0edbf0e0d1
https://www.reddit.com/r/nvidia/comments/mkyozp/comment/gtjqngi/
https://www.reddit.com/r/homelab/comments/g3zo9z/nvidia_tesla_k80_not_working_in_one_of_my_servers/
https://www.reddit.com/r/homelab/comments/12g3hkp/recently_purchased_a_tesla_k80/
https://www.reddit.com/r/homelab/comments/q88qa9/motherboard_large_bar_support/
The dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware. EDIT: But I am not familiar with IOMMUs and MMIO allocation, and I imagine the hardware IOMMU on the board may not have been designed for these cards. Here's the tail of a conversation on large PCI mapping from 2016: https://coreboot.coreboot.narkive.com/9o8wc1ym/discussion-about-dynamic-pci-mmio-size-on-x86#post16
> The dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware.
How can you confirm this?
I had issues with 2 x Radeon 5600XT on the MSI, which is more than a full decade newer than yours. The problems manifest depending on OS (Windows vs Linux) and Linux Kernel version: https://github.com/Dasharo/dasharo-issues/issues/245
Linux seems to be able to pretty much fully reallocate PCI MMIO resources AFTER boot, so you can get scenarios where the firmware does NOT support Above 4G / ReBAR out of the box but Linux takes care of enabling it on its side. I suppose there should be kernel parameters to make it honor firmware-allocated resources. My theory is that Linux is reallocating resources but fails to reconfigure the IOMMU with the new values, so there is a mismatch.
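For reference, the standard kernel command-line options that steer this behavior (these are generic Linux `pci=`/`iommu=` parameters, not anything specific to Dasharo or this board):

```shell
# Relevant kernel command-line options
# (documented in Documentation/admin-guide/kernel-parameters.txt):
#
#   pci=realloc=off   # keep the firmware's PCI resource assignments
#   pci=realloc=on    # force the kernel to reassign PCI BARs itself
#   iommu=soft        # use software bounce buffers (swiotlb) instead of the hardware IOMMU
#   iommu=off         # do not use an IOMMU at all
```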
I received the BAR errors you linked in the raptor firmware, and do not receive them on the dasharo. I loaded these devices with 40 GB of data and stored and accessed it reliably on them for days, but only with `iommu=soft` so far. I hope to learn how to know these things better.
Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.
For the coreboot log you can gather it with another machine by connecting to KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16 as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility
> Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.
> For the coreboot log you can gather it with another machine by connecting to KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16 as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility
Attachments: `coreboot.log`, `2023-05-09T13:25:06+00:00.dmesg.log`
I've paused updating this issue, as I've noticed the symptoms change depending on kernel version and the ACSCtl flag on PCI hubs. The transfers can be corrupt or intact, and slow, fast, or faster, and I haven't figured out what depends on what yet. I believe the above files produce the output described in the original post.
For example, I think I found that if I keep data within the PLX ~~hub~~ switch inside a single card, I can lose the corruption and gain speed by disabling its ACSCtl in Linux 5.4, but if I upgrade to Linux 6.1 this strangely no longer works.
Workaround for running K80s in parallel on the KGPE-D16:

1. Install `cuda-drivers-470` and `cuda-toolkit-11-4`, and hold `libnccl2` and `libnccl-dev` at version `2.11.4-1+cuda-11.4`. NCCL needs to be built against the right CUDA version to run, and manages inter-PCI communication.
2. Run `for dev in 22:08.0 22:10.0 a9:08.0 a9:10.0; do setpci -s $dev ecap_acs+6.w=0; done`. This gets high bandwidth between their onboard cards for me when tested with `p2pBandwidthTest` from the CUDA examples.
3. Set `NCCL_P2P_LEVEL=PIX`. This tells NCCL to only use PCI P2P between cards that share the same immediate parent switch. This gets `all_reduce_perf` from the NCCL examples to run without hanging or corrupting for me.
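As a side note on what that `setpci` write does: `ecap_acs+6` is the ACS Control register of the PCIe Access Control Services capability, and writing 0 clears every ACS feature, including P2P Request Redirect, so peer-to-peer TLPs between downstream ports are no longer forced up through the root complex. A small sketch decoding that register (bit names taken from the PCIe specification; this is illustrative code, not something from this thread):

```python
# Decode the 16-bit PCIe ACS Control register (ecap_acs+6).
# Bit assignments follow the PCIe Access Control Services capability.
ACS_CONTROL_BITS = {
    0: "Source Validation",
    1: "Translation Blocking",
    2: "P2P Request Redirect",
    3: "P2P Completion Redirect",
    4: "Upstream Forwarding",
    5: "P2P Egress Control",
    6: "Direct Translated P2P",
}

def decode_acs_ctl(value: int) -> list[str]:
    """Return the ACS features enabled in an ACS Control register value."""
    return [name for bit, name in ACS_CONTROL_BITS.items() if value & (1 << bit)]

print(decode_acs_ctl(0x0000))  # after the workaround: everything disabled -> []
print(decode_acs_ctl(0x001d))  # an example value with several features enabled
```

Reading the register back with `setpci -s <dev> ecap_acs+6.w` (no `=`) shows the current value, so `0000` confirms the workaround took effect.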
**Dasharo version**
v0.4.0

**Dasharo variant**
Asus KGPE-D16

**Affected component(s) or functionality**
iommu?

**Brief summary**
I am experiencing data corruption between PCI cards that resolves only when the kernel is booted with `iommu=soft` or `iommu=off`. The issue limits the use of the system for AI/ML.

**How reproducible**
100%

**How to reproduce**
I am still working on narrowing the issue down to reduce the parts needed to reproduce, but only have my two Nvidia K80 cards to test IOMMU with at the moment.

Steps to reproduce the behavior:

1. `python3 -m pip install torch`
2. Create `test.py` as follows:
3. Run `python3 test.py`
**Expected behavior**

The array should hold the same data on every device.

**Actual behavior**

The array is emptied to zeros after being transferred to the second device.
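The contents of `test.py` are not reproduced above. A minimal script along these lines (a hypothetical sketch supplied for illustration, not the exact script from the issue) would exercise the same path:

```python
# Hypothetical reproducer: round-trip a tensor through every visible
# device and check that the data survives. On the affected system the
# hop to the second GPU reportedly returns zeros unless the kernel is
# booted with iommu=soft or iommu=off.
import torch

def transfer_intact(x: torch.Tensor, devices: list) -> bool:
    """Copy x through each device in sequence; True if the data is unchanged."""
    y = x
    for dev in devices:
        y = y.to(dev)
    return torch.equal(x, y.to(x.device))

if __name__ == "__main__":
    data = torch.arange(1024, dtype=torch.float32)
    gpus = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    print("intact:", transfer_intact(data, gpus or ["cpu"]))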
**Additional context**
I can mail 1-2 K80 cards to somebody to help work this issue, since the cards are not as useful to me while it is present. Only one is needed to reproduce it. Together they can run machine learning models with up to 80 billion parameters if 4-bit quantization is used.
Here is a long thread on issues like this: https://github.com/pytorch/pytorch/issues/1637 it links to a test script which hangs with 100% gpu usage instead of showing corruption: https://gist.github.com/zou3519/f13145cafbb873a855ef524d6607125a
Here is a thread on this issue: https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748
Here is a short thread on this issue: https://github.com/pytorch/pytorch/issues/84803 Differences with short thread:
I meant to add more tools than the below, but below is what ended up here:
**Solutions you've tried**

Nvidia recommends disabling a feature called peer-to-peer. When I tried to disable this I did not see any change. Disabling the IOMMU by passing a kernel parameter resolves the issue; unfortunately this makes it prohibitively slow to transfer data between the cards. Disabling ACS with `setpci -s xx:yy.z ecap_acs+6.w=0` helps with the slowness.