xloem opened 1 year ago
Somehow I actually think that the IOMMU issues are related to PCI MMIO allocation. Coreboot does not appear to support Above 4G MMIO decoding, and it seems to be required here. These are not "Resizable BAR" cards that can work with a 256 MiB MMIO window or a larger one, but "Large BAR" cards, which expose a single, very big BAR:
https://forums.developer.nvidia.com/t/plugging-tesla-k80-results-in-pci-resource-allocation-error/37007/10
https://sjgf.medium.com/a-tesla-k80-and-ubuntu-in-a-consumer-motherboard-ab0edbf0e0d1
https://www.reddit.com/r/nvidia/comments/mkyozp/comment/gtjqngi/
https://www.reddit.com/r/homelab/comments/g3zo9z/nvidia_tesla_k80_not_working_in_one_of_my_servers/
https://www.reddit.com/r/homelab/comments/12g3hkp/recently_purchased_a_tesla_k80/
https://www.reddit.com/r/homelab/comments/q88qa9/motherboard_large_bar_support/
The dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware. EDIT: But I am not familiar with IOMMUs and MMIO allocation, and I imagine the hardware IOMMU on the board may not have been designed for these cards. Here's the tail of a conversation on large PCI mapping from 2016: https://coreboot.coreboot.narkive.com/9o8wc1ym/discussion-about-dynamic-pci-mmio-size-on-x86#post16
> The dasharo coreboot is doing Above 4G fine for me. This is actually the reason I am using this firmware.
How can you confirm this?
I had issues with 2 x Radeon 5600XT on the MSI, which is more than a full decade newer than yours. The problems manifest depending on OS (Windows vs Linux) and Linux Kernel version: https://github.com/Dasharo/dasharo-issues/issues/245
Linux seems to be able to pretty much fully reallocate PCI MMIO resources AFTER boot, so you can get scenarios where the firmware does NOT support Above 4G / ReBAR out of the box but Linux takes care of enabling it on its side. I suppose there should be kernel parameters to make it honor firmware-allocated resources. My theory is that Linux is reallocating resources but fails to reconfigure the IOMMU with the new values, so there is a mismatch.
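For reference, the standard kernel command-line options that steer this behavior (these are generic Linux `pci=`/`iommu=` parameters, not anything specific to Dasharo or this board):

```shell
# Relevant kernel command-line options
# (documented in Documentation/admin-guide/kernel-parameters.txt):
#
#   pci=realloc=off   # keep the firmware's PCI resource assignments
#   pci=realloc=on    # force the kernel to reassign PCI BARs itself
#   iommu=soft        # use software bounce buffers (swiotlb) instead of the hardware IOMMU
#   iommu=off         # do not use an IOMMU at all
```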
I received the BAR errors you linked in the raptor firmware, and do not receive them on the dasharo. I loaded these devices with 40 GB of data and stored and accessed it reliably on them for days, but only with `iommu=soft` so far. I hope to learn how to know these things better.
Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.
For the coreboot log you can gather it with another machine by connecting to KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16 as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility
> Would be great to have a coreboot log and kernel dmesg (you can remove any privacy-sensitive data as you please) if possible.
> For the coreboot log you can gather it with another machine by connecting to KGPE's RS232 serial port exposed on the board during power on, or alternatively use the cbmem utility from a running Linux on the KGPE-D16 as described here: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility
Attachments: `coreboot.log`, `2023-05-09T13:25:06+00:00.dmesg.log`
I've paused updating this issue, as I've noticed the symptoms change depending on kernel version and the ACSCtl flag on PCI hubs. The transfers can be corrupt or intact, and slow, fast, or faster, and I haven't figured out what depends on what yet. I believe the above files produce the output described in the original post.
For example, I think I found that if I keep data within the PLX ~~hub~~ switch inside a single card, I can lose the corruption and gain speed by disabling its ACSCtl in Linux 5.4, but if I upgrade to Linux 6.1 this strangely no longer works.
Workaround for running K80s in parallel on the KGPE-D16:

1. Install `cuda-drivers-470` and `cuda-toolkit-11-4`, and hold `libnccl2` and `libnccl-dev` at version `2.11.4-1+cuda-11.4`. NCCL needs to be built against the right CUDA version to run, and manages inter-PCI communication.
2. Run `for dev in 22:08.0 22:10.0 a9:08.0 a9:10.0; do setpci -s $dev ecap_acs+6.w=0; done`. This gets high bandwidth between their onboard cards for me when tested with `p2pBandwidthTest` from the CUDA examples.
3. Set `NCCL_P2P_LEVEL=PIX`. This tells NCCL to only use PCI P2P between cards that share the same immediate parent switch. This gets `all_reduce_perf` from the NCCL examples to run without hanging or corrupting for me.
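As a side note on what that `setpci` write does: `ecap_acs+6` is the ACS Control register of the PCIe Access Control Services capability, and writing 0 clears every ACS feature, including P2P Request Redirect, so peer-to-peer TLPs between downstream ports are no longer forced up through the root complex. A small sketch decoding that register (bit names taken from the PCIe specification; this is illustrative code, not something from this thread):

```python
# Decode the 16-bit PCIe ACS Control register (ecap_acs+6).
# Bit assignments follow the PCIe Access Control Services capability.
ACS_CONTROL_BITS = {
    0: "Source Validation",
    1: "Translation Blocking",
    2: "P2P Request Redirect",
    3: "P2P Completion Redirect",
    4: "Upstream Forwarding",
    5: "P2P Egress Control",
    6: "Direct Translated P2P",
}

def decode_acs_ctl(value: int) -> list[str]:
    """Return the ACS features enabled in an ACS Control register value."""
    return [name for bit, name in ACS_CONTROL_BITS.items() if value & (1 << bit)]

print(decode_acs_ctl(0x0000))  # after the workaround: everything disabled -> []
print(decode_acs_ctl(0x001d))  # an example value with several features enabled
```

Reading the register back with `setpci -s <dev> ecap_acs+6.w` (no `=`) shows the current value, so `0000` confirms the workaround took effect.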
**Dasharo version**
v0.4.0

**Dasharo variant**
Asus KGPE-D16

**Affected component(s) or functionality**
iommu?

**Brief summary**
I am experiencing data corruption between PCI cards that resolves only when the kernel is booted with `iommu=soft` or `iommu=off`. The issue limits the use of the system for AI/ML.

**How reproducible**
100%

**How to reproduce**
I am still working on narrowing the issue down to reduce the parts needed to reproduce, but only have my two Nvidia K80 cards to test IOMMU with at the moment.

Steps to reproduce the behavior:

1. `python3 -m pip install torch`
2. Create `test.py` as follows:
3. Run `python3 test.py`
**Expected behavior**

The array should hold the same data on every device.

**Actual behavior**

The array is emptied to zeros after being transferred to the second device.
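The contents of `test.py` are not reproduced above. A minimal script along these lines (a hypothetical sketch supplied for illustration, not the exact script from the issue) would exercise the same path:

```python
# Hypothetical reproducer: round-trip a tensor through every visible
# device and check that the data survives. On the affected system the
# hop to the second GPU reportedly returns zeros unless the kernel is
# booted with iommu=soft or iommu=off.
import torch

def transfer_intact(x: torch.Tensor, devices: list) -> bool:
    """Copy x through each device in sequence; True if the data is unchanged."""
    y = x
    for dev in devices:
        y = y.to(dev)
    return torch.equal(x, y.to(x.device))

if __name__ == "__main__":
    data = torch.arange(1024, dtype=torch.float32)
    gpus = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    print("intact:", transfer_intact(data, gpus or ["cpu"]))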
**Additional context**
I can mail 1-2 K80 cards to somebody to help work this issue, since the cards are not as useful to me while it is present. Only one is needed to reproduce it. Together they can run machine learning models with up to 80 billion parameters if 4-bit quantization is used.
Here is a long thread on issues like this: https://github.com/pytorch/pytorch/issues/1637 it links to a test script which hangs with 100% gpu usage instead of showing corruption: https://gist.github.com/zou3519/f13145cafbb873a855ef524d6607125a
Here is a thread on this issue: https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748
Here is a short thread on this issue: https://github.com/pytorch/pytorch/issues/84803 Differences with short thread:
I meant to add more tools than the below, but below is what ended up here:
**Solutions you've tried**

Nvidia recommends disabling a feature called peer-to-peer. When I tried to disable this I did not see any change. Disabling the IOMMU by passing a kernel parameter resolves the issue; unfortunately this makes it prohibitively slow to transfer data between the cards. Disabling ACS with `setpci -s xx:yy.z ecap_acs+6.w=0` helps with the slowness.