VivekPanyam opened this issue 2 years ago
Hi,
HMM functionality is not available because our kernel driver does not yet support it. We are working on that, but we routinely do not provide timelines or ETAs for things like that, sorry for the vagueness there.
Hi John!
Didn't expect you to respond directly :) When you say "our kernel driver," what are you referring to specifically? The code in this repo or a binary blob somewhere else?
Is there anything that can be done in the OSS codebase to accelerate HMM support or is it blocked by NVIDIA internal dependencies?
Thanks!
I'm referring to the code in this repo. In fact, maybe I should have written "kernel drivers", because both nvidia.ko and nvidia-uvm.ko are involved in supporting HMM.
As for accelerating development, nvidia-uvm.ko in particular is built from some very complex source code, due to the need to handle every aspect of the CUDA programming model. Adding production-quality HMM support to that simply takes time.
We realize that this is in demand and are working on it.
That makes sense. Is there a public issue tracking HMM support? If not, would you mind commenting on this issue when there's something publicly available to test?
Thanks again
I think that this issue might be the public issue that tracks HMM support. :)
Sure, I'll make a note to update this as part of the "release HMM support" steps.
Hi @johnhubbard!
There was a post on the NVIDIA tech blog yesterday (November 10th) that talks about HMM:
For PCIe-based platforms such as x86 or Arm, you can use the same Unified Memory programming model as the NVIDIA Grace Hopper model. That is possible through the Heterogeneous Memory Management (HMM) feature, which is a combination of Linux kernel features and NVIDIA driver features that use software to emulate memory coherence between CPUs and GPUs.
- https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/
It seems like HMM is still disabled in the 525.53 code drop from 15 hours ago: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/758b4ee8189c5198504cb1c3c5bc29027a9118a3/kernel-open/nvidia-uvm/uvm_hmm.c#L72-L88
Am I missing something?
Thanks!
That blog post was in error. After receiving your question here, we have corrected the blog to reflect that HMM is not yet supported in our driver. Thanks for alerting us, and sorry for the incorrect information that went out.
cc @sdake
Worth noting that the public alpha of this was (silently) pushed as part of r530. Seems to work ok w/ some testing so far.
Yes, an early version of HMM support is included in the r530 release. However, as I wrote here: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/uvm_hmm.c#L42, it is not ready for production use. That's why it was "silently" included. Once it is ready for production use, we will formally announce that (as per my Aug 3, 2022 comment here).
@johnhubbard can you say whether HMM support, once ready for production use, will also be enabled in the closed-source kernel driver? I'm asking for pre-Turing cards like the Titan V (Volta).. Also, is Windows HMM support planned eventually (even if only in TCC mode, or will it come to WDDM mode as well)? thanks..
NVIDIA proprietary driver (530.30.02) using A30:
[12929.936750] Call Trace:
[12929.937530] __schedule+0x282/0x870
[12929.938611] ? kvm_sched_clock_read+0xd/0x20
[12929.939703] schedule+0x46/0xb0
[12929.940439] rwsem_down_write_slowpath+0x257/0x4d0
[12929.941548] ? __free_slab+0xcf/0x1d0
[12929.942426] uvm_perf_thrashing_stop+0x3d/0xa0 [nvidia_uvm]
[12929.943756] uvm_va_space_destroy+0xa4/0x480 [nvidia_uvm]
[12929.945018] uvm_release.constprop.0+0x93/0xc0 [nvidia_uvm]
[12929.946309] uvm_release_entry.part.0.isra.0+0x7a/0xb0 [nvidia_uvm]
[12929.947770] ? up+0x12/0x60
[12929.948428] ? __fput+0x100/0x240
[12929.949166] ? kmem_cache_free+0xff/0x420
[12929.949871] ? mntput_no_expire+0x47/0x270
[12929.950591] __fput+0x92/0x240
[12929.951134] task_work_run+0x62/0xa0
[12929.951792] do_exit+0x34b/0xa90
[12929.952367] ? __schedule+0x28a/0x870
[12929.953019] ? timerqueue_del+0x1e/0x50
[12929.953691] do_group_exit+0x33/0xa0
[12929.954325] get_signal+0x170/0x890
[12929.954942] arch_do_signal_or_restart+0xf1/0x7e0
[12929.955789] ? do_epoll_wait+0xd8/0x670
[12929.956468] ? hrtimer_interrupt+0x15d/0x2c0
[12929.957221] ? handle_irq_event+0x73/0xb0
[12929.957932] exit_to_user_mode_prepare+0xff/0x160
[12929.958586] syscall_exit_to_user_mode+0x28/0x150
[12929.959187] entry_SYSCALL_64_after_hwframe+0x61/0xc6
@sdake it was said it works with the open kernel module only, but thanks for testing and confirming it doesn't work on the proprietary driver (at least right now).. as said earlier, I hope the proprietary kernel driver gets HMM enabled too, since it's the only option for pre-Turing cards (but I think HMM is Pascal+ only, so it would only matter for the Pascal and Volta generations)..
HMM depends upon the open source version of the driver. The open source version of the driver, in turn, only works on Turing and later GPUs.
As it says in the r530_00 release notes, Ch. 44, "The open flavor of kernel modules supports Turing, Ampere, and forward. The open kernel modules cannot support GPUs before Turing, because the open kernel modules depend on the GPU System Processor (GSP) first introduced in Turing."
Therefore, HMM is only available on Turing and later GPUs.
@johnhubbard thanks! so it seems no Windows support is planned (even for WSL2), right? In any driver mode, either TCC or WDDM..
Right, no Windows support exists.
The HMM feature required OS kernel changes, in addition to changes in our driver stack here. The open source Linux kernel, and the kernel community, made it possible to make such changes.
On Windows, however, Microsoft has not made any such corresponding changes, so HMM is unavailable there.
@oscarbg all good. I reported the stack trace from the production driver available from NVIDIA's deb repos (installed via apt install cuda, using CUDA 12.1). I can say HMM "functions" on the proprietary driver, although its reliability is very poor, generating kernel stack traces often.
I will try the open-source kernel driver, and report kernel traces or other bad behavior here.
Does the GRID driver function with HMM?
TY! -steve
thanks @sdake.. thanks @johnhubbard.. seems I have bad luck testing HMM with a new Ada 4070 GPU and NVIDIA 530.41.03, with the prebuilt open kernel module installed via sh ./NVIDIA-Linux-[...].run -m=kernel-open. I compiled the sample and am getting results=0 and:
[ 4.618595] NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
[ 4.663159] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 530.41.03 Release Build (dvs-builder@U16-T02-35-3) Thu Mar 16 19:33:35 UTC 2023
[ 5.358387] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 530.41.03 Release Build (dvs-builder@U16-T02-35-3) Thu Mar 16 19:23:50 UTC 2023
[ 329.642495] NVRM: GPU at PCI:0000:01:00: GPU-b53502d1-facc-b12d-2a6c-cbb01b5beae4
[ 329.642499] NVRM: Xid (PCI:0000:01:00): 31, pid=7087, name=sm80, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x5560_dc4f4000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
[ 330.484339] NVRM: Xid (PCI:0000:01:00): 31, pid=7105, name=sm80.out, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x55d5_7e958000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
so it's not working.. is it because I didn't build the kernel module myself (maybe the prebuilt open kernel module doesn't enable HMM by default), or because HMM was disabled in 530.41.03 after the early 530.xx beta? thanks..
HMM support is disabled by default in the r530 driver. That's why your sample is failing as shown above.
There is no need to recompile the kernel module. You can set a driver load parameter to enable HMM. Look at modinfo and /etc/modules. I don't have the exact commands, as I am typing this on a phone.
Cheers Steve
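For reference, a hedged sketch of what Steve describes: persisting a module option for nvidia-uvm and reloading the module, with no kernel recompile. The thread never names the exact parameter; uvm_disable_hmm is an assumption based on the kernel-open/nvidia-uvm/uvm_hmm.c source linked above, so verify it with modinfo first.

```shell
# List HMM-related parameters the module actually advertises
# (the parameter name below is an assumption -- check this output first).
modinfo -p nvidia-uvm | grep -i hmm

# Persist the option and reload the module; no kernel recompile needed.
echo "options nvidia-uvm uvm_disable_hmm=0" | sudo tee /etc/modprobe.d/nvidia-uvm-hmm.conf
sudo modprobe -r nvidia-uvm && sudo modprobe nvidia-uvm
```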
@johnhubbard @sdake thanks both.. tested with the needed uvm module parameter and it works! Only the open-source kernel module works, as shared in this thread.. curious why the closed-source module accepts the same parameter but doesn't work.. The situation is a little sad, since the closed-source modules are currently needed if you want to use Gsync, for example.. so hoping for open-source Gsync support soon (535.xx?) or closed-source HMM support.. ideally both..
Yes, we're working hard to close the remaining feature gaps (such as Gsync) in the open kernel modules. I can't promise particular releases here, but yes: everything should ultimately converge in the open kernel modules.
May I ask your use case for HMM? For ML, I am unconvinced there are not superior methods to manage memory. I was originally interested in the ML use case. Also, thanks for pointing out this is only within the open-source drivers.
I am running A30s, A40s, A2s, and A100s. So when I tested with the commercial drivers, I saw no measurable benefit, because the commercial drivers are not yet enabled!
Thanks for your use case information.
Cheers Steve
@aritger thanks for the information.. @sdake really just experimenting for fun.. no real use case right now..
Has anyone benchmarked this approach for ML workloads versus, say, Microsoft's awesome work with DeepSpeed? This feels like a solution seeking a problem.
Thank you, Steve
HMM is now supported with CUDA 12.2. Please see https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/ for more information.
@johnhubbard I'll just ask this simply:
What does it take to have HMM enabled and nvidia-smi -q to show "Addressing Mode : HMM" on an HGX-H100 x86 system running Ubuntu 22.04 with the 5.15.0-105-generic kernel?
Driver Version: 550.90.07, CUDA Version: 12.4
Are there any BIOS settings or kernel params to be passed via GRUB that are not documented anywhere?
edit: Going to try and upgrade the kernel to linux-image-6.8.0-39-generic, based on the blog stating: "A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+."
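As a quick sanity check against those minimums, a kernel release string can be compared in version order with sort -V. This is a minimal sketch that only tests the simple 6.3+ case; the 6.1.24+ and 6.2.11+ backport minimums would need their own checks.

```shell
# Minimal sketch: succeed if the first version is >= the second, in version order.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Compare the running kernel (base version, before the first "-") against 6.3.
version_ge "$(uname -r | cut -d- -f1)" "6.3" \
  && echo "kernel meets the 6.3+ minimum" \
  || echo "kernel below 6.3; check the 6.1.24+/6.2.11+ backport minimums"
```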
@bhaveshdavda there is not a simple answer, unfortunately.
A working kernel.org kernel is available in this repository as a Dockerfile, which you can build locally. After running build.sh, the target directory will then contain 4 .deb files you can co-install.
https://github.com/artificialwisdomai/origin
Please let me know how it goes.
Thanks -steve
Update. I finally got this working, in a Kubernetes environment no less, with the NVIDIA GPU Operator. Notes:
- uvm_ats_mode=0 passed to the nvidia-uvm module
It isn't necessary to upgrade the kernel; instead, it is necessary to configure the one you have properly. There is one additional config option required. I didn't need uvm_ats_mode=0, and I am not sure why you would want to turn off the Address Translation Service, as it protects the platform's DMA operations from third-party corruption.
There is a Dockerfile to build a Debian upstream kernel (it should work fine with Ubuntu as well) here:
https://github.com/artificialwisdomai/origin/tree/main/platform/Dockerfiles/linux-kernel
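To see how uvm_ats_mode is actually set on a running system, the loaded module's parameters can be read back directly. Note the assumptions: the nvidia-uvm module must be loaded, the kernel exposes it under sysfs as nvidia_uvm (underscore form), and whether a given parameter appears in sysfs varies by driver release.

```shell
# Read the live value of uvm_ats_mode for the loaded nvidia-uvm module,
# if the driver exposes it via sysfs.
cat /sys/module/nvidia_uvm/parameters/uvm_ats_mode

# List all parameters the module advertises, with descriptions.
modinfo -p nvidia-uvm
```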
@sdake I agree with your statement about disabling ATS not being required, and I too assumed ATS is an important security feature for PCIe. I also feel like the stock Ubuntu 22.04 LTS kernel 5.15.0-105-generic would probably also work, as it has the right Kconfig. I think the main variable is using the open kernel module instead of the proprietary module, because the nvidia-uvm module has the uvm-hmm.c implementation #ifdef'ed out in the latter.
Edit:
- uvm_ats_mode=0 is not required
- The 5.15.0-105-generic kernel definitely doesn't work, but stock Ubuntu 22.04 6.8.0-39-generic does
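For anyone else checking their distro kernel, the Kconfig options that upstream HMM relies on can be inspected directly. The option names below are the standard upstream ones (CONFIG_DEVICE_PRIVATE, CONFIG_HMM_MIRROR, CONFIG_MMU_NOTIFIER); the config file path is a Debian/Ubuntu convention, and other distros may expose /proc/config.gz instead.

```shell
# Inspect the running kernel's config for HMM-related options.
# /boot/config-$(uname -r) is a Debian/Ubuntu convention.
grep -E "CONFIG_DEVICE_PRIVATE|CONFIG_HMM_MIRROR|CONFIG_MMU_NOTIFIER" \
  "/boot/config-$(uname -r)"
```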
NVIDIA Open GPU Kernel Modules Version
515.57 Release
Does this happen with the proprietary driver (of the same version) as well?
Yes
Operating System and Version
Ubuntu 20.04.1 LTS
Kernel Release
5.13.0-1029-aws
Hardware: GPU
NVIDIA T4
Describe the bug
Hello!
HMM support has been mentioned in several NVIDIA docs and presentations since 2017 (including the announcement of the open-source kernel modules), but it seems to be disabled here (and doesn't work when using the proprietary driver).
https://github.com/NVIDIA/open-gpu-kernel-modules/blob/d8f3bcff924776518f1e63286537c3cf365289ac/kernel-open/nvidia-uvm/uvm_hmm.h#L35-L50
I assume the referenced bug/task is internal. Is there any information you can share on what additional work needs to happen to enable UVM-HMM (or potentially a timeline)?
See references and a repro below.
To Reproduce
I'm testing HMM using the following code (from one of the presentations linked below):
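The exact sample was not preserved here, but the HMM tests shown in NVIDIA's presentations follow this shape: a GPU kernel dereferencing a plain malloc()'ed host pointer, which only works when the driver can mirror system memory (HMM, or hardware ATS on coherent platforms). A hedged reconstruction, not the original code:

```cuda
// Sketch of an HMM smoke test: the GPU writes through an ordinary
// malloc()'ed pointer, with no cudaMalloc/cudaMallocManaged involved.
#include <cstdio>
#include <cstdlib>

__global__ void gpu_write(int *ptr, int value) {
    *ptr = value;  // faults (Xid 31) unless the driver can mirror system memory
}

int main() {
    int *data = (int *)malloc(sizeof(int));  // plain system memory
    *data = 0;
    gpu_write<<<1, 1>>>(data, 1);
    cudaDeviceSynchronize();
    printf("Results: %d\n", *data);  // "Results: 0" means HMM is not active
    free(data);
    return 0;
}
```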
It currently just prints Results: 0 and the following message is in the output of dmesg:
Bug Incidence
Always
nvidia-bug-report.log.gz
N/A
More Info
References: