ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

Failing to cleanup kfd_process information in sysfs #114

Closed morrone closed 1 month ago

morrone commented 3 years ago

We have noticed that the process directories under /sys/devices/virtual/kfd/kfd/proc are never being cleaned up. For instance, after a run of "rocm-bandwidth-test", the related process's directory under /sys/devices/virtual/kfd/kfd/proc stays around forever.
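One way to spot the leftover directories is to compare the entries under /sys/devices/virtual/kfd/kfd/proc against /proc. The following is a minimal userspace sketch, not part of the driver. It assumes each entry under the KFD proc directory is named after a PID, and the function names `is_pid_name` and `count_stale_kfd_entries` are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if name is a non-empty string of decimal digits, else 0. */
int is_pid_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name; name++)
        if (!isdigit((unsigned char)*name))
            return 0;
    return 1;
}

/*
 * Count PID-named entries in kfd_proc_dir that have no matching entry in
 * procfs_dir (assumption: entries are named after the owning PID).
 * Prints each stale entry and returns the count, or -1 if kfd_proc_dir
 * cannot be opened.
 */
int count_stale_kfd_entries(const char *kfd_proc_dir, const char *procfs_dir)
{
    DIR *dir;
    struct dirent *ent;
    int stale = 0;

    assert(kfd_proc_dir != NULL && procfs_dir != NULL);

    dir = opendir(kfd_proc_dir);
    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char path[4096];
        struct stat st;

        if (!is_pid_name(ent->d_name))
            continue; /* skips "." and ".." and any non-PID entries */

        snprintf(path, sizeof(path), "%s/%s", procfs_dir, ent->d_name);
        if (stat(path, &st) != 0) {
            printf("stale: %s/%s\n", kfd_proc_dir, ent->d_name);
            stale++;
        }
    }
    closedir(dir);
    return stale;
}
```

Calling `count_stale_kfd_entries("/sys/devices/virtual/kfd/kfd/proc", "/proc")` after the test program has exited would report any directories whose process is gone. PID reuse and PID namespaces can hide or fake a match, so treat the result as a hint rather than proof.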

We are using the ROCm 4.2.0 driver against a 4.18.0 kernel.

The code is using the mmu_notifier_put() strategy.

In debugging with systemtap, it would appear that kfd_process_notifier_release() is being called, but there appears to be no call to kfd_process_free_notifier().

I am also detecting no call to kfd_process_wq_release() using systemtap, which appears to be where sysfs_remove_file() would be called.

Are we expecting kfd_process_notifier_release() to be called before kfd_process_free_notifier()?

Is it our expectation that the mmu_notifier_put() in kfd_process_notifier_release() should allow kfd_process_free_notifier() to later be triggered, allowing the final kfd_unref_process()?
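The sequence being asked about can be modeled in userspace to make the dependency explicit. This is only a sketch of the ordering described in the questions above, not the driver code: the `model_*` names and the struct are hypothetical, and the real functions do considerably more work. It shows why the sysfs entry would survive if the deferred free callback never runs after the put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the parts of kfd_process relevant here. */
struct model_process {
    int refcount;             /* reference held on behalf of the notifier */
    bool notifier_put_called; /* set when the release step requests a put */
    bool sysfs_entry_present; /* the per-process directory under kfd/proc */
};

void model_init(struct model_process *p)
{
    p->refcount = 1;
    p->notifier_put_called = false;
    p->sysfs_entry_present = true;
}

/* Models kfd_process_wq_release(): the point where sysfs is cleaned up. */
static void model_wq_release(struct model_process *p)
{
    p->sysfs_entry_present = false;
}

/* Models kfd_unref_process(): the last reference triggers the release. */
void model_unref(struct model_process *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        model_wq_release(p);
}

/* Models kfd_process_notifier_release(): it only requests the put. */
void model_notifier_release(struct model_process *p)
{
    p->notifier_put_called = true; /* stands in for mmu_notifier_put() */
}

/* Models kfd_process_free_notifier(): drops the final reference. */
void model_free_notifier(struct model_process *p)
{
    model_unref(p);
}

/*
 * Run the teardown. kernel_runs_free_notifier says whether the deferred
 * callback is ever invoked after the put. Returns true if the sysfs
 * entry was removed.
 */
bool model_teardown(struct model_process *p, bool kernel_runs_free_notifier)
{
    model_notifier_release(p);
    if (kernel_runs_free_notifier)
        model_free_notifier(p);
    return !p->sysfs_entry_present;
}
```

With `kernel_runs_free_notifier` set to false, the model ends with the reference still held and the sysfs entry still present, which matches the symptom traced with systemtap.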

fxkamd commented 3 years ago

4.18 sounds like you're using RHEL 8 or CentOS 8. There is a workaround for a bug in the RHEL/CentOS 8.3 kernel in ROCm 4.2. Is that somehow not working for you? Or are you using a different RHEL/CentOS version that is not covered by this workaround?

commit 51c9f2dca839dce8eb599af86054a62035e09809
Author: Felix Kuehling <Felix.Kuehling@amd.com>
Date: Wed Jan 20 14:29:34 2021 +0800

drm/amdkcl: Work around mmu_notifier_put issue on RHEL 8.3

The DRM backport from kernel 5.6 includes some MMU notifier changes
that cause problems with the mmu_notifier_put function. The
free_notifier never gets called. This leads to a leak of kfd_process
structures and their doorbells.

Work around this by falling back to the old method of releasing the
MMU notifier and destroying the process structure.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

morrone commented 3 years ago

This is RHEL based, but for the TOSS OS. We have a patched kernel, and it looks like we are taking the RHEL rock dkms rpm and reworking it into a statically built kmod rpm.

I suspect that something in our stack doesn't let that patch undef HAVE_MMU_NOTIFIER_PUT, because it certainly looks like our end product was compiled to use mmu_notifier_put() rather than the alternate method.

Thanks, this helps a lot! Now I can stop trying to fix MMU notification and just focus on making the build use the alternate method.

morrone commented 3 years ago

Here are the versions on our system:

[ 130.804315] [drm] amdgpu version: 5.9.25
[ 130.808273] [drm] OS DRM version: 5.9.0

Commit 51c9f2d checks for DRM_PATCH == 6, so that is almost certainly why it doesn't undef HAVE_MMU_NOTIFIER_PUT for us at DRM patch level 9. That is easy enough to patch and test on our side.
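The effect of an exact-match version guard can be illustrated with a small standalone sketch. This is not the contents of the commit. DRM_PATCH and HAVE_MMU_NOTIFIER_PUT are the names mentioned in this thread, while DRM_VER, the macro values, and the helper `uses_mmu_notifier_put` are assumptions made for illustration, with the values taken from the "OS DRM version: 5.9.0" log line.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values, mirroring "OS DRM version: 5.9.0" from the log. */
#define DRM_VER   5
#define DRM_PATCH 9

/* Assume the build's configure step detected mmu_notifier_put(). */
#define HAVE_MMU_NOTIFIER_PUT 1

/* This sketch only models the 5.x DRM backports discussed here. */
static_assert(DRM_VER == 5, "sketch assumes a 5.x DRM backport");

/*
 * Exact-match guard in the style described above: it fires only for
 * patch level 6, so patch level 9 falls through with the macro still
 * defined.
 */
#if DRM_VER == 5 && DRM_PATCH == 6
#undef HAVE_MMU_NOTIFIER_PUT
#endif

/* Returns 1 if the build would take the mmu_notifier_put() path. */
int uses_mmu_notifier_put(void)
{
#ifdef HAVE_MMU_NOTIFIER_PUT
    return 1;
#else
    return 0;
#endif
}
```

With these values `uses_mmu_notifier_put()` returns 1, so the workaround is skipped. Widening the condition to cover patch level 9 would select the fallback path, but which DRM patch levels actually need the workaround is not established in this thread and would have to be tested.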

ppanchad-amd commented 1 month ago

@morrone Do you still need assistance with this ticket? If not, please close the ticket. Thanks!