Closed by drossetti 2 years ago
The referenced MOFED bug 2696789 states that the issue was found in MOFED 5.1-0.6.6.0. Does the deadlock also exist in the MOFED 4.9 LTS release? Is this PR needed on systems with pre-470 NVIDIA drivers and MOFED 4.9?
MOFED 4.9 LTS does not have this bug.
With this change, this client registers itself as an extended client. This way it can opt into a new behavior, i.e. unmap during invalidation.
The InfiniBand peer_mem kernel infrastructure responds by not calling the client's dma_unmap and put_pages callbacks in the invalidation path, so it no longer takes its internal lock, umem_p->mapping_lock, there. That avoids a lock inversion between umem_p->mapping_lock and an internal NVIDIA GPU kernel-mode driver lock, tracked as bug 2696789, "Peer-direct patch may cause deadlock due to lock inversion".
Note that the change is compatible with older versions of the peer_mem patch: the client registers successfully in all cases, but with those older versions the lock inversion problem persists.
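The opt-in and its fallback can be illustrated with a small userspace model. This is only a sketch, not the kernel code: every `sketch_*` and `SKETCH_*` name is hypothetical, the `core_supports_ex` flag stands in for "newer vs. older peer_mem patch", and a counter stands in for taking umem_p->mapping_lock. How the real infrastructure detects an extended client is not described here, so the size check below is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag modelling the "unmap during invalidation" opt-in. */
enum { SKETCH_INVALIDATE_UNMAPS = 1u << 0 };

/* Base client: the callbacks the core would normally invoke. */
struct sketch_client {
    void (*dma_unmap)(void *ctx);
    void (*put_pages)(void *ctx);
};

/* Extended client: the base client plus a size and opt-in flags. */
struct sketch_client_ex {
    struct sketch_client client;
    size_t ex_size;
    uint32_t flags;
};

/* Counters that let us observe what the invalidation path did. */
struct sketch_stats {
    int lock_taken;   /* stands in for taking umem_p->mapping_lock */
    int unmap_calls;
    int put_calls;
};

/* What the core remembers about a registered client. */
struct sketch_registration {
    const struct sketch_client *client;
    bool invalidate_unmaps;
};

static void count_unmap(void *ctx) { ((struct sketch_stats *)ctx)->unmap_calls++; }
static void count_put(void *ctx)   { ((struct sketch_stats *)ctx)->put_calls++; }

/* Registration always succeeds; only a core that understands the
 * extension (assumed here) reads the flags. */
static struct sketch_registration
sketch_register(const struct sketch_client_ex *ex, bool core_supports_ex)
{
    assert(ex != NULL);
    struct sketch_registration reg = { &ex->client, false };
    if (core_supports_ex && ex->ex_size >= sizeof(struct sketch_client_ex))
        reg.invalidate_unmaps = (ex->flags & SKETCH_INVALIDATE_UNMAPS) != 0;
    return reg;
}

/* Invalidation path: with the opt-in, skip the lock and both callbacks. */
static void sketch_invalidate(const struct sketch_registration *reg,
                              struct sketch_stats *st)
{
    if (reg->invalidate_unmaps)
        return; /* client is assumed to handle its own teardown */
    st->lock_taken++;
    reg->client->dma_unmap(st);
    reg->client->put_pages(st);
}

/* Register an opted-in client against a new or old core, then invalidate. */
static struct sketch_stats sketch_run(bool core_supports_ex)
{
    struct sketch_client_ex ex = {
        .client = { .dma_unmap = count_unmap, .put_pages = count_put },
        .ex_size = sizeof(struct sketch_client_ex),
        .flags = SKETCH_INVALIDATE_UNMAPS,
    };
    struct sketch_stats st = { 0, 0, 0 };
    struct sketch_registration reg = sketch_register(&ex, core_supports_ex);
    sketch_invalidate(&reg, &st);
    return st;
}
```

Calling `sketch_run(true)` models the newer infrastructure: the lock counter and both callback counters stay at zero. Calling `sketch_run(false)` models an older peer_mem patch: registration still works, but the invalidation path takes the lock and calls both callbacks, which is the situation in which the inversion can still occur.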
Bonus: fix a long-standing race in nv_mem_put_pages.