Mellanox / nv_peer_memory


add support for new style peer memory client registration, fix a race #92

Closed drossetti closed 2 years ago

drossetti commented 2 years ago

With this change, this client registers itself as an extended client. This way it can opt into a new behavior, i.e. unmap during invalidation.

The InfiniBand peer_mem kernel infrastructure reacts by no longer calling the client's dma_unmap and put_pages callbacks in the invalidation path, so it does not take its internal lock, umem_p->mapping_lock. That avoids a lock inversion between umem_p->mapping_lock and an internal NVIDIA GPU kernel-mode driver lock, tracked as bug 2696789, "Peer-direct patch may cause deadlock due to lock inversion".

Note that the change is compatible with older versions of the peer_mem patch: the client registers successfully in all cases, but with an older patch the lock inversion problem persists.

Bonus: fix a race in nv_mem_put_pages which has been around basically forever.
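For readers who want a concrete picture of the registration and invalidation behavior described above, here is a minimal userspace sketch. It is only a model, not the kernel code: `struct client`, `struct core`, `CLIENT_FLAG_INVALIDATE_UNMAPS`, `core_register`, `core_invalidate` and `lock_acquisitions_during_invalidation` are all hypothetical names made up for illustration, and a counter stands in for umem_p->mapping_lock.

```c
/* Userspace model only: every name here is hypothetical, not the kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define CLIENT_FLAG_INVALIDATE_UNMAPS 0x1u /* hypothetical opt-in flag */

struct client {
    unsigned flags;        /* behavior requested at registration */
    int dma_unmap_calls;   /* times the core called the dma_unmap callback */
    int put_pages_calls;   /* times the core called the put_pages callback */
};

struct core {
    bool supports_extended; /* false models an older peer_mem patch */
    bool client_unmaps;     /* behavior actually granted to the client */
    int mapping_lock_taken; /* stands in for taking umem_p->mapping_lock */
};

/* Registration succeeds in all cases; only a new core honors the flag. */
static bool core_register(struct core *core, const struct client *c)
{
    assert(core != NULL && c != NULL);
    core->client_unmaps = core->supports_extended &&
                          (c->flags & CLIENT_FLAG_INVALIDATE_UNMAPS) != 0;
    return true;
}

/* Invalidation path: skip the lock and both callbacks for opted-in clients. */
static void core_invalidate(struct core *core, struct client *c)
{
    assert(core != NULL && c != NULL);
    if (core->client_unmaps)
        return; /* the client unmaps on its own during invalidation */

    core->mapping_lock_taken++; /* old behavior: lock, then call back */
    c->dma_unmap_calls++;
    c->put_pages_calls++;
}

/* Register an opted-in client, invalidate once, report lock acquisitions. */
static int lock_acquisitions_during_invalidation(bool new_core)
{
    struct client c = { .flags = CLIENT_FLAG_INVALIDATE_UNMAPS };
    struct core core = { .supports_extended = new_core };

    if (!core_register(&core, &c))
        return -1;
    core_invalidate(&core, &c);
    printf("new_core=%d lock_taken=%d dma_unmap=%d put_pages=%d\n",
           new_core, core.mapping_lock_taken,
           c.dma_unmap_calls, c.put_pages_calls);
    return core.mapping_lock_taken;
}
```

In this model, `lock_acquisitions_during_invalidation(true)` returns 0 because the invalidation path never touches the lock. `lock_acquisitions_during_invalidation(false)` returns 1: registration still succeeds, but the old path, and with it the potential for lock inversion, remains.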

wlepera commented 2 years ago

The referenced MOFED bug 2696789 states that the issue was found in MOFED 5.1-0.6.6.0. Does the deadlock also exist in the MOFED 4.9 LTS release? Is this PR needed for systems with pre-470 NV drivers and MOFED 4.9?

ferasd commented 2 years ago

MOFED 4.9 LTS does not have this bug.