Mellanox / nv_peer_memory


concurrent invalidation and tear-down can trigger a bug #53

Open drossetti opened 4 years ago

drossetti commented 4 years ago

The condition below is benign (see #15), so the peer_err() call in it is incorrect and confusing. It should be removed.

nvidia_p2p_dma_unmap_pages
{
...
#if NV_DMA_MAPPING
        if (!nv_mem_context->dma_mapping) {
                peer_err("nv_get_p2p_free_callback -- invalid dma_mapping\n");

drossetti commented 11 months ago

Updating this issue after a long time. It turns out that the print is actually a signature of a bug in the way MRs are cleaned up under specific conditions. The other relevant diagnostic is the one below:

nv_mem nv_get_p2p_free_callback:144 nv_get_p2p_free_callback -- invalid page_table 

Both may be related to this issue. The concerning case is when those checks cannot mitigate the problem because the content of the memory is not zero, so the stale pointers pass the NULL checks.
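
To illustrate why the checks only help when the memory happens to be zero, here is a minimal user-space sketch. It is not the driver code: `struct mem_context`, `guard_rejects`, `teardown_zeroing`, `teardown_stale` and the `torn_down` flag are hypothetical names made up for this example, and the real tear-down path may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-MR context (not the driver's struct). */
struct mem_context {
    void *page_table;
    void *dma_mapping;
    bool  torn_down;   /* ground truth, tracked only for this demo */
};

/* Same style of guard as the "invalid page_table" / "invalid dma_mapping"
 * checks: it can only detect fields that are NULL. */
static bool guard_rejects(const struct mem_context *ctx)
{
    return ctx->page_table == NULL || ctx->dma_mapping == NULL;
}

/* Tear-down that clears the fields: a late callback is caught by the guard. */
static void teardown_zeroing(struct mem_context *ctx)
{
    ctx->page_table  = NULL;
    ctx->dma_mapping = NULL;
    ctx->torn_down   = true;
}

/* Tear-down that leaves non-zero content behind (assumed here to model
 * memory that was released but not cleared). */
static void teardown_stale(struct mem_context *ctx)
{
    ctx->torn_down = true;
}

/* True when a late callback would get past the guard and go on to use
 * state that has already been torn down. */
static bool late_callback_is_unsafe(const struct mem_context *ctx)
{
    return ctx->torn_down && !guard_rejects(ctx);
}
```

With two contexts that start with valid pointers, `late_callback_is_unsafe()` is false after `teardown_zeroing()` and true after `teardown_stale()`. In the second case nothing is printed at all, which is why the absence of the diagnostics does not mean the race did not happen.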