Mellanox / nv_peer_memory


kernel: nv_mem nv_get_p2p_free_callback:155 nv_get_p2p_free_callback -- invalid dma_mapping #111

Open susol-hjkim opened 1 year ago

susol-hjkim commented 1 year ago

Hello ~

This system rebooted unexpectedly. I saw the following log entries in /var/log/syslog before the reboot.

Dec 20 18:48:09 A100-42 kernel: nv_mem nv_get_p2p_free_callback:155 nv_get_p2p_free_callback -- invalid dma_mapping
Dec 20 18:48:09 A100-42 kernel: nv_mem nv_get_p2p_free_callback:155 nv_get_p2p_free_callback -- invalid dma_mapping

What do these log messages mean? Are they related to the unexpected reboot?

[ENV]
OS: Ubuntu 20.04
Kernel: 5.4.0-42-generic
H/W: Supermicro AS-4124GO-NART (like DGX A100)

[GPU : 8ea]
NVIDIA A100-SXM4-80GB
Driver Version: 470.103.01
CUDA Version: 11.4

[IB : 8ea]
Ofed ver: OFED-5.6.0.1.6.1
nv_peer_mem: v1.0
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.32.1010
    Hardware version: 0
    Node GUID: 0x08c0eb0300c8ff40
    System image GUID: 0x08c0eb0300c8ff40
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 173
        LMC: 0
        SM lid: 233
        Capability mask: 0x2651e848
        Port GUID: 0x08c0eb0300c8ff40
        Link layer: InfiniBand

Thanks ~

drossetti commented 1 year ago

Is this with github/nv_peer_mem or R470/nvidia-peermem? Also, OFED-5.6.0.1.6.1 is not even available for download anymore. Should you not move to a 5.x LTS release?
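
For anyone trying to answer the module question on their own machine, here is a minimal sketch that checks which of the two peer-memory modules is currently loaded. It assumes a Linux host where /proc/modules is readable and that the modules appear there as nv_peer_mem and nvidia_peermem (hyphens in module names show up as underscores). The helper name loaded_peer_modules is made up for illustration and is not part of either project.

```python
from pathlib import Path

# Assumed names as listed in /proc/modules (hyphens appear as underscores).
CANDIDATES = ("nv_peer_mem", "nvidia_peermem")


def loaded_peer_modules(proc_modules_text):
    """Return the candidate peer-memory modules found in /proc/modules text.

    Hypothetical helper: each non-empty line of /proc/modules starts with
    the module name, so only the first whitespace-separated field is used.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in CANDIDATES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        found = loaded_peer_modules(path.read_text())
        print("loaded:", ", ".join(found) if found else "neither module")
    else:
        print("/proc/modules not found (not a Linux host?)")
```

The parsing is kept separate from the file read so it can be tried against a saved copy of /proc/modules from the affected node.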

drossetti commented 9 months ago

This may be due to #53