[GPU : 8ea]
NVIDIA A100-SXM4-80GB
Driver Version : 470.103.01
CUDA Version : 11.4
[IB : 8ea]
Ofed ver : OFED-5.6.0.1.6.1
nv_peer_mem : v1.0
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.32.1010
Hardware version: 0
Node GUID: 0x08c0eb0300c8ff40
System image GUID: 0x08c0eb0300c8ff40
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 173
LMC: 0
SM lid: 233
Capability mask: 0x2651e848
Port GUID: 0x08c0eb0300c8ff40
Link layer: InfiniBand
Is this with the GitHub nv_peer_mem module or with the nvidia-peermem module that ships with the R470 driver?
Also, OFED-5.6.0.1.6.1 is no longer available for download. Have you considered moving to a 5.x LTS release?
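To answer the question about which peer-memory module is in use, something like the sketch below might help. It only reads /proc/modules and reports which of the two modules is loaded. It assumes the GitHub module shows up as nv_peer_mem and the driver-bundled one as nvidia_peermem (the kernel lists module names with underscores). The function name and the description strings are made up for illustration.

```python
from pathlib import Path

# Assumed module names as listed in /proc/modules (dashes appear as underscores).
PEER_MODULES = {
    "nv_peer_mem": "GitHub nv_peer_mem",
    "nvidia_peermem": "nvidia-peermem bundled with the NVIDIA driver",
}


def loaded_peer_modules(proc_modules_text):
    """Return a description for each known peer-memory module found in the text."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of every /proc/modules line is the module name.
        if fields and fields[0] in PEER_MODULES:
            found.append(PEER_MODULES[fields[0]])
    return found


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    print(loaded_peer_modules(text) or "no known peer-memory module loaded")
```

Running it on the affected node and posting the output should settle which of the two modules is actually loaded.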
Hello ~
This system rebooted unexpectedly. I found the following messages in /var/log/syslog just before the reboot.
Dec 20 18:48:09 A100-42 kernel: nv_mem nv_get_p2p_free_callback:155 nv_get_p2p_free_callback -- invalid dma_mapping
Dec 20 18:48:09 A100-42 kernel: nv_mem nv_get_p2p_free_callback:155 nv_get_p2p_free_callback -- invalid dma_mapping
What do these log messages mean? Are they related to the unexpected reboot?
[ENV]
OS : Ubuntu 20.04
Kernel : 5.4.0-42-generic
H/W : Supermicro AS-4124GO-NART (like DGX A100)
Thanks ~
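To check whether these messages really cluster just before the reboot, it may help to pull every nv_mem kernel line out of syslog together with its timestamp. The sketch below is one possible way to do that. It assumes the classic syslog layout shown in the post (month, day, time, host, "kernel:", then "nv_mem function:line message"), optionally with a bracketed kernel uptime stamp. The pattern and the function name are illustrative only, and the code extracts the messages without interpreting them.

```python
import re

# Assumed layout, taken from the lines quoted in the post:
# "Dec 20 18:48:09 A100-42 kernel: nv_mem <function>:<line> <message>"
# An optional "[  123.456789] " kernel uptime stamp after "kernel:" is tolerated.
NV_MEM_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s"
    r"(?:\[[^\]]*\]\s)?"
    r"nv_mem\s(?P<func>\w+):(?P<line>\d+)\s(?P<msg>.*)$"
)


def nv_mem_events(lines):
    """Return (timestamp, function, source line, message) for each nv_mem log line."""
    events = []
    for raw in lines:
        match = NV_MEM_LINE.match(raw.strip())
        if match:
            events.append(
                (match["stamp"], match["func"], int(match["line"]), match["msg"])
            )
    return events


if __name__ == "__main__":
    sample = [
        "Dec 20 18:48:09 A100-42 kernel: nv_mem nv_get_p2p_free_callback:155 "
        "nv_get_p2p_free_callback -- invalid dma_mapping",
    ]
    for event in nv_mem_events(sample):
        print(event)
```

On the real system the input could be the lines of /var/log/syslog (for example `open("/var/log/syslog", errors="replace")`), and the last timestamps returned can then be compared with the time of the reboot.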