aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

PyTorch Distributed Training crashes with "Cannot allocate memory (-12)" #141

Closed airsplay closed 1 year ago

airsplay commented 1 year ago

Hi All,

I am running PyTorch distributed training (code here) on 4 AWS A100 nodes with EFA. We got an error of

Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683

when launching the experiments. Using Ethernet gives correct results, so we think the issue is in EFA. Could you please take a look at it?
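The `-12` in the message reads like a negated errno value, which is how libfabric calls generally report errors. A minimal sketch, assuming a Linux host, that maps such a code back to its symbolic name (`describe_errno` is a hypothetical helper, not part of any library mentioned here):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value, accepting the negated form."""
    value = abs(code)
    # errno.errorcode maps numeric values to names such as 'ENOMEM'.
    name = errno.errorcode.get(value, "UNKNOWN")
    return f"{name}: {os.strerror(value)}"


if __name__ == "__main__":
    print(describe_errno(-12))
```

This only names the error; it does not say which allocation failed or why.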

Here are the library versions:

PyTorch 1.12.0
CUDA: 11.6
NCCL: 2.10.3
aws-ofi-nccl: 1.3.0
libfabric: libfabric.so.1.19.0
cuda-drivers-fabricmanager-510
cuda-drivers-510

Full log before crashes:

ip-10-216-179-193:964326:964326 [0] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964326:964326 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.6
ip-10-216-179-193:964328:964328 [2] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964332:964332 [6] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964331:964331 [5] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964327:964327 [1] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964330:964330 [4] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964329:964329 [3] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964333:964333 [7] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964328:964328 [2] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964331:964331 [5] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964330:964330 [4] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964332:964332 [6] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964327:964327 [1] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964329:964329 [3] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964333:964333 [7] NCCL INFO Using network AWS Libfabric
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Traceback (most recent call last):
  File "main.py", line 177, in <module>
    devjobs_multi_node_main(args)
  File "main.py", line 77, in devjobs_multi_node_main
    mp.spawn(main, nprocs=args.gpus, args=(args,))
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

Let me know if any other information or some other tests can help. Thanks!

Best, Hao

rashikakheria commented 1 year ago

Hey Hao,

Thanks for reporting the issue.

Could you provide the following information:

  1. Step by step instructions and the command line you use to run the training?
  2. Detailed logs collected with -x FI_LOG_LEVEL=warn -x FI_LOG_PROV=efa?
  3. Output of fi_info -p efa -t FI_EP_RDM and lspci -i efa?
  4. EFA installer version? The reason I ask is that libfabric has only released versions up to v1.15.1.

taruntandon88 commented 1 year ago

Hi Rashika,

I am working with Hao to troubleshoot this. Steps to reproduce the problem:

  1. Install EFA following https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html on 2 instances of p4d.24xlarge with kernel version 5.13.0-1023-aws and CUDA 11.6
  2. Download the distributed training code from https://github.com/pytorch/examples/tree/main/distributed/ddp
  3. Command on node1
    FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=1 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
    python <path_to_launch.py> \
    --nnode=2 --node_rank=0 --nproc_per_node=8 --master_addr="10.216.179.193" --master_port=35000 \
    <path_to_example.py> --local_world_size=8
  4. Command on node2
    FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=1 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
    python <path_to_launch.py> \
    --nnode=2 --node_rank=1 --nproc_per_node=8 --master_addr="10.216.179.193" --master_port=35000 \
    <path_to_example.py> --local_world_size=8

    Output is

    p-10-216-179-87:718885:719039 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718885:1658904522::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    libfabric:718885:1658904522::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718885:719039 [0] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718890:718890 [5] NCCL INFO cudaDriverVersion 11060
    ip-10-216-179-87:718892:718892 [7] NCCL INFO cudaDriverVersion 11060
    ip-10-216-179-87:718891:718891 [6] NCCL INFO cudaDriverVersion 11060
    ip-10-216-179-87:718889:718889 [4] NCCL INFO cudaDriverVersion 11060
    ip-10-216-179-87:718887:718887 [2] NCCL INFO cudaDriverVersion 11060
    ip-10-216-179-87:718890:718890 [5] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718890:718890 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718890:718890 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718890:719040 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718890:719040 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718890:719040 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718890:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    libfabric:718890:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718890:719040 [5] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718890:719040 [5] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718892:718892 [7] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718892:718892 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718892:718892 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718892:719041 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718892:719041 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718892:719041 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718892:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    libfabric:718892:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718892:719041 [7] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718892:719041 [7] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718891:718891 [6] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718891:718891 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718891:718891 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718891:719042 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718891:719042 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718891:719042 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718891:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718886:718886 [1] NCCL INFO cudaDriverVersion 11060
    libfabric:718891:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718891:719042 [6] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718891:719042 [6] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718889:718889 [4] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718889:718889 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718889:718889 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718889:719043 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718889:719043 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718889:719043 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718889:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718887:718887 [2] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718887:718887 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718887:718887 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718887:719044 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718887:719044 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718887:719044 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718887:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    libfabric:718889:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718889:719043 [4] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Using network AWS Libfabric
    libfabric:718887:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718887:719044 [2] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718886:718886 [1] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718886:718886 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718886:718886 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718886:719045 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718886:719045 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718886:719045 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718886:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718888:718888 [3] NCCL INFO cudaDriverVersion 11060
    libfabric:718886:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718886:719045 [1] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718886:719045 [1] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718888:718888 [3] NCCL INFO Bootstrap : Using ens32:10.216.179.87<0>
    ip-10-216-179-87:718888:718888 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
    ip-10-216-179-87:718888:718888 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
    ip-10-216-179-87:718888:719046 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
    ip-10-216-179-87:718888:719046 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
    ip-10-216-179-87:718888:719046 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
    libfabric:718888:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    libfabric:718888:1658904523::core:core:cuda_gdrcopy_hmem_init():191<warn> gdrcopy_dl_hmem_init failed!
    ip-10-216-179-87:718888:719046 [3] NCCL INFO NET/OFI Selected Provider is efa
    ip-10-216-179-87:718888:719046 [3] NCCL INFO Using network AWS Libfabric
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
    ip-10-216-179-87:718888:719046 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
    ip-10-216-179-87:718891:719042 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
    ip-10-216-179-87:718890:719040 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
    ip-10-216-179-87:718886:719045 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
    ip-10-216-179-87:718892:719041 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
    ip-10-216-179-87:718891:719042 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->6 [3] 15/-1/-1->14->13 [4] 15/-1/-1->14->13 [5] 15/6/-1->14->-1
    ip-10-216-179-87:718890:719040 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] -1/-1/-1->13->12 [3] 14/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] -1/-1/-1->13->12
    ip-10-216-179-87:718892:719041 [7] NCCL INFO Trees [0] 8/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] 8/-1/-1->15->14 [5] 8/-1/-1->15->14
    ip-10-216-179-87:718886:719045 [1] NCCL INFO Trees [0] -1/-1/-1->9->8 [1] 10/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] -1/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] 10/-1/-1->9->8
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Trees [0] 11/-1/-1->10->2 [1] 11/-1/-1->10->9 [2] 11/-1/-1->10->9 [3] 11/2/-1->10->-1 [4] 11/-1/-1->10->9 [5] 11/-1/-1->10->9
    ip-10-216-179-87:718888:719046 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] -1/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] -1/-1/-1->11->10 [5] 12/-1/-1->11->10
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->4 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11 [4] 13/4/-1->12->-1 [5] 13/-1/-1->12->11
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Trees [0] 9/-1/-1->8->15 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/-1/-1->8->15 [5] 9/-1/-1->8->15
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Channel 00 : 10[201c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Channel 02 : 12[901c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 00 : 8[101c0] -> 11[201d0] via P2P/IPC/read
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Channel 03 : 10[201c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Channel 05 : 12[901c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 03 : 8[101c0] -> 11[201d0] via P2P/IPC/read
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 02 : 8[101c0] -> 13[901d0] via P2P/IPC/read
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 05 : 8[101c0] -> 13[901d0] via P2P/IPC/read
    ip-10-216-179-87:718888:719046 [3] NCCL INFO Channel 00/0 : 11[201d0] -> 2[201c0] [send] via NET/AWS Libfabric/0/GDRDMA
    ip-10-216-179-87:718888:719046 [3] NCCL INFO Channel 03/0 : 11[201d0] -> 2[201c0] [send] via NET/AWS Libfabric/0/GDRDMA
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 01 : 8[101c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718885:719039 [0] NCCL INFO Channel 04 : 8[101c0] -> 15[a01d0] via P2P/IPC/read
    ip-10-216-179-87:718890:719040 [5] NCCL INFO Channel 01/0 : 13[901d0] -> 4[901c0] [send] via NET/AWS Libfabric/1/GDRDMA
    ip-10-216-179-87:718890:719040 [5] NCCL INFO Channel 04/0 : 13[901d0] -> 4[901c0] [send] via NET/AWS Libfabric/1/GDRDMA
    ip-10-216-179-87:718892:719041 [7] NCCL INFO Channel 02/0 : 15[a01d0] -> 6[a01c0] [send] via NET/AWS Libfabric/2/GDRDMA
    ip-10-216-179-87:718892:719041 [7] NCCL INFO Channel 05/0 : 15[a01d0] -> 6[a01c0] [send] via NET/AWS Libfabric/2/GDRDMA
    ip-10-216-179-87:718891:719042 [6] NCCL INFO Channel 02/0 : 7[a01d0] -> 14[a01c0] [receive] via NET/AWS Libfabric/2/GDRDMA
    ip-10-216-179-87:718891:719042 [6] NCCL INFO Channel 05/0 : 7[a01d0] -> 14[a01c0] [receive] via NET/AWS Libfabric/2/GDRDMA
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Channel 01/0 : 5[901d0] -> 12[901c0] [receive] via NET/AWS Libfabric/1/GDRDMA
    ip-10-216-179-87:718889:719043 [4] NCCL INFO Channel 04/0 : 5[901d0] -> 12[901c0] [receive] via NET/AWS Libfabric/1/GDRDMA
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Channel 00/0 : 3[201d0] -> 10[201c0] [receive] via NET/AWS Libfabric/0/GDRDMA
    ip-10-216-179-87:718887:719044 [2] NCCL INFO Channel 03/0 : 3[201d0] -> 10[201c0] [receive] via NET/AWS Libfabric/0/GDRDMA
    libfabric:718891:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718891:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718891:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718891:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718891:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
    libfabric:718889:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718889:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718889:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718889:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718889:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
    libfabric:718887:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718887:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718887:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718887:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718887:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
    libfabric:718892:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718892:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718890:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718890:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718888:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718888:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718892:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718892:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718892:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718892:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718892:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
    libfabric:718890:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718890:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718890:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718890:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718890:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
    libfabric:718888:1658904526::efa:mr:efa_mr_reg_impl():517<warn> Unable to register MR: Cannot allocate memory
    libfabric:718888:1658904526::efa:mr:efa_mr_regattr():632<warn> Unable to register MR: Cannot allocate memory
    libfabric:718888:1658904526::efa:cq:rxr_ep_grow_rx_pkt_pools():1585<warn> cannot allocate memory for EFA's RX packet pool. error: Cannot allocate memory
    libfabric:718888:1658904526::efa:eq:efa_eq_write_error():662<warn> Writing error Cannot allocate memory to EQ.
    libfabric:718888:1658904526::efa:eq:efa_eq_write_error():676<warn> Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12)
    Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
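The `efa_mr_reg_impl` warnings above show that memory registration is the step that fails. One general cause of ENOMEM during RDMA memory registration is a low locked-memory limit in the launching process. Whether that applies here is not established in this thread, so the following is only a sketch of how the limit could be inspected from the same environment that starts the training (`format_memlock` and `memlock_limits` are hypothetical helper names):

```python
import resource


def format_memlock(limit: int) -> str:
    """Render an RLIMIT_MEMLOCK value (bytes) in a readable form."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_memlock(soft), format_memlock(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

Running this inside the Python environment that launches the workers, not just in an interactive shell, matters because the two can inherit different limits.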

The output of $ fi_info -p efa -t FI_EP_RDM is

provider: efa
    fabric: efa
    domain: rdmap32s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap144s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap160s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
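To compare the number of EFA domains across nodes, the `fi_info` output above can be reduced to its `domain:` entries. A minimal sketch, assuming the output keeps the `key: value` layout shown (`efa_domains` is a hypothetical helper; the sample text is copied from the output above):

```python
def efa_domains(fi_info_output: str) -> list:
    """Collect the value of every 'domain:' line from fi_info output."""
    domains = []
    for line in fi_info_output.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "domain":
            domains.append(value.strip())
    return domains


SAMPLE = """\
provider: efa
    fabric: efa
    domain: rdmap32s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap144s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap160s27-rdm
    version: 116.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
"""

if __name__ == "__main__":
    print(len(efa_domains(SAMPLE)), efa_domains(SAMPLE))
```

For the output above this yields three domains.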

The output of lspci -i efa is

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
10:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
10:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
10:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
20:1b.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
20:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
80:1a.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1b.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1c.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1d.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1e.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1f.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
90:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
90:1b.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
90:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
90:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
a0:1b.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
a0:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller

The output of sudo cat /opt/amazon/efa_installed_packages is

# EFA installer version: 1.17.2
# Debug packages installed: no
# Packages installed:
efa-config_1.10_all efa-profile_1.5_all libfabric-aws-bin_1.16.0~amzn3.0_amd64 libfabric-aws-dev_1.16.0~amzn3.0_amd64 libfabric1-aws_1.16.0~amzn3.0_amd64 openmpi40-aws_4.1.4-1_amd64 ibacm_41.0-1_amd64 ibverbs-providers_41.0-1_amd64 ibverbs-utils_41.0-1_amd64 infiniband-diags_41.0-1_amd64 libibmad-dev_41.0-1_amd64 libibmad5_41.0-1_amd64 libibnetdisc-dev_41.0-1_amd64 libibnetdisc5_41.0-1_amd64 libibumad-dev_41.0-1_amd64 libibumad3_41.0-1_amd64 libibverbs-dev_41.0-1_amd64 libibverbs1_41.0-1_amd64 librdmacm-dev_41.0-1_amd64 librdmacm1_41.0-1_amd64 rdma-core_41.0-1_amd64 rdmacm-utils_41.0-1_amd64 efa_1.16.0-1.amzn1_amd64

It looks like even though EFA installer 1.17.2 is supposed to install libfabric 1.16.0, what actually gets installed in the folder /opt/amazon/efa/lib is libfabric.so.1.19.0.
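One way to check what the installer manifest itself claims is to parse the package list shown above. The snippet below is a minimal sketch, assuming every entry follows the `name_version_arch` pattern seen in that output; `parse_packages` is a hypothetical helper, and `sample` is a shortened copy of the list. Keep in mind that the numeric suffix of a shared-object file name is usually the library's ABI (libtool) version rather than its release version, so `libfabric.so.1.19.0` may not by itself contradict a 1.16.0 package.

```python
def parse_packages(line):
    """Map package name -> version for 'name_version_arch' entries."""
    packages = {}
    for entry in line.split():
        # Split from the right so the last two fields are version and arch.
        name, version, _arch = entry.rsplit("_", 2)
        packages[name] = version
    return packages


# Shortened copy of the package list from /opt/amazon/efa_installed_packages.
sample = (
    "efa-config_1.10_all efa-profile_1.5_all "
    "libfabric-aws-bin_1.16.0~amzn3.0_amd64 "
    "libfabric-aws-dev_1.16.0~amzn3.0_amd64 "
    "libfabric1-aws_1.16.0~amzn3.0_amd64 "
    "efa_1.16.0-1.amzn1_amd64"
)

# Keep only the libfabric packages and print their recorded versions.
libfabric_versions = {
    name: version
    for name, version in parse_packages(sample).items()
    if name.startswith("libfabric")
}
for name, version in sorted(libfabric_versions.items()):
    print(f"{name}: {version}")
```

All three libfabric packages in the manifest report 1.16.0~amzn3.0, which matches what the installer is expected to ship.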

taruntandon88 commented 1 year ago

One additional thing is that when we run nccl-tests across these multiple instances, we don't see any errors and everything passes.

/opt/amazon/openmpi/bin/mpirun \
    -x FI_PROVIDER="efa" \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    -x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO \
    --hostfile my-hosts -n 32 -N 8 \
    --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
...
...
..
ip-10-216-179-193:641226:641269 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:641226:641269 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:641220:641267 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:641220:641267 [1] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:641224:641268 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:641224:641268 [5] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:641226:641269 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:641226:641269 [7] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:641223:641223 [4] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
...
...
...
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-10-216-179-193:641221:641271 [2] NCCL INFO comm 0x7f115c000c60 rank 2 nranks 32 cudaDev 2 busId 201c0 - Init COMPLETE
ip-10-216-178-255:416486:416524 [7] NCCL INFO comm 0x7ff6bc000c60 rank 23 nranks 32 cudaDev 7 busId a01d0 - Init COMPLETE
ip-10-216-178-255:416483:416523 [5] NCCL INFO comm 0x7f1720000c60 rank 21 nranks 32 cudaDev 5 busId 901d0 - Init COMPLETE
ip-10-216-178-255:416479:416530 [1] NCCL INFO comm 0x7fb05c000c60 rank 17 nranks 32 cudaDev 1 busId 101d0 - Init COMPLETE
ip-10-216-178-255:416482:416525 [4] NCCL INFO comm 0x7f7998000c60 rank 20 nranks 32 cudaDev 4 busId 901c0 - Init COMPLETE
ip-10-216-178-255:416481:416522 [3] NCCL INFO comm 0x7f586c000c60 rank 19 nranks 32 cudaDev 3 busId 201d0 - Init COMPLETE
ip-10-216-178-255:416478:416527 [0] NCCL INFO comm 0x7ff8f8000c60 rank 16 nranks 32 cudaDev 0 busId 101c0 - Init COMPLETE
ip-10-216-178-255:416480:416529 [2] NCCL INFO comm 0x7fd66c000c60 rank 18 nranks 32 cudaDev 2 busId 201c0 - Init COMPLETE
ip-10-216-178-255:416484:416528 [6] NCCL INFO comm 0x7f191c000c60 rank 22 nranks 32 cudaDev 6 busId a01c0 - Init COMPLETE
           8             2     float     sum    167.0    0.00    0.00  2e-07    161.7    0.00    0.00  2e-07
          16             4     float     sum    162.9    0.00    0.00  2e-07    162.8    0.00    0.00  2e-07
          32             8     float     sum    163.2    0.00    0.00  2e-07    162.9    0.00    0.00  2e-07
          64            16     float     sum    163.7    0.00    0.00  2e-07    164.0    0.00    0.00  2e-07
         128            32     float     sum    164.3    0.00    0.00  2e-07    164.0    0.00    0.00  2e-07
         256            64     float     sum    165.6    0.00    0.00  2e-07    165.2    0.00    0.00  2e-07
         512           128     float     sum    169.3    0.00    0.01  2e-07    169.1    0.00    0.01  1e-07
        1024           256     float     sum    175.0    0.01    0.01  5e-07    175.3    0.01    0.01  5e-07
        2048           512     float     sum    182.2    0.01    0.02  5e-07    181.6    0.01    0.02  5e-07
        4096          1024     float     sum    193.1    0.02    0.04  5e-07    193.1    0.02    0.04  5e-07
        8192          2048     float     sum    210.0    0.04    0.08  5e-07    194.4    0.04    0.08  5e-07
       16384          4096     float     sum    212.0    0.08    0.15  5e-07    195.3    0.08    0.16  5e-07
       32768          8192     float     sum    391.8    0.08    0.16  5e-07    382.1    0.09    0.17  5e-07
       65536         16384     float     sum    392.4    0.17    0.32  7e-07    421.1    0.16    0.30  7e-07
      131072         32768     float     sum    406.8    0.32    0.62  7e-07    468.9    0.28    0.54  7e-07
      262144         65536     float     sum    413.8    0.63    1.23  7e-07    417.3    0.63    1.22  7e-07
      524288        131072     float     sum    518.2    1.01    1.96  7e-07    524.6    1.00    1.94  7e-07
     1048576        262144     float     sum    721.1    1.45    2.82  7e-07    719.4    1.46    2.82  7e-07
     2097152        524288     float     sum    808.1    2.60    5.03  7e-07    806.9    2.60    5.04  7e-07
     4194304       1048576     float     sum    956.5    4.39    8.50  7e-07    953.2    4.40    8.53  7e-07
     8388608       2097152     float     sum   1446.7    5.80   11.23  7e-07   1453.5    5.77   11.18  7e-07
    16777216       4194304     float     sum   2436.3    6.89   13.34  7e-07   2437.5    6.88   13.34  7e-07
    33554432       8388608     float     sum   3616.4    9.28   17.98  7e-07   3645.9    9.20   17.83  7e-07
    67108864      16777216     float     sum   5298.8   12.66   24.54  1e-06   5248.0   12.79   24.78  1e-06
   134217728      33554432     float     sum    11120   12.07   23.39  1e-06    11208   11.98   23.20  1e-06
   268435456      67108864     float     sum    19494   13.77   26.68  1e-06    19551   13.73   26.60  1e-06
   536870912     134217728     float     sum    37906   14.16   27.44  1e-06    37963   14.14   27.40  1e-06
  1073741824     268435456     float     sum    70783   15.17   29.39  1e-06    70790   15.17   29.39  1e-06
ip-10-216-179-193:641220:641220 [1] NCCL INFO comm 0x7faea0000c60 rank 1 nranks 32 cudaDev 1 busId 101d0 - Destroy COMPLETE
ip-10-216-179-119:416602:416602 [1] NCCL INFO comm 0x7f8328000c60 rank 25 nranks 32 cudaDev 1 busId 101d0 - Destroy COMPLETE
ip-10-216-179-193:641224:641224 [5] NCCL INFO comm 0x7f9a40000c60 rank 5 nranks 32 cudaDev 5 busId 901d0 - Destroy COMPLETE
ip-10-216-179-87:417132:417132 [1] NCCL INFO comm 0x7f6738000c60 rank 9 nranks 32 cudaDev 1 busId 101d0 - Destroy COMPLETE
ip-10-216-179-87:417136:417136 [5] NCCL INFO comm 0x7f0fd8000c60 rank 13 nranks 32 cudaDev 5 busId 901d0 - Destroy COMPLETE
ip-10-216-179-119:416606:416606 [5] NCCL INFO comm 0x7fb640000c60 rank 29 nranks 32 cudaDev 5 busId 901d0 - Destroy COMPLETE
ip-10-216-178-255:416479:416479 [1] NCCL INFO comm 0x7fb05c000c60 rank 17 nranks 32 cudaDev 1 busId 101d0 - Destroy COMPLETE
ip-10-216-179-193:641222:641222 [3] NCCL INFO comm 0x7f031c000c60 rank 3 nranks 32 cudaDev 3 busId 201d0 - Destroy COMPLETE
ip-10-216-179-193:641219:641219 [0] NCCL INFO comm 0x7f94f0000c60 rank 0 nranks 32 cudaDev 0 busId 101c0 - Destroy COMPLETE
ip-10-216-179-119:416601:416601 [0] NCCL INFO comm 0x7efd4c000c60 rank 24 nranks 32 cudaDev 0 busId 101c0 - Destroy COMPLETE
ip-10-216-179-119:416604:416604 [3] NCCL INFO comm 0x7f8a5c000c60 rank 27 nranks 32 cudaDev 3 busId 201d0 - Destroy COMPLETE
ip-10-216-179-119:416608:416608 [7] NCCL INFO comm 0x7f5330000c60 rank 31 nranks 32 cudaDev 7 busId a01d0 - Destroy COMPLETE
ip-10-216-179-193:641226:641226 [7] NCCL INFO comm 0x7ff018000c60 rank 7 nranks 32 cudaDev 7 busId a01d0 - Destroy COMPLETE
ip-10-216-178-255:416483:416483 [5] NCCL INFO comm 0x7f1720000c60 rank 21 nranks 32 cudaDev 5 busId 901d0 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.95596 
#
ip-10-216-179-193:641225:641225 [6] NCCL INFO comm 0x7fc938000c60 rank 6 nranks 32 cudaDev 6 busId a01c0 - Destroy COMPLETE
ip-10-216-178-255:416478:416478 [0] NCCL INFO comm 0x7ff8f8000c60 rank 16 nranks 32 cudaDev 0 busId 101c0 - Destroy COMPLETE
ip-10-216-179-87:417131:417131 [0] NCCL INFO comm 0x7f51ec000c60 rank 8 nranks 32 cudaDev 0 busId 101c0 - Destroy COMPLETE
ip-10-216-179-87:417134:417134 [3] NCCL INFO comm 0x7f40ec000c60 rank 11 nranks 32 cudaDev 3 busId 201d0 - Destroy COMPLETE
ip-10-216-179-119:416607:416607 [6] NCCL INFO comm 0x7f477c000c60 rank 30 nranks 32 cudaDev 6 busId a01c0 - Destroy COMPLETE
ip-10-216-178-255:416481:416481 [3] NCCL INFO comm 0x7f586c000c60 rank 19 nranks 32 cudaDev 3 busId 201d0 - Destroy COMPLETE
ip-10-216-179-119:416603:416603 [2] NCCL INFO comm 0x7fd5ac000c60 rank 26 nranks 32 cudaDev 2 busId 201c0 - Destroy COMPLETE
ip-10-216-179-193:641223:641223 [4] NCCL INFO comm 0x7f006c000c60 rank 4 nranks 32 cudaDev 4 busId 901c0 - Destroy COMPLETE
ip-10-216-179-193:641221:641221 [2] NCCL INFO comm 0x7f115c000c60 rank 2 nranks 32 cudaDev 2 busId 201c0 - Destroy COMPLETE
ip-10-216-179-87:417138:417138 [7] NCCL INFO comm 0x7f0a9c000c60 rank 15 nranks 32 cudaDev 7 busId a01d0 - Destroy COMPLETE
ip-10-216-178-255:416486:416486 [7] NCCL INFO comm 0x7ff6bc000c60 rank 23 nranks 32 cudaDev 7 busId a01d0 - Destroy COMPLETE
ip-10-216-179-119:416605:416605 [4] NCCL INFO comm 0x7fd69c000c60 rank 28 nranks 32 cudaDev 4 busId 901c0 - Destroy COMPLETE
ip-10-216-178-255:416480:416480 [2] NCCL INFO comm 0x7fd66c000c60 rank 18 nranks 32 cudaDev 2 busId 201c0 - Destroy COMPLETE
ip-10-216-178-255:416484:416484 [6] NCCL INFO comm 0x7f191c000c60 rank 22 nranks 32 cudaDev 6 busId a01c0 - Destroy COMPLETE
ip-10-216-179-87:417137:417137 [6] NCCL INFO comm 0x7fe7fc000c60 rank 14 nranks 32 cudaDev 6 busId a01c0 - Destroy COMPLETE
ip-10-216-179-87:417135:417135 [4] NCCL INFO comm 0x7f8d90000c60 rank 12 nranks 32 cudaDev 4 busId 901c0 - Destroy COMPLETE
ip-10-216-179-87:417133:417133 [2] NCCL INFO comm 0x7f05f0000c60 rank 10 nranks 32 cudaDev 2 busId 201c0 - Destroy COMPLETE
ip-10-216-178-255:416482:416482 [4] NCCL INFO comm 0x7f7998000c60 rank 20 nranks 32 cudaDev 4 busId 901c0 - Destroy COMPLETE
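
For anyone reading the table above, the algbw and busbw columns can be re-derived from the size and time columns. This is a minimal sketch, assuming the usual nccl-tests convention that all-reduce bus bandwidth is the algorithm bandwidth scaled by 2*(n-1)/n and that 1 GB means 1e9 bytes; the function names are hypothetical.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw, nranks):
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n."""
    return algbw * 2 * (nranks - 1) / nranks


# Last out-of-place row of the table: 1073741824 bytes in 70783 us, 32 ranks.
algbw = algbw_gbps(1073741824, 70783)
busbw = allreduce_busbw_gbps(algbw, 32)
print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")
```

With 32 ranks the scaling factor is 1.9375, which reproduces the 15.17 and 29.39 GB/s reported in the last row.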

rashikakheria commented 1 year ago

Is there a reason that you are using just 3 interfaces on a P4d rather than 4? Could you also provide the dmesg output from when the failure happens?

taruntandon88 commented 1 year ago

I've made the change to use 4 interfaces now (we also wanted to understand the performance impact of going from 1 to 4). Here is the new result of fi_info -p efa -t FI_EP_RDM:

provider: efa
    fabric: EFA-fe80::5a:feff:feea:aec1
    domain: rdmap16s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::fc:75ff:fe6c:e223
    domain: rdmap32s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::3e:52ff:fe5a:367d
    domain: rdmap144s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::90:eaff:fe36:3bed
    domain: rdmap160s27-rdm
    version: 111.10
    type: FI_EP_RDM
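
To confirm mechanically that all four interfaces show up, the fi_info output above can be scanned for EFA domains. This is only a sketch: `efa_domains` is a hypothetical helper, and `sample` is an abridged copy of the output that keeps just the `provider` and `domain` lines.

```python
def efa_domains(fi_info_text):
    """Return the domain names listed under 'provider: efa' blocks."""
    domains = []
    current_provider = None
    for raw in fi_info_text.splitlines():
        line = raw.strip()
        if line.startswith("provider:"):
            current_provider = line.split(":", 1)[1].strip()
        elif line.startswith("domain:") and current_provider == "efa":
            domains.append(line.split(":", 1)[1].strip())
    return domains


# Abridged copy of the fi_info output (provider and domain lines only).
sample = """\
provider: efa
    domain: rdmap16s27-rdm
provider: efa
    domain: rdmap32s27-rdm
provider: efa
    domain: rdmap144s27-rdm
provider: efa
    domain: rdmap160s27-rdm
"""

domains = efa_domains(sample)
print(f"{len(domains)} EFA domains: {', '.join(domains)}")
```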

The error we get is

ip-10-216-181-207:3966:3966 [0] NCCL INFO Bootstrap : Using ens32:10.216.181.207<0>
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1
libibverbs: Warning: RLIMIT_MEMLOCK is 100 bytes.
    This will severely limit memory registrations.
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-216-181-207:3966:3966 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-181-207:3966:3966 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.6
...
...
...
libfabric:3966:efa:mr:efa_mr_reg_impl():312<warn> Unable to register MR: Unknown error -12
libfabric:3966:efa:mr:efa_mr_regattr():413<warn> Unable to register MR: Cannot allocate memory
libfabric:3966:efa:ep_ctrl:rxr_ep_post_buf():288<warn> Unable to allocate rx_pkt_entry

ip-10-216-181-207:3966:4034 [0] create_nccl_ofi_component:765 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
ip-10-216-181-207:3966:4034 [0] NCCL INFO include/net.h:20 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO transport/net.cc:199 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO transport.cc:34 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO transport.cc:84 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO init.cc:778 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO init.cc:904 -> 2
ip-10-216-181-207:3966:4034 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
libfabric:3968:efa:mr:efa_mr_reg_impl():312<warn> Unable to register MR: Unknown error -12
libfabric:3968:efa:mr:efa_mr_regattr():413<warn> Unable to register MR: Cannot allocate memory
libfabric:3968:efa:ep_ctrl:rxr_ep_post_buf():288<warn> Unable to allocate rx_pkt_entry

ip-10-216-181-207:3968:4041 [2] create_nccl_ofi_component:765 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
ip-10-216-181-207:3968:4041 [2] NCCL INFO include/net.h:20 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO transport/net.cc:199 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO transport.cc:34 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO transport.cc:84 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO init.cc:778 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO init.cc:904 -> 2
ip-10-216-181-207:3968:4041 [2] NCCL INFO group.cc:72 -> 2 [Async thread]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3967 closing signal SIGTERM
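
The libibverbs warning in the log above reports an RLIMIT_MEMLOCK of 100 bytes, and the failures that follow are memory-registration errors with RC -12. Registering memory pins pages that count against the locked-memory limit, so the two are plausibly related, though that is an assumption rather than a confirmed diagnosis. A minimal sketch for checking the limit from Python, to be run in the same shell or container that launches the training processes (the helper names are hypothetical):

```python
import errno
import resource


def describe_limit(value):
    """Render an rlimit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK pair for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


soft, hard = memlock_limits()
print(f"RLIMIT_MEMLOCK soft={describe_limit(soft)} hard={describe_limit(hard)}")

# RC -12 in the log is the negated errno value 12.
print(f"errno 12 is {errno.errorcode[12]}")
```

If the soft limit printed here is as small as the one in the warning, raising the locked-memory limit for the launching environment would be the first thing to try.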

The dmesg output is as follows

[    0.000000] Linux version 5.13.0-1023-aws (buildd@lcy02-amd64-104) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #25~20.04.1-Ubuntu SMP Mon Apr 25 19:28:27 UTC 2022 (Ubuntu 5.13.0-1023.25~20.04.1-aws 5.13.19)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-1023-aws root=UUID=436cf32d-5e3d-46ca-b557-f870c8a25794 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai  
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.000000] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
[    0.000000] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
[    0.000000] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
[    0.000000] x86/fpu: xstate_offset[9]: 2560, xstate_sizes[9]:    8
[    0.000000] x86/fpu: Enabled xstate features 0x2ff, context size is 2568 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffe1fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffe2000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000e03fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000008ef7ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000008ef8000000-0x000000907fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000009080000000-0x0000011ef7ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000011ef8000000-0x000001207fffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Amazon EC2 p4d.24xlarge/, BIOS 1.0 10/16/2017
[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 47fb801001, primary cpu clock
[    0.000001] kvm-clock: using sched offset of 5868486377 cycles
[    0.000003] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000006] tsc: Detected 2999.998 MHz processor
[    0.000318] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000321] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000325] last_pfn = 0x11ef8000 max_arch_pfn = 0x400000000
[    0.000370] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.000377] last_pfn = 0x7ffe2 max_arch_pfn = 0x400000000
[    0.006564] Using GB pages for direct mapping
[    0.006768] RAMDISK: [mem 0x2c11b000-0x32084fff]
[    0.006784] ACPI: Early table checksum verification disabled
[    0.006790] ACPI: RSDP 0x00000000000F8F40 000014 (v00 AMAZON)
[    0.006796] ACPI: RSDT 0x000000007FFE7380 000044 (v01 AMAZON AMZNRSDT 00000001 AMZN 00000001)
[    0.006800] ACPI: FACP 0x000000007FFEFF80 000074 (v01 AMAZON AMZNFACP 00000001 AMZN 00000001)
[    0.006805] ACPI: DSDT 0x000000007FFE73D0 001F87 (v01 AMAZON AMZNDSDT 00000001 AMZN 00000001)
[    0.006808] ACPI: FACS 0x000000007FFEFF40 000040
[    0.006811] ACPI: SSDT 0x000000007FFEA190 005DA1 (v01 AMAZON AMZNSSDT 00000001 AMZN 00000001)
[    0.006813] ACPI: APIC 0x000000007FFE9CB0 000366 (v01 AMAZON AMZNAPIC 00000001 AMZN 00000001)
[    0.006816] ACPI: SRAT 0x000000007FFE9400 0006A8 (v01 AMAZON AMZNSRAT 00000001 AMZN 00000001)
[    0.006819] ACPI: SLIT 0x000000007FFE9390 00006C (v01 AMAZON AMZNSLIT 00000001 AMZN 00000001)
[    0.006822] ACPI: WAET 0x000000007FFE9360 000028 (v01 AMAZON AMZNWAET 00000001 AMZN 00000001)
[    0.006826] ACPI: HPET 0x00000000000C9000 000038 (v01 AMAZON AMZNHPET 00000001 AMZN 00000001)
[    0.006829] ACPI: SSDT 0x00000000000C9040 00007B (v01 AMAZON AMZNSSDT 00000001 AMZN 00000001)
[    0.006831] ACPI: Reserving FACP table memory at [mem 0x7ffeff80-0x7ffefff3]
[    0.006833] ACPI: Reserving DSDT table memory at [mem 0x7ffe73d0-0x7ffe9356]
[    0.006834] ACPI: Reserving FACS table memory at [mem 0x7ffeff40-0x7ffeff7f]
[    0.006834] ACPI: Reserving SSDT table memory at [mem 0x7ffea190-0x7ffeff30]
[    0.006835] ACPI: Reserving APIC table memory at [mem 0x7ffe9cb0-0x7ffea015]
[    0.006836] ACPI: Reserving SRAT table memory at [mem 0x7ffe9400-0x7ffe9aa7]
[    0.006837] ACPI: Reserving SLIT table memory at [mem 0x7ffe9390-0x7ffe93fb]
[    0.006838] ACPI: Reserving WAET table memory at [mem 0x7ffe9360-0x7ffe9387]
[    0.006839] ACPI: Reserving HPET table memory at [mem 0xc9000-0xc9037]
[    0.006840] ACPI: Reserving SSDT table memory at [mem 0xc9040-0xc90ba]
[    0.006858] ACPI: Local APIC address 0xfee00000
[    0.006922] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.006924] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.006926] SRAT: PXM 0 -> APIC 0x02 -> Node 0
[    0.006927] SRAT: PXM 0 -> APIC 0x03 -> Node 0
[    0.006929] SRAT: PXM 0 -> APIC 0x04 -> Node 0
[    0.006931] SRAT: PXM 0 -> APIC 0x05 -> Node 0
[    0.006933] SRAT: PXM 0 -> APIC 0x06 -> Node 0
[    0.006935] SRAT: PXM 0 -> APIC 0x07 -> Node 0
[    0.006936] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.006938] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.006940] SRAT: PXM 0 -> APIC 0x0a -> Node 0
[    0.006942] SRAT: PXM 0 -> APIC 0x0b -> Node 0
[    0.006944] SRAT: PXM 0 -> APIC 0x0c -> Node 0
[    0.006945] SRAT: PXM 0 -> APIC 0x0d -> Node 0
[    0.006947] SRAT: PXM 0 -> APIC 0x0e -> Node 0
[    0.006949] SRAT: PXM 0 -> APIC 0x0f -> Node 0
[    0.006951] SRAT: PXM 0 -> APIC 0x10 -> Node 0
[    0.006953] SRAT: PXM 0 -> APIC 0x11 -> Node 0
[    0.006954] SRAT: PXM 0 -> APIC 0x12 -> Node 0
[    0.006956] SRAT: PXM 0 -> APIC 0x13 -> Node 0
[    0.006958] SRAT: PXM 0 -> APIC 0x14 -> Node 0
[    0.006960] SRAT: PXM 0 -> APIC 0x15 -> Node 0
[    0.006961] SRAT: PXM 0 -> APIC 0x16 -> Node 0
[    0.006963] SRAT: PXM 0 -> APIC 0x17 -> Node 0
[    0.006965] SRAT: PXM 0 -> APIC 0x18 -> Node 0
[    0.006967] SRAT: PXM 0 -> APIC 0x19 -> Node 0
[    0.006969] SRAT: PXM 0 -> APIC 0x1a -> Node 0
[    0.006970] SRAT: PXM 0 -> APIC 0x1b -> Node 0
[    0.006972] SRAT: PXM 0 -> APIC 0x1c -> Node 0
[    0.006974] SRAT: PXM 0 -> APIC 0x1d -> Node 0
[    0.006976] SRAT: PXM 0 -> APIC 0x1e -> Node 0
[    0.006977] SRAT: PXM 0 -> APIC 0x1f -> Node 0
[    0.006979] SRAT: PXM 0 -> APIC 0x20 -> Node 0
[    0.006981] SRAT: PXM 0 -> APIC 0x21 -> Node 0
[    0.006983] SRAT: PXM 0 -> APIC 0x22 -> Node 0
[    0.006985] SRAT: PXM 0 -> APIC 0x23 -> Node 0
[    0.006986] SRAT: PXM 0 -> APIC 0x24 -> Node 0
[    0.006988] SRAT: PXM 0 -> APIC 0x25 -> Node 0
[    0.006990] SRAT: PXM 0 -> APIC 0x26 -> Node 0
[    0.006992] SRAT: PXM 0 -> APIC 0x27 -> Node 0
[    0.006993] SRAT: PXM 0 -> APIC 0x28 -> Node 0
[    0.006995] SRAT: PXM 0 -> APIC 0x29 -> Node 0
[    0.006997] SRAT: PXM 0 -> APIC 0x2a -> Node 0
[    0.006999] SRAT: PXM 0 -> APIC 0x2b -> Node 0
[    0.007001] SRAT: PXM 0 -> APIC 0x2c -> Node 0
[    0.007002] SRAT: PXM 0 -> APIC 0x2d -> Node 0
[    0.007004] SRAT: PXM 0 -> APIC 0x2e -> Node 0
[    0.007006] SRAT: PXM 0 -> APIC 0x2f -> Node 0
[    0.007008] SRAT: PXM 1 -> APIC 0x40 -> Node 1
[    0.007010] SRAT: PXM 1 -> APIC 0x41 -> Node 1
[    0.007011] SRAT: PXM 1 -> APIC 0x42 -> Node 1
[    0.007013] SRAT: PXM 1 -> APIC 0x43 -> Node 1
[    0.007015] SRAT: PXM 1 -> APIC 0x44 -> Node 1
[    0.007017] SRAT: PXM 1 -> APIC 0x45 -> Node 1
[    0.007018] SRAT: PXM 1 -> APIC 0x46 -> Node 1
[    0.007020] SRAT: PXM 1 -> APIC 0x47 -> Node 1
[    0.007022] SRAT: PXM 1 -> APIC 0x48 -> Node 1
[    0.007024] SRAT: PXM 1 -> APIC 0x49 -> Node 1
[    0.007026] SRAT: PXM 1 -> APIC 0x4a -> Node 1
[    0.007027] SRAT: PXM 1 -> APIC 0x4b -> Node 1
[    0.007029] SRAT: PXM 1 -> APIC 0x4c -> Node 1
[    0.007031] SRAT: PXM 1 -> APIC 0x4d -> Node 1
[    0.007033] SRAT: PXM 1 -> APIC 0x4e -> Node 1
[    0.007034] SRAT: PXM 1 -> APIC 0x4f -> Node 1
[    0.007036] SRAT: PXM 1 -> APIC 0x50 -> Node 1
[    0.007038] SRAT: PXM 1 -> APIC 0x51 -> Node 1
[    0.007040] SRAT: PXM 1 -> APIC 0x52 -> Node 1
[    0.007042] SRAT: PXM 1 -> APIC 0x53 -> Node 1
[    0.007043] SRAT: PXM 1 -> APIC 0x54 -> Node 1
[    0.007045] SRAT: PXM 1 -> APIC 0x55 -> Node 1
[    0.007047] SRAT: PXM 1 -> APIC 0x56 -> Node 1
[    0.007049] SRAT: PXM 1 -> APIC 0x57 -> Node 1
[    0.007050] SRAT: PXM 1 -> APIC 0x58 -> Node 1
[    0.007052] SRAT: PXM 1 -> APIC 0x59 -> Node 1
[    0.007054] SRAT: PXM 1 -> APIC 0x5a -> Node 1
[    0.007056] SRAT: PXM 1 -> APIC 0x5b -> Node 1
[    0.007057] SRAT: PXM 1 -> APIC 0x5c -> Node 1
[    0.007059] SRAT: PXM 1 -> APIC 0x5d -> Node 1
[    0.007061] SRAT: PXM 1 -> APIC 0x5e -> Node 1
[    0.007063] SRAT: PXM 1 -> APIC 0x5f -> Node 1
[    0.007065] SRAT: PXM 1 -> APIC 0x60 -> Node 1
[    0.007066] SRAT: PXM 1 -> APIC 0x61 -> Node 1
[    0.007068] SRAT: PXM 1 -> APIC 0x62 -> Node 1
[    0.007070] SRAT: PXM 1 -> APIC 0x63 -> Node 1
[    0.007072] SRAT: PXM 1 -> APIC 0x64 -> Node 1
[    0.007073] SRAT: PXM 1 -> APIC 0x65 -> Node 1
[    0.007075] SRAT: PXM 1 -> APIC 0x66 -> Node 1
[    0.007077] SRAT: PXM 1 -> APIC 0x67 -> Node 1
[    0.007079] SRAT: PXM 1 -> APIC 0x68 -> Node 1
[    0.007081] SRAT: PXM 1 -> APIC 0x69 -> Node 1
[    0.007082] SRAT: PXM 1 -> APIC 0x6a -> Node 1
[    0.007084] SRAT: PXM 1 -> APIC 0x6b -> Node 1
[    0.007086] SRAT: PXM 1 -> APIC 0x6c -> Node 1
[    0.007088] SRAT: PXM 1 -> APIC 0x6d -> Node 1
[    0.007089] SRAT: PXM 1 -> APIC 0x6e -> Node 1
[    0.007091] SRAT: PXM 1 -> APIC 0x6f -> Node 1
[    0.007095] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.007098] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x907fffffff]
[    0.007101] ACPI: SRAT: Node 1 PXM 1 [mem 0x9080000000-0x1207fffffff]
[    0.007108] NUMA: Initialized distance table, cnt=2
[    0.007110] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x907fffffff] -> [mem 0x00000000-0x907fffffff]
[    0.007118] NODE_DATA(0) allocated [mem 0x8ef7fd6000-0x8ef7ffffff]
[    0.007150] NODE_DATA(1) allocated [mem 0x11ef7fd3000-0x11ef7ffcfff]
[    0.008662] Zone ranges:
[    0.008663]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.008665]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.008667]   Normal   [mem 0x0000000100000000-0x0000011ef7ffffff]
[    0.008668]   Device   empty
[    0.008669] Movable zone start for each node
[    0.008672] Early memory node ranges
[    0.008672]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.008674]   node   0: [mem 0x0000000000100000-0x000000007ffe1fff]
[    0.008675]   node   0: [mem 0x0000000100000000-0x0000008ef7ffffff]
[    0.008706]   node   1: [mem 0x0000009080000000-0x0000011ef7ffffff]
[    0.008737] Initmem setup node 0 [mem 0x0000000000001000-0x0000008ef7ffffff]
[    0.008739] On node 0 totalpages: 149389184
[    0.008740]   DMA zone: 64 pages used for memmap
[    0.008741]   DMA zone: 158 pages reserved
[    0.008741]   DMA zone: 3998 pages, LIFO batch:0
[    0.008743]   DMA32 zone: 8128 pages used for memmap
[    0.008744]   DMA32 zone: 520162 pages, LIFO batch:63
[    0.008744]   Normal zone: 2326016 pages used for memmap
[    0.008745]   Normal zone: 148865024 pages, LIFO batch:63
[    0.008746] Initmem setup node 1 [mem 0x0000009080000000-0x0000011ef7ffffff]
[    0.008748] On node 1 totalpages: 149389312
[    0.008748]   Normal zone: 2334208 pages used for memmap
[    0.008749]   Normal zone: 149389312 pages, LIFO batch:63
[    0.008807] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.008832] On node 0, zone DMA: 97 pages in unavailable ranges
[    1.007176] On node 0, zone Normal: 30 pages in unavailable ranges
[    2.345979] ACPI: PM-Timer IO Port: 0xb008
[    2.345983] ACPI: Local APIC address 0xfee00000
[    2.345999] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    2.346032] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
[    2.346035] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    2.346037] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    2.346038] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    2.346039] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    2.346041] ACPI: IRQ5 used by override.
[    2.346042] ACPI: IRQ9 used by override.
[    2.346042] ACPI: IRQ10 used by override.
[    2.346043] ACPI: IRQ11 used by override.
[    2.346045] Using ACPI (MADT) for SMP configuration information
[    2.346046] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    2.346049] TSC deadline timer available
[    2.346050] smpboot: Allowing 96 CPUs, 0 hotplug CPUs
[    2.346080] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    2.346084] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    2.346086] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
[    2.346088] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[    2.346091] PM: hibernation: Registered nosave memory: [mem 0x7ffe2000-0x7fffffff]
[    2.346093] PM: hibernation: Registered nosave memory: [mem 0x80000000-0xdfffffff]
[    2.346095] PM: hibernation: Registered nosave memory: [mem 0xe0000000-0xe03fffff]
[    2.346097] PM: hibernation: Registered nosave memory: [mem 0xe0400000-0xfffbffff]
[    2.346099] PM: hibernation: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
[    2.346102] PM: hibernation: Registered nosave memory: [mem 0x8ef8000000-0x907fffffff]
[    2.346105] [mem 0x80000000-0xdfffffff] available for PCI devices
[    2.346107] Booting paravirtualized kernel on KVM
[    2.346110] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    2.346120] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:96 nr_cpu_ids:96 nr_node_ids:2
[    2.358753] percpu: Embedded 65 pages/cpu s229376 r8192 d28672 u524288
[    2.358763] pcpu-alloc: s229376 r8192 d28672 u524288 alloc=1*2097152
[    2.358765] pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07 
[    2.358770] pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15 
[    2.358775] pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23 
[    2.358779] pcpu-alloc: [0] 48 49 50 51 [0] 52 53 54 55 
[    2.358783] pcpu-alloc: [0] 56 57 58 59 [0] 60 61 62 63 
[    2.358787] pcpu-alloc: [0] 64 65 66 67 [0] 68 69 70 71 
[    2.358791] pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31 
[    2.358796] pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39 
[    2.358800] pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47 
[    2.358804] pcpu-alloc: [1] 72 73 74 75 [1] 76 77 78 79 
[    2.358808] pcpu-alloc: [1] 80 81 82 83 [1] 84 85 86 87 
[    2.358812] pcpu-alloc: [1] 88 89 90 91 [1] 92 93 94 95 
[    2.358851] kvm-guest: stealtime: cpu 0, msr 8cbc837080
[    2.358854] kvm-guest: PV spinlocks disabled, no host support
[    2.358861] Built 2 zonelists, mobility grouping on.  Total pages: 294109922
[    2.358863] Policy zone: Normal
[    2.358864] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-1023-aws root=UUID=436cf32d-5e3d-46ca-b557-f870c8a25794 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295
[    2.358931] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    2.358932] printk: log_buf_len total cpu_extra contributions: 389120 bytes
[    2.358933] printk: log_buf_len min size: 262144 bytes
[    2.360386] printk: log_buf_len: 1048576 bytes
[    2.360388] printk: early log buf free: 246704(94%)
[    2.361383] mem auto-init: stack:off, heap alloc:on, heap free:off
[    4.946776] Memory: 1176201756K/1195113984K available (16393K kernel code, 3519K rwdata, 10532K rodata, 2896K init, 5724K bss, 18911968K reserved, 0K cma-reserved)
[    4.946782] random: get_random_u64 called from __kmem_cache_create+0x2d/0x440 with crng_init=0
[    4.948036] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=96, Nodes=2
[    4.948085] Kernel/User page tables isolation: enabled
[    4.948154] ftrace: allocating 49236 entries in 193 pages
[    4.962908] ftrace: allocated 193 pages with 3 groups
[    4.963462] rcu: Hierarchical RCU implementation.
[    4.963463] rcu:     RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=96.
[    4.963465]  Rude variant of Tasks RCU enabled.
[    4.963466]  Tracing variant of Tasks RCU enabled.
[    4.963467] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    4.963468] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=96
[    4.966837] NR_IRQS: 524544, nr_irqs: 1192, preallocated irqs: 16
[    4.967182] random: crng done (trusting CPU's manufacturer)
[    5.089395] Console: colour VGA+ 80x25
[    5.864795] printk: console [tty1] enabled
[    6.106796] printk: console [ttyS0] enabled
[    6.110291] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    6.117688] ACPI: Core revision 20210331
[    6.121062] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 30580167144 ns
[    6.127630] APIC: Switch to symmetric I/O mode setup
[    6.131464] x2apic enabled
[    6.135105] Switched APIC routing to physical x2apic.
[    6.140433] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b3e43c8763, max_idle_ns: 440795360101 ns
[    6.147595] Calibrating delay loop (skipped) preset value.. 5999.99 BogoMIPS (lpj=11999992)
[    6.151594] pid_max: default: 98304 minimum: 768
[    6.151594] LSM: Security Framework initializing
[    6.151594] Yama: becoming mindful.
[    6.151594] AppArmor: AppArmor initialized
[    6.151594] Dentry cache hash table entries: 33554432 (order: 16, 268435456 bytes, vmalloc)
[    6.151594] Inode-cache hash table entries: 16777216 (order: 15, 134217728 bytes, vmalloc)
[    6.151594] Mount-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    6.151594] Mountpoint-cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    6.151594] process: using mwait in idle threads
[    6.151594] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
[    6.151594] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
[    6.151594] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    6.151594] Spectre V2 : Mitigation: Retpolines
[    6.151594] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    6.151594] Speculative Store Bypass: Vulnerable
[    6.151594] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[    6.151594] Freeing SMP alternatives memory: 40K
[    6.151594] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1262
[    6.151594] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz (family: 0x6, model: 0x55, stepping: 0x7)
[    6.151856] Performance Events: Skylake events, Intel PMU driver.
[    6.155598] ... version:                2
[    6.158782] ... bit width:              48
[    6.159596] ... generic registers:      4
[    6.162768] ... value mask:             0000ffffffffffff
[    6.163596] ... max period:             000000007fffffff
[    6.167248] ... fixed-purpose events:   3
[    6.167596] ... event mask:             000000070000000f
[    6.171451] rcu: Hierarchical SRCU implementation.
[    6.173244] smp: Bringing up secondary CPUs ...
[    6.175736] x86: Booting SMP configuration:
[    6.178976] .... node  #0, CPUs:        #1
[    1.184458] kvm-clock: cpu 1, msr 47fb801041, secondary cpu clock
[    6.181931] kvm-guest: stealtime: cpu 1, msr 8cbc8b7080
[    6.187719]   #2
[    1.184458] kvm-clock: cpu 2, msr 47fb801081, secondary cpu clock
[    6.189319] kvm-guest: stealtime: cpu 2, msr 8cbc937080
[    6.195712]   #3
[    1.184458] kvm-clock: cpu 3, msr 47fb8010c1, secondary cpu clock
[    6.200580] kvm-guest: stealtime: cpu 3, msr 8cbc9b7080
[    6.207719]   #4
[    1.184458] kvm-clock: cpu 4, msr 47fb801101, secondary cpu clock
[    6.209176] kvm-guest: stealtime: cpu 4, msr 8cbca37080
[    6.215714]   #5
[    1.184458] kvm-clock: cpu 5, msr 47fb801141, secondary cpu clock
[    6.217172] kvm-guest: stealtime: cpu 5, msr 8cbcab7080
[    6.223703]   #6
[    1.184458] kvm-clock: cpu 6, msr 47fb801181, secondary cpu clock
[    6.228145] kvm-guest: stealtime: cpu 6, msr 8cbcb37080
[    6.235719]   #7
[    1.184458] kvm-clock: cpu 7, msr 47fb8011c1, secondary cpu clock
[    6.237193] kvm-guest: stealtime: cpu 7, msr 8cbcbb7080
[    6.243709]   #8
[    1.184458] kvm-clock: cpu 8, msr 47fb801201, secondary cpu clock
[    6.245153] kvm-guest: stealtime: cpu 8, msr 8cbcc37080
[    6.251707]   #9
[    1.184458] kvm-clock: cpu 9, msr 47fb801241, secondary cpu clock
[    6.255779] kvm-guest: stealtime: cpu 9, msr 8cbccb7080
[    6.259715]  #10
[    1.184458] kvm-clock: cpu 10, msr 47fb801281, secondary cpu clock
[    6.265012] kvm-guest: stealtime: cpu 10, msr 8cbcd37080
[    6.271713]  #11
[    1.184458] kvm-clock: cpu 11, msr 47fb8012c1, secondary cpu clock
[    6.273167] kvm-guest: stealtime: cpu 11, msr 8cbcdb7080
[    6.279713]  #12
[    1.184458] kvm-clock: cpu 12, msr 47fb801301, secondary cpu clock
[    6.283631] kvm-guest: stealtime: cpu 12, msr 8cbce37080
[    6.287715]  #13
[    1.184458] kvm-clock: cpu 13, msr 47fb801341, secondary cpu clock
[    6.292934] kvm-guest: stealtime: cpu 13, msr 8cbceb7080
[    6.299712]  #14
[    1.184458] kvm-clock: cpu 14, msr 47fb801381, secondary cpu clock
[    6.301168] kvm-guest: stealtime: cpu 14, msr 8cbcf37080
[    6.307721]  #15
[    1.184458] kvm-clock: cpu 15, msr 47fb8013c1, secondary cpu clock
[    6.309168] kvm-guest: stealtime: cpu 15, msr 8cbcfb7080
[    6.315706]  #16
[    1.184458] kvm-clock: cpu 16, msr 47fb801401, secondary cpu clock
[    6.320837] kvm-guest: stealtime: cpu 16, msr 8cbd037080
[    6.327710]  #17
[    1.184458] kvm-clock: cpu 17, msr 47fb801441, secondary cpu clock
[    6.329159] kvm-guest: stealtime: cpu 17, msr 8cbd0b7080
[    6.335709]  #18
[    1.184458] kvm-clock: cpu 18, msr 47fb801481, secondary cpu clock
[    6.337164] kvm-guest: stealtime: cpu 18, msr 8cbd137080
[    6.343729]  #19
[    1.184458] kvm-clock: cpu 19, msr 47fb8014c1, secondary cpu clock
[    6.348587] kvm-guest: stealtime: cpu 19, msr 8cbd1b7080
[    6.355704]  #20
[    1.184458] kvm-clock: cpu 20, msr 47fb801501, secondary cpu clock
[    6.357149] kvm-guest: stealtime: cpu 20, msr 8cbd237080
[    6.363710]  #21
[    1.184458] kvm-clock: cpu 21, msr 47fb801541, secondary cpu clock
[    6.365146] kvm-guest: stealtime: cpu 21, msr 8cbd2b7080
[    6.371715]  #22
[    1.184458] kvm-clock: cpu 22, msr 47fb801581, secondary cpu clock
[    6.376391] kvm-guest: stealtime: cpu 22, msr 8cbd337080
[    6.383714]  #23
[    1.184458] kvm-clock: cpu 23, msr 47fb8015c1, secondary cpu clock
[    6.385172] kvm-guest: stealtime: cpu 23, msr 8cbd3b7080

[    6.487596] .... node  #1, CPUs:   #24
[    1.184458] kvm-clock: cpu 24, msr 47fb801601, secondary cpu clock
[    1.184458] smpboot: CPU 24 Converting physical 0 to logical die 1
[    6.492340] kvm-guest: stealtime: cpu 24, msr 11cbc837080
[    6.495750]  #25
[    1.184458] kvm-clock: cpu 25, msr 47fb801641, secondary cpu clock
[    6.497347] kvm-guest: stealtime: cpu 25, msr 11cbc8b7080
[    6.503734]  #26
[    1.184458] kvm-clock: cpu 26, msr 47fb801681, secondary cpu clock
[    6.505168] kvm-guest: stealtime: cpu 26, msr 11cbc937080
[    6.511756]  #27
[    1.184458] kvm-clock: cpu 27, msr 47fb8016c1, secondary cpu clock
[    6.516724] kvm-guest: stealtime: cpu 27, msr 11cbc9b7080
[    6.523744]  #28
[    1.184458] kvm-clock: cpu 28, msr 47fb801701, secondary cpu clock
[    6.525171] kvm-guest: stealtime: cpu 28, msr 11cbca37080
[    6.531742]  #29
[    1.184458] kvm-clock: cpu 29, msr 47fb801741, secondary cpu clock
[    6.533180] kvm-guest: stealtime: cpu 29, msr 11cbcab7080
[    6.539733]  #30
[    1.184458] kvm-clock: cpu 30, msr 47fb801781, secondary cpu clock
[    6.544902] kvm-guest: stealtime: cpu 30, msr 11cbcb37080
[    6.551749]  #31
[    1.184458] kvm-clock: cpu 31, msr 47fb8017c1, secondary cpu clock
[    6.553190] kvm-guest: stealtime: cpu 31, msr 11cbcbb7080
[    6.559742]  #32
[    1.184458] kvm-clock: cpu 32, msr 47fb801801, secondary cpu clock
[    6.563697] kvm-guest: stealtime: cpu 32, msr 11cbcc37080
[    6.571704]  #33
[    1.184458] kvm-clock: cpu 33, msr 47fb801841, secondary cpu clock
[    6.573125] kvm-guest: stealtime: cpu 33, msr 11cbccb7080
[    6.579740]  #34
[    1.184458] kvm-clock: cpu 34, msr 47fb801881, secondary cpu clock
[    6.581152] kvm-guest: stealtime: cpu 34, msr 11cbcd37080
[    6.587755]  #35
[    1.184458] kvm-clock: cpu 35, msr 47fb8018c1, secondary cpu clock
[    6.591853] kvm-guest: stealtime: cpu 35, msr 11cbcdb7080
[    6.599736]  #36
[    1.184458] kvm-clock: cpu 36, msr 47fb801901, secondary cpu clock
[    6.601170] kvm-guest: stealtime: cpu 36, msr 11cbce37080
[    6.607752]  #37
[    1.184458] kvm-clock: cpu 37, msr 47fb801941, secondary cpu clock
[    6.609208] kvm-guest: stealtime: cpu 37, msr 11cbceb7080
[    6.615749]  #38
[    1.184458] kvm-clock: cpu 38, msr 47fb801981, secondary cpu clock
[    6.620163] kvm-guest: stealtime: cpu 38, msr 11cbcf37080
[    6.627755]  #39
[    1.184458] kvm-clock: cpu 39, msr 47fb8019c1, secondary cpu clock
[    6.629165] kvm-guest: stealtime: cpu 39, msr 11cbcfb7080
[    6.635735]  #40
[    1.184458] kvm-clock: cpu 40, msr 47fb801a01, secondary cpu clock
[    6.637163] kvm-guest: stealtime: cpu 40, msr 11cbd037080
[    6.643735]  #41
[    1.184458] kvm-clock: cpu 41, msr 47fb801a41, secondary cpu clock
[    6.648335] kvm-guest: stealtime: cpu 41, msr 11cbd0b7080
[    6.655753]  #42
[    1.184458] kvm-clock: cpu 42, msr 47fb801a81, secondary cpu clock
[    6.657199] kvm-guest: stealtime: cpu 42, msr 11cbd137080
[    6.663742]  #43
[    1.184458] kvm-clock: cpu 43, msr 47fb801ac1, secondary cpu clock
[    6.665189] kvm-guest: stealtime: cpu 43, msr 11cbd1b7080
[    6.671749]  #44
[    1.184458] kvm-clock: cpu 44, msr 47fb801b01, secondary cpu clock
[    6.676589] kvm-guest: stealtime: cpu 44, msr 11cbd237080
[    6.683738]  #45
[    1.184458] kvm-clock: cpu 45, msr 47fb801b41, secondary cpu clock
[    6.685134] kvm-guest: stealtime: cpu 45, msr 11cbd2b7080
[    6.691738]  #46
[    1.184458] kvm-clock: cpu 46, msr 47fb801b81, secondary cpu clock
[    6.693179] kvm-guest: stealtime: cpu 46, msr 11cbd337080
[    6.699756]  #47
[    1.184458] kvm-clock: cpu 47, msr 47fb801bc1, secondary cpu clock
[    6.704808] kvm-guest: stealtime: cpu 47, msr 11cbd3b7080

[    6.713958] .... node  #0, CPUs:   #48
[    1.184458] kvm-clock: cpu 48, msr 47fb801c01, secondary cpu clock
[    6.717011] kvm-guest: stealtime: cpu 48, msr 8cbd437080
[    6.723896] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    6.727735]  #49
[    1.184458] kvm-clock: cpu 49, msr 47fb801c41, secondary cpu clock
[    6.729173] kvm-guest: stealtime: cpu 49, msr 8cbd4b7080
[    6.735709]  #50
[    1.184458] kvm-clock: cpu 50, msr 47fb801c81, secondary cpu clock
[    6.737118] kvm-guest: stealtime: cpu 50, msr 8cbd537080
[    6.743712]  #51
[    1.184458] kvm-clock: cpu 51, msr 47fb801cc1, secondary cpu clock
[    6.747647] kvm-guest: stealtime: cpu 51, msr 8cbd5b7080
[    6.751729]  #52
[    1.184458] kvm-clock: cpu 52, msr 47fb801d01, secondary cpu clock
[    6.756897] kvm-guest: stealtime: cpu 52, msr 8cbd637080
[    6.763716]  #53
[    1.184458] kvm-clock: cpu 53, msr 47fb801d41, secondary cpu clock
[    6.765148] kvm-guest: stealtime: cpu 53, msr 8cbd6b7080
[    6.771715]  #54
[    1.184458] kvm-clock: cpu 54, msr 47fb801d81, secondary cpu clock
[    6.773154] kvm-guest: stealtime: cpu 54, msr 8cbd737080
[    6.779717]  #55
[    1.184458] kvm-clock: cpu 55, msr 47fb801dc1, secondary cpu clock
[    6.784657] kvm-guest: stealtime: cpu 55, msr 8cbd7b7080
[    6.791718]  #56
[    1.184458] kvm-clock: cpu 56, msr 47fb801e01, secondary cpu clock
[    6.793116] kvm-guest: stealtime: cpu 56, msr 8cbd837080
[    6.799719]  #57
[    1.184458] kvm-clock: cpu 57, msr 47fb801e41, secondary cpu clock
[    6.801131] kvm-guest: stealtime: cpu 57, msr 8cbd8b7080
[    6.807710]  #58
[    1.184458] kvm-clock: cpu 58, msr 47fb801e81, secondary cpu clock
[    6.812422] kvm-guest: stealtime: cpu 58, msr 8cbd937080
[    6.819721]  #59
[    1.184458] kvm-clock: cpu 59, msr 47fb801ec1, secondary cpu clock
[    6.821156] kvm-guest: stealtime: cpu 59, msr 8cbd9b7080
[    6.827712]  #60
[    1.184458] kvm-clock: cpu 60, msr 47fb801f01, secondary cpu clock
[    6.829152] kvm-guest: stealtime: cpu 60, msr 8cbda37080
[    6.835719]  #61
[    1.184458] kvm-clock: cpu 61, msr 47fb801f41, secondary cpu clock
[    6.840303] kvm-guest: stealtime: cpu 61, msr 8cbdab7080
[    6.847713]  #62
[    1.184458] kvm-clock: cpu 62, msr 47fb801f81, secondary cpu clock
[    6.849144] kvm-guest: stealtime: cpu 62, msr 8cbdb37080
[    6.855713]  #63
[    1.184458] kvm-clock: cpu 63, msr 47fb801fc1, secondary cpu clock
[    6.857110] kvm-guest: stealtime: cpu 63, msr 8cbdbb7080
[    6.863714]  #64
[    1.184458] kvm-clock: cpu 64, msr 118f68001, secondary cpu clock
[    6.867975] kvm-guest: stealtime: cpu 64, msr 8cbdc37080
[    6.875709]  #65
[    1.184458] kvm-clock: cpu 65, msr 118f68041, secondary cpu clock
[    6.877120] kvm-guest: stealtime: cpu 65, msr 8cbdcb7080
[    6.883722]  #66
[    1.184458] kvm-clock: cpu 66, msr 118f68081, secondary cpu clock
[    6.885145] kvm-guest: stealtime: cpu 66, msr 8cbdd37080
[    6.891721]  #67
[    1.184458] kvm-clock: cpu 67, msr 118f680c1, secondary cpu clock
[    6.893150] kvm-guest: stealtime: cpu 67, msr 8cbddb7080
[    6.899709]  #68
[    1.184458] kvm-clock: cpu 68, msr 118f68101, secondary cpu clock
[    6.904692] kvm-guest: stealtime: cpu 68, msr 8cbde37080
[    6.911721]  #69
[    1.184458] kvm-clock: cpu 69, msr 118f68141, secondary cpu clock
[    6.913130] kvm-guest: stealtime: cpu 69, msr 8cbdeb7080
[    6.919717]  #70
[    1.184458] kvm-clock: cpu 70, msr 118f68181, secondary cpu clock
[    6.921118] kvm-guest: stealtime: cpu 70, msr 8cbdf37080
[    6.927714]  #71
[    1.184458] kvm-clock: cpu 71, msr 118f681c1, secondary cpu clock
[    6.932303] kvm-guest: stealtime: cpu 71, msr 8cbdfb7080

[    6.941980] .... node  #1, CPUs:   #72
[    1.184458] kvm-clock: cpu 72, msr 118f68201, secondary cpu clock
[    6.944128] kvm-guest: stealtime: cpu 72, msr 11cbd437080
[    6.951750]  #73
[    1.184458] kvm-clock: cpu 73, msr 118f68241, secondary cpu clock
[    6.953182] kvm-guest: stealtime: cpu 73, msr 11cbd4b7080
[    6.959742]  #74
[    1.184458] kvm-clock: cpu 74, msr 118f68281, secondary cpu clock
[    6.961148] kvm-guest: stealtime: cpu 74, msr 11cbd537080
[    6.967755]  #75
[    1.184458] kvm-clock: cpu 75, msr 118f682c1, secondary cpu clock
[    6.972284] kvm-guest: stealtime: cpu 75, msr 11cbd5b7080
[    6.979797]  #76
[    1.184458] kvm-clock: cpu 76, msr 118f68301, secondary cpu clock
[    6.981191] kvm-guest: stealtime: cpu 76, msr 11cbd637080
[    6.987756]  #77
[    1.184458] kvm-clock: cpu 77, msr 118f68341, secondary cpu clock
[    6.989182] kvm-guest: stealtime: cpu 77, msr 11cbd6b7080
[    6.995747]  #78
[    1.184458] kvm-clock: cpu 78, msr 118f68381, secondary cpu clock
[    7.000611] kvm-guest: stealtime: cpu 78, msr 11cbd737080
[    7.007755]  #79
[    1.184458] kvm-clock: cpu 79, msr 118f683c1, secondary cpu clock
[    7.009177] kvm-guest: stealtime: cpu 79, msr 11cbd7b7080
[    7.015743]  #80
[    1.184458] kvm-clock: cpu 80, msr 118f68401, secondary cpu clock
[    7.017134] kvm-guest: stealtime: cpu 80, msr 11cbd837080
[    7.023751]  #81
[    1.184458] kvm-clock: cpu 81, msr 118f68441, secondary cpu clock
[    7.028730] kvm-guest: stealtime: cpu 81, msr 11cbd8b7080
[    7.035744]  #82
[    1.184458] kvm-clock: cpu 82, msr 118f68481, secondary cpu clock
[    7.037130] kvm-guest: stealtime: cpu 82, msr 11cbd937080
[    7.043751]  #83
[    1.184458] kvm-clock: cpu 83, msr 118f684c1, secondary cpu clock
[    7.045155] kvm-guest: stealtime: cpu 83, msr 11cbd9b7080
[    7.051754]  #84
[    1.184458] kvm-clock: cpu 84, msr 118f68501, secondary cpu clock
[    7.056747] kvm-guest: stealtime: cpu 84, msr 11cbda37080
[    7.063749]  #85
[    1.184458] kvm-clock: cpu 85, msr 118f68541, secondary cpu clock
[    7.065179] kvm-guest: stealtime: cpu 85, msr 11cbdab7080
[    7.071738]  #86
[    1.184458] kvm-clock: cpu 86, msr 118f68581, secondary cpu clock
[    7.073164] kvm-guest: stealtime: cpu 86, msr 11cbdb37080
[    7.079764]  #87
[    1.184458] kvm-clock: cpu 87, msr 118f685c1, secondary cpu clock
[    7.084926] kvm-guest: stealtime: cpu 87, msr 11cbdbb7080
[    7.091739]  #88
[    1.184458] kvm-clock: cpu 88, msr 118f68601, secondary cpu clock
[    7.093156] kvm-guest: stealtime: cpu 88, msr 11cbdc37080
[    7.099769]  #89
[    1.184458] kvm-clock: cpu 89, msr 118f68641, secondary cpu clock
[    7.103626] kvm-guest: stealtime: cpu 89, msr 11cbdcb7080
[    7.111603]  #90
[    1.184458] kvm-clock: cpu 90, msr 118f68681, secondary cpu clock
[    7.113032] kvm-guest: stealtime: cpu 90, msr 11cbdd37080
[    7.119747]  #91
[    1.184458] kvm-clock: cpu 91, msr 118f686c1, secondary cpu clock
[    7.121182] kvm-guest: stealtime: cpu 91, msr 11cbddb7080
[    7.127749]  #92
[    1.184458] kvm-clock: cpu 92, msr 118f68701, secondary cpu clock
[    7.131767] kvm-guest: stealtime: cpu 92, msr 11cbde37080
[    7.139710]  #93
[    1.184458] kvm-clock: cpu 93, msr 118f68741, secondary cpu clock
[    7.141114] kvm-guest: stealtime: cpu 93, msr 11cbdeb7080
[    7.147757]  #94
[    1.184458] kvm-clock: cpu 94, msr 118f68781, secondary cpu clock
[    7.149198] kvm-guest: stealtime: cpu 94, msr 11cbdf37080
[    7.155745]  #95
[    1.184458] kvm-clock: cpu 95, msr 118f687c1, secondary cpu clock
[    7.160118] kvm-guest: stealtime: cpu 95, msr 11cbdfb7080
[    7.167754] smp: Brought up 2 nodes, 96 CPUs
[    7.170993] smpboot: Max logical packages: 2
[    7.171599] smpboot: Total of 96 processors activated (575999.61 BogoMIPS)
[    7.231219] devtmpfs: initialized
[    7.231634] x86/mm: Memory block size: 128MB
[    7.295189] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    7.296044] futex hash table entries: 32768 (order: 9, 2097152 bytes, vmalloc)
[    7.299979] pinctrl core: initialized pinctrl subsystem
[    7.303845] PM: RTC time: 07:10:05, date: 2022-07-28
[    7.307793] NET: Registered protocol family 16
[    7.312026] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[    7.316428] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[    7.320416] DMA: preallocated 4096 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    7.323605] audit: initializing netlink subsys (disabled)
[    7.327320] audit: type=2000 audit(1658992205.187:1): state=initialized audit_enabled=0 res=1
[    7.327320] thermal_sys: Registered thermal governor 'fair_share'
[    7.327597] thermal_sys: Registered thermal governor 'bang_bang'
[    7.331563] thermal_sys: Registered thermal governor 'step_wise'
[    7.331596] thermal_sys: Registered thermal governor 'user_space'
[    7.335552] thermal_sys: Registered thermal governor 'power_allocator'
[    7.335600] EISA bus registered
[    7.342473] cpuidle: using governor ladder
[    7.343613] cpuidle: using governor menu
[    7.347792] ACPI: bus type PCI registered
[    7.350981] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    7.351895] PCI: Using configuration type 1 for base access
[    7.369483] Kprobes globally optimized
[    7.371675] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    7.375599] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    7.379761] ACPI: Added _OSI(Module Device)
[    7.383598] ACPI: Added _OSI(Processor Device)
[    7.386963] ACPI: Added _OSI(3.0 _SCP Extensions)
[    7.387612] ACPI: Added _OSI(Processor Aggregator Device)
[    7.391328] ACPI: Added _OSI(Linux-Dell-Video)
[    7.391596] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    7.395266] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    7.400856] ACPI: 3 ACPI AML tables successfully acquired and loaded
[    7.410818] ACPI: Interpreter enabled
[    7.411604] ACPI: (supports S0 S4 S5)
[    7.414638] ACPI: Using IOAPIC for interrupt routing
[    7.415605] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    7.420107] ACPI: Enabled 16 GPEs in block 00 to 0F
[    7.436179] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00])
[    7.439601] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    7.443605] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    7.447894] acpiphp: Slot [3] registered
[    7.451040] acpiphp: Slot [4] registered
[    7.451609] acpiphp: Slot [5] registered
[    7.454737] acpiphp: Slot [6] registered
[    7.455609] acpiphp: Slot [7] registered
[    7.458771] acpiphp: Slot [8] registered
[    7.459609] acpiphp: Slot [9] registered
[    7.462785] acpiphp: Slot [10] registered
[    7.463610] acpiphp: Slot [11] registered
[    7.466789] acpiphp: Slot [12] registered
[    7.467609] acpiphp: Slot [13] registered
[    7.470790] acpiphp: Slot [14] registered
[    7.471610] acpiphp: Slot [15] registered
[    7.474831] acpiphp: Slot [16] registered
[    7.475609] acpiphp: Slot [17] registered
[    7.478803] acpiphp: Slot [18] registered
[    7.479609] acpiphp: Slot [19] registered
[    7.482811] acpiphp: Slot [20] registered
[    7.483610] acpiphp: Slot [21] registered
[    7.486807] acpiphp: Slot [22] registered
[    7.487609] acpiphp: Slot [23] registered
[    7.490790] acpiphp: Slot [24] registered
[    7.491609] acpiphp: Slot [25] registered
[    7.494767] acpiphp: Slot [26] registered
[    7.495612] acpiphp: Slot [27] registered
[    7.498798] acpiphp: Slot [28] registered
[    7.499609] acpiphp: Slot [29] registered
[    7.502791] acpiphp: Slot [30] registered
[    7.503609] acpiphp: Slot [31] registered
[    7.506790] PCI host bridge to bus 0000:00
[    7.507597] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[    7.511597] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    7.515597] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    7.519596] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    7.523597] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xc3ffffff window]
[    7.527627] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
[    7.532163] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
[    7.536596] pci 0000:00:01.3: [8086:7113] type 00 class 0x000000
[    7.540347] pci 0000:00:01.3: quirk: [io  0xb000-0xb03f] claimed by PIIX4 ACPI
[    7.543613] pci 0000:00:01.3: quirk: [io  0xb100-0xb10f] claimed by PIIX4 SMB
[    7.547645] pci 0000:00:01.3: PIIX4 devres E PIO at fff0-ffff
[    7.551498] pci 0000:00:01.3: PIIX4 devres F MMIO at ffc00000-ffffffff
[    7.551612] pci 0000:00:01.3: PIIX4 devres G PIO at fff0-ffff
[    7.555419] pci 0000:00:01.3: PIIX4 devres H MMIO at ffc00000-ffffffff
[    7.555612] pci 0000:00:01.3: PIIX4 devres I PIO at fff0-ffff
[    7.559441] pci 0000:00:01.3: PIIX4 devres J PIO at fff0-ffff
[    7.559597] pci 0000:00:01.3: quirk_piix4_acpi+0x0/0x170 took 19531 usecs
[    7.564310] pci 0000:00:03.0: [1d0f:1111] type 00 class 0x030000
[    7.568531] pci 0000:00:03.0: reg 0x10: [mem 0xc2000000-0xc23fffff pref]
[    7.575573] pci 0000:00:03.0: reg 0x30: [mem 0xc0000000-0xc000ffff pref]
[    7.575674] pci 0000:00:03.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    7.580074] pci 0000:00:04.0: [1d0f:8061] type 00 class 0x010802
[    7.585056] pci 0000:00:04.0: reg 0x10: [mem 0xc0010000-0xc0013fff]
[    7.600982] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
[    7.604847] pci 0000:00:1f.0: reg 0x10: [mem 0xc0014000-0xc0017fff]
[    7.614873] ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 10])
[    7.615603] acpi PNP0A03:01: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    7.619612] acpi PNP0A03:01: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    7.623847] acpiphp: Slot [32] registered
[    7.627002] acpiphp: Slot [33] registered
[    7.627609] acpiphp: Slot [34] registered
[    7.630810] acpiphp: Slot [35] registered
[    7.631609] acpiphp: Slot [36] registered
[    7.634797] acpiphp: Slot [37] registered
[    7.635609] acpiphp: Slot [38] registered
[    7.638812] acpiphp: Slot [39] registered
[    7.639609] acpiphp: Slot [40] registered
[    7.642802] acpiphp: Slot [41] registered
[    7.643610] acpiphp: Slot [42] registered
[    7.646810] acpiphp: Slot [43] registered
[    7.647608] acpiphp: Slot [44] registered
[    7.650792] acpiphp: Slot [45] registered
[    7.651611] acpiphp: Slot [46] registered
[    7.654822] acpiphp: Slot [47] registered
[    7.655609] acpiphp: Slot [48] registered
[    7.658790] acpiphp: Slot [49] registered
[    7.659608] acpiphp: Slot [50] registered
[    7.662842] acpiphp: Slot [51] registered
[    7.663611] acpiphp: Slot [52] registered
[    7.666774] acpiphp: Slot [53] registered
[    7.667608] acpiphp: Slot [54] registered
[    7.670820] acpiphp: Slot [55] registered
[    7.671609] acpiphp: Slot [56] registered
[    7.674787] acpiphp: Slot [57] registered
[    7.675609] acpiphp: Slot [58] registered
[    7.678798] acpiphp: Slot [59] registered
[    7.679609] acpiphp: Slot [60] registered
[    7.682790] acpiphp: Slot [61] registered
[    7.683609] acpiphp: Slot [62] registered
[    7.686803] acpiphp: Slot [63] registered
[    7.687605] PCI host bridge to bus 0000:10
[    7.690800] pci_bus 0000:10: root bus resource [bus 10]
[    7.691597] pci_bus 0000:10: root bus resource [mem 0xc4000000-0xc7ffffff window]
[    7.695597] pci_bus 0000:10: root bus resource [mem 0x39c000000000-0x39f4e80fffff window]
[    7.699678] pci 0000:10:00.0: [1d0f:ec20] type 00 class 0x020000
[    7.705879] pci 0000:10:00.0: reg 0x10: [mem 0xc6800000-0xc6803fff]
[    7.711580] pci 0000:10:00.0: reg 0x18: [mem 0x39d417c00000-0x39d417ffffff 64bit pref]
[    7.716074] pci 0000:10:00.0: enabling Extended Tags
[    7.720302] pci 0000:10:01.0: [1d0f:ec20] type 00 class 0x020000
[    7.725591] pci 0000:10:01.0: reg 0x10: [mem 0xc6804000-0xc6807fff]
[    7.731397] pci 0000:10:01.0: reg 0x18: [mem 0x39d417800000-0x39d417bfffff 64bit pref]
[    7.736062] pci 0000:10:01.0: enabling Extended Tags
[    7.746020] pci 0000:10:1b.0: [1d0f:efa0] type 00 class 0x020000
[    7.750098] pci 0000:10:1b.0: reg 0x10: [mem 0xc6808000-0xc680bfff]
[    7.756089] pci 0000:10:1b.0: reg 0x18: [mem 0x39d418000000-0x39d41fffffff 64bit pref]
[    7.762030] pci 0000:10:1b.0: reg 0x20: [mem 0xc6000000-0xc67fffff]
[    7.768074] pci 0000:10:1b.0: enabling Extended Tags
[    7.772370] pci 0000:10:1c.0: [10de:20b0] type 00 class 0x030200
[    7.789325] pci 0000:10:1c.0: reg 0x10: [mem 0xc4000000-0xc4ffffff]
[    7.797315] pci 0000:10:1c.0: reg 0x14: [mem 0x39e000000000-0x39efffffffff 64bit pref]
[    7.805304] pci 0000:10:1c.0: reg 0x1c: [mem 0x39f420000000-0x39f421ffffff 64bit pref]
[    7.817393] pci 0000:10:1c.0: Enabling HDA controller
[    7.820050] pci 0000:10:1c.0: PME# supported from D0 D3hot
[    7.824266] pci 0000:10:1d.0: [10de:20b0] type 00 class 0x030200
[    7.841300] pci 0000:10:1d.0: reg 0x10: [mem 0xc5000000-0xc5ffffff]
[    7.849317] pci 0000:10:1d.0: reg 0x14: [mem 0x39c000000000-0x39cfffffffff 64bit pref]
[    7.857292] pci 0000:10:1d.0: reg 0x1c: [mem 0x39d420000000-0x39d421ffffff 64bit pref]
[    7.869422] pci 0000:10:1d.0: Enabling HDA controller
[    7.872054] pci 0000:10:1d.0: PME# supported from D0 D3hot
[    7.876186] pci 0000:10:1e.0: [1d0f:cd01] type 00 class 0x010802
[    7.880884] pci 0000:10:1e.0: reg 0x10: [mem 0xc680c000-0xc680ffff]
[    7.886054] pci 0000:10:1e.0: reg 0x18: [mem 0x39d4177fe000-0x39d4177fffff 64bit pref]
[    7.891929] pci 0000:10:1f.0: [1d0f:cd01] type 00 class 0x010802
[    7.896868] pci 0000:10:1f.0: reg 0x10: [mem 0xc6810000-0xc6813fff]
[    7.902117] pci 0000:10:1f.0: reg 0x18: [mem 0x39d4177fc000-0x39d4177fdfff 64bit pref]
[    7.907674] pci_bus 0000:10: on NUMA node 0
[    7.907814] ACPI: PCI Root Bridge [PC02] (domain 0000 [bus 20])
[    7.911599] acpi PNP0A03:02: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    7.915611] acpi PNP0A03:02: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    7.919844] acpiphp: Slot [64] registered
[    7.923007] acpiphp: Slot [65] registered
[    7.923612] acpiphp: Slot [66] registered
[    7.926772] acpiphp: Slot [67] registered
[    7.927609] acpiphp: Slot [68] registered
[    7.930805] acpiphp: Slot [69] registered
[    7.931609] acpiphp: Slot [70] registered
[    7.934783] acpiphp: Slot [71] registered
[    7.935610] acpiphp: Slot [72] registered
[    7.938797] acpiphp: Slot [73] registered
[    7.939609] acpiphp: Slot [74] registered
[    7.942788] acpiphp: Slot [75] registered
[    7.943611] acpiphp: Slot [76] registered
[    7.946791] acpiphp: Slot [77] registered
[    7.947609] acpiphp: Slot [78] registered
[    7.950778] acpiphp: Slot [79] registered
[    7.951608] acpiphp: Slot [80] registered
[    7.954756] acpiphp: Slot [81] registered
[    7.955608] acpiphp: Slot [82] registered
[    7.958801] acpiphp: Slot [83] registered
[    7.959609] acpiphp: Slot [84] registered
[    7.962860] acpiphp: Slot [85] registered
[    7.963609] acpiphp: Slot [86] registered
[    7.966806] acpiphp: Slot [87] registered
[    7.967609] acpiphp: Slot [88] registered
[    7.970769] acpiphp: Slot [89] registered
[    7.971644] acpiphp: Slot [90] registered
[    7.971922] acpiphp: Slot [91] registered
[    7.974781] acpiphp: Slot [92] registered
[    7.975594] acpiphp: Slot [93] registered
[    7.983610] acpiphp: Slot [94] registered
[    7.986790] acpiphp: Slot [95] registered
[    7.987609] PCI host bridge to bus 0000:20
[    7.990768] pci_bus 0000:20: root bus resource [bus 20]
[    7.995598] pci_bus 0000:20: root bus resource [mem 0xc8000000-0xcbffffff window]
[    7.999597] pci_bus 0000:20: root bus resource [mem 0x3ac000000000-0x3af4e80fffff window]
[    8.007906] pci 0000:20:01.0: [1d0f:ec20] type 00 class 0x020000
[    8.013576] pci 0000:20:01.0: reg 0x10: [mem 0xca800000-0xca803fff]
[    8.023286] pci 0000:20:01.0: reg 0x18: [mem 0x3ad417c00000-0x3ad417ffffff 64bit pref]
[    8.031972] pci 0000:20:01.0: enabling Extended Tags
[    8.041973] pci 0000:20:1b.0: [1d0f:efa0] type 00 class 0x020000
[    8.049794] pci 0000:20:1b.0: reg 0x10: [mem 0xca804000-0xca807fff]
[    8.056017] pci 0000:20:1b.0: reg 0x18: [mem 0x3ad418000000-0x3ad41fffffff 64bit pref]
[    8.065857] pci 0000:20:1b.0: reg 0x20: [mem 0xca000000-0xca7fffff]
[    8.076022] pci 0000:20:1b.0: enabling Extended Tags
[    8.080361] pci 0000:20:1c.0: [10de:20b0] type 00 class 0x030200
[    8.137275] pci 0000:20:1c.0: reg 0x10: [mem 0xc8000000-0xc8ffffff]
[    8.157259] pci 0000:20:1c.0: reg 0x14: [mem 0x3ae000000000-0x3aefffffffff 64bit pref]
[    8.177252] pci 0000:20:1c.0: reg 0x1c: [mem 0x3af420000000-0x3af421ffffff 64bit pref]
[    8.213360] pci 0000:20:1c.0: Enabling HDA controller
[    8.221299] pci 0000:20:1c.0: PME# supported from D0 D3hot
[    8.224249] pci 0000:20:1d.0: [10de:20b0] type 00 class 0x030200
[    8.281324] pci 0000:20:1d.0: reg 0x10: [mem 0xc9000000-0xc9ffffff]
[    8.301268] pci 0000:20:1d.0: reg 0x14: [mem 0x3ac000000000-0x3acfffffffff 64bit pref]
[    8.321267] pci 0000:20:1d.0: reg 0x1c: [mem 0x3ad420000000-0x3ad421ffffff 64bit pref]
[    8.357357] pci 0000:20:1d.0: Enabling HDA controller
[    8.364049] pci 0000:20:1d.0: PME# supported from D0 D3hot
[    8.368185] pci 0000:20:1e.0: [1d0f:cd01] type 00 class 0x010802
[    8.372941] pci 0000:20:1e.0: reg 0x10: [mem 0xca808000-0xca80bfff]
[    8.377839] pci 0000:20:1e.0: reg 0x18: [mem 0x3ad417bfe000-0x3ad417bfffff 64bit pref]
[    8.387777] pci 0000:20:1f.0: [1d0f:cd01] type 00 class 0x010802
[    8.392873] pci 0000:20:1f.0: reg 0x10: [mem 0xca80c000-0xca80ffff]
[    8.402123] pci 0000:20:1f.0: reg 0x18: [mem 0x3ad417bfc000-0x3ad417bfdfff 64bit pref]
[    8.412104] pci_bus 0000:20: on NUMA node 0
[    8.412246] ACPI: PCI Root Bridge [PC03] (domain 0000 [bus 80])
[    8.415599] acpi PNP0A03:03: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    8.423611] acpi PNP0A03:03: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    8.427857] acpiphp: Slot [96] registered
[    8.431038] acpiphp: Slot [97] registered
[    8.435609] acpiphp: Slot [98] registered
[    8.438793] acpiphp: Slot [99] registered
[    8.443610] acpiphp: Slot [100] registered
[    8.446812] acpiphp: Slot [101] registered
[    8.447609] acpiphp: Slot [102] registered
[    8.450832] acpiphp: Slot [103] registered
[    8.455609] acpiphp: Slot [104] registered
[    8.458836] acpiphp: Slot [105] registered
[    8.459610] acpiphp: Slot [106] registered
[    8.462832] acpiphp: Slot [107] registered
[    8.467609] acpiphp: Slot [108] registered
[    8.470840] acpiphp: Slot [109] registered
[    8.475610] acpiphp: Slot [110] registered
[    8.478792] acpiphp: Slot [111] registered
[    8.479609] acpiphp: Slot [112] registered
[    8.482822] acpiphp: Slot [113] registered
[    8.487610] acpiphp: Slot [114] registered
[    8.490812] acpiphp: Slot [115] registered
[    8.491610] acpiphp: Slot [116] registered
[    8.494825] acpiphp: Slot [117] registered
[    8.499609] acpiphp: Slot [118] registered
[    8.502824] acpiphp: Slot [119] registered
[    8.507613] acpiphp: Slot [120] registered
[    8.510864] acpiphp: Slot [121] registered
[    8.511610] acpiphp: Slot [122] registered
[    8.514862] acpiphp: Slot [123] registered
[    8.519609] acpiphp: Slot [124] registered
[    8.522818] acpiphp: Slot [125] registered
[    8.527609] acpiphp: Slot [126] registered
[    8.530830] acpiphp: Slot [127] registered
[    8.531605] PCI host bridge to bus 0000:80
[    8.534806] pci_bus 0000:80: root bus resource [bus 80]
[    8.539597] pci_bus 0000:80: root bus resource [mem 0xd4000000-0xdfffffff window]
[    8.549710] pci 0000:80:1a.0: [10de:1af1] type 00 class 0x068000
[    8.591203] pci 0000:80:1a.0: reg 0x10: [mem 0xd4000000-0xd5ffffff]
[    8.615916] pci 0000:80:1a.0: PME# supported from D0 D3hot
[    8.620337] pci 0000:80:1b.0: [10de:1af1] type 00 class 0x068000
[    8.663152] pci 0000:80:1b.0: reg 0x10: [mem 0xd6000000-0xd7ffffff]
[    8.688528] pci 0000:80:1b.0: PME# supported from D0 D3hot
[    8.692345] pci 0000:80:1c.0: [10de:1af1] type 00 class 0x068000
[    8.731234] pci 0000:80:1c.0: reg 0x10: [mem 0xd8000000-0xd9ffffff]
[    8.756542] pci 0000:80:1c.0: PME# supported from D0 D3hot
[    8.760343] pci 0000:80:1d.0: [10de:1af1] type 00 class 0x068000
[    8.807351] pci 0000:80:1d.0: reg 0x10: [mem 0xda000000-0xdbffffff]
[    8.832535] pci 0000:80:1d.0: PME# supported from D0 D3hot
[    8.836345] pci 0000:80:1e.0: [10de:1af1] type 00 class 0x068000
[    8.859278] pci 0000:80:1e.0: reg 0x10: [mem 0xdc000000-0xddffffff]
[    8.879997] pci 0000:80:1e.0: PME# supported from D0 D3hot
[    8.884345] pci 0000:80:1f.0: [10de:1af1] type 00 class 0x068000
[    8.927242] pci 0000:80:1f.0: reg 0x10: [mem 0xde000000-0xdfffffff]
[    8.952546] pci 0000:80:1f.0: PME# supported from D0 D3hot
[    8.956198] pci_bus 0000:80: on NUMA node 1
[    8.956329] ACPI: PCI Root Bridge [PC04] (domain 0000 [bus 90])
[    8.959599] acpi PNP0A03:04: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    8.967611] acpi PNP0A03:04: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    8.975866] acpiphp: Slot [128] registered
[    8.979093] acpiphp: Slot [129] registered
[    8.979612] acpiphp: Slot [130] registered
[    8.982794] acpiphp: Slot [131] registered
[    8.987609] acpiphp: Slot [132] registered
[    8.990825] acpiphp: Slot [133] registered
[    8.995611] acpiphp: Slot [134] registered
[    8.998801] acpiphp: Slot [135] registered
[    8.999611] acpiphp: Slot [136] registered
[    9.002809] acpiphp: Slot [137] registered
[    9.007609] acpiphp: Slot [138] registered
[    9.010916] acpiphp: Slot [139] registered
[    9.015610] acpiphp: Slot [140] registered
[    9.018937] acpiphp: Slot [141] registered
[    9.019611] acpiphp: Slot [142] registered
[    9.022839] acpiphp: Slot [143] registered
[    9.027609] acpiphp: Slot [144] registered
[    9.030866] acpiphp: Slot [145] registered
[    9.031609] acpiphp: Slot [146] registered
[    9.034839] acpiphp: Slot [147] registered
[    9.039609] acpiphp: Slot [148] registered
[    9.042817] acpiphp: Slot [149] registered
[    9.047611] acpiphp: Slot [150] registered
[    9.050859] acpiphp: Slot [151] registered
[    9.051609] acpiphp: Slot [152] registered
[    9.054824] acpiphp: Slot [153] registered
[    9.059610] acpiphp: Slot [154] registered
[    9.062839] acpiphp: Slot [155] registered
[    9.063609] acpiphp: Slot [156] registered
[    9.066888] acpiphp: Slot [157] registered
[    9.071616] acpiphp: Slot [158] registered
[    9.074825] acpiphp: Slot [159] registered
[    9.079606] PCI host bridge to bus 0000:90
[    9.082813] pci_bus 0000:90: root bus resource [bus 90]
[    9.083597] pci_bus 0000:90: root bus resource [mem 0xcc000000-0xcfffffff window]
[    9.091597] pci_bus 0000:90: root bus resource [mem 0x3ec000000000-0x3ef4e7ffffff window]
[    9.095910] pci 0000:90:01.0: [1d0f:ec20] type 00 class 0x020000
[    9.105593] pci 0000:90:01.0: reg 0x10: [mem 0xce800000-0xce803fff]
[    9.111119] pci 0000:90:01.0: reg 0x18: [mem 0x3ed417c00000-0x3ed417ffffff 64bit pref]
[    9.124001] pci 0000:90:01.0: enabling Extended Tags
[    9.133964] pci 0000:90:1b.0: [1d0f:efa0] type 00 class 0x020000
[    9.138111] pci 0000:90:1b.0: reg 0x10: [mem 0xce804000-0xce807fff]
[    9.148040] pci 0000:90:1b.0: reg 0x18: [mem 0x3ed418000000-0x3ed41fffffff 64bit pref]
[    9.154072] pci 0000:90:1b.0: reg 0x20: [mem 0xce000000-0xce7fffff]
[    9.164046] pci 0000:90:1b.0: enabling Extended Tags
[    9.168359] pci 0000:90:1c.0: [10de:20b0] type 00 class 0x030200
[    9.225362] pci 0000:90:1c.0: reg 0x10: [mem 0xcc000000-0xccffffff]
[    9.245352] pci 0000:90:1c.0: reg 0x14: [mem 0x3ee000000000-0x3eefffffffff 64bit pref]
[    9.269348] pci 0000:90:1c.0: reg 0x1c: [mem 0x3ef420000000-0x3ef421ffffff 64bit pref]
[    9.397440] pci 0000:90:1c.0: Enabling HDA controller
[    9.404052] pci 0000:90:1c.0: PME# supported from D0 D3hot
[    9.408257] pci 0000:90:1d.0: [10de:20b0] type 00 class 0x030200
[    9.467598] pci 0000:90:1d.0: reg 0x10: [mem 0xcd000000-0xcdffffff]
[    9.487600] pci 0000:90:1d.0: reg 0x14: [mem 0x3ec000000000-0x3ecfffffffff 64bit pref]
[    9.509278] pci 0000:90:1d.0: reg 0x1c: [mem 0x3ed420000000-0x3ed421ffffff 64bit pref]
[    9.545374] pci 0000:90:1d.0: Enabling HDA controller
[    9.548056] pci 0000:90:1d.0: PME# supported from D0 D3hot
[    9.552189] pci 0000:90:1e.0: [1d0f:cd01] type 00 class 0x010802
[    9.556759] pci 0000:90:1e.0: reg 0x10: [mem 0xce808000-0xce80bfff]
[    9.561651] pci 0000:90:1e.0: reg 0x18: [mem 0x3ed417bfe000-0x3ed417bfffff 64bit pref]
[    9.567396] pci 0000:90:1f.0: [1d0f:cd01] type 00 class 0x010802
[    9.568646] pci 0000:90:1f.0: reg 0x10: [mem 0xce80c000-0xce80ffff]
[    9.573720] pci 0000:90:1f.0: reg 0x18: [mem 0x3ed417bfc000-0x3ed417bfdfff 64bit pref]
[    9.579194] pci_bus 0000:90: on NUMA node 1
[    9.579336] ACPI: PCI Root Bridge [PC05] (domain 0000 [bus a0])
[    9.579599] acpi PNP0A03:05: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    9.583611] acpi PNP0A03:05: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
[    9.587867] acpiphp: Slot [160] registered
[    9.591068] acpiphp: Slot [161] registered
[    9.591610] acpiphp: Slot [162] registered
[    9.594800] acpiphp: Slot [163] registered
[    9.595611] acpiphp: Slot [164] registered
[    9.598834] acpiphp: Slot [165] registered
[    9.599609] acpiphp: Slot [166] registered
[    9.602829] acpiphp: Slot [167] registered
[    9.603608] acpiphp: Slot [168] registered
[    9.606815] acpiphp: Slot [169] registered
[    9.607610] acpiphp: Slot [170] registered
[    9.610815] acpiphp: Slot [171] registered
[    9.611609] acpiphp: Slot [172] registered
[    9.614823] acpiphp: Slot [173] registered
[    9.615608] acpiphp: Slot [174] registered
[    9.618876] acpiphp: Slot [175] registered
[    9.619610] acpiphp: Slot [176] registered
[    9.622853] acpiphp: Slot [177] registered
[    9.623608] acpiphp: Slot [178] registered
[    9.626901] acpiphp: Slot [179] registered
[    9.627610] acpiphp: Slot [180] registered
[    9.630867] acpiphp: Slot [181] registered
[    9.631613] acpiphp: Slot [182] registered
[    9.634830] acpiphp: Slot [183] registered
[    9.635608] acpiphp: Slot [184] registered
[    9.638835] acpiphp: Slot [185] registered
[    9.639609] acpiphp: Slot [186] registered
[    9.642832] acpiphp: Slot [187] registered
[    9.643609] acpiphp: Slot [188] registered
[    9.646924] acpiphp: Slot [189] registered
[    9.647609] acpiphp: Slot [190] registered
[    9.650796] acpiphp: Slot [191] registered
[    9.651605] PCI host bridge to bus 0000:a0
[    9.654769] pci_bus 0000:a0: root bus resource [bus a0]
[    9.655597] pci_bus 0000:a0: root bus resource [mem 0xd0000000-0xd3ffffff window]
[    9.659597] pci_bus 0000:a0: root bus resource [mem 0x3fc000000000-0x3ff4e7ffffff window]
[    9.663905] pci 0000:a0:01.0: [1d0f:ec20] type 00 class 0x020000
[    9.669697] pci 0000:a0:01.0: reg 0x10: [mem 0xd2800000-0xd2803fff]
[    9.675161] pci 0000:a0:01.0: reg 0x18: [mem 0x3fd417c00000-0x3fd417ffffff 64bit pref]
[    9.679998] pci 0000:a0:01.0: enabling Extended Tags
[    9.689930] pci 0000:a0:1b.0: [1d0f:efa0] type 00 class 0x020000
[    9.694137] pci 0000:a0:1b.0: reg 0x10: [mem 0xd2804000-0xd2807fff]
[    9.700031] pci 0000:a0:1b.0: reg 0x18: [mem 0x3fd418000000-0x3fd41fffffff 64bit pref]
[    9.706015] pci 0000:a0:1b.0: reg 0x20: [mem 0xd2000000-0xd27fffff]
[    9.712057] pci 0000:a0:1b.0: enabling Extended Tags
[    9.716331] pci 0000:a0:1c.0: [10de:20b0] type 00 class 0x030200
[    9.729373] pci 0000:a0:1c.0: reg 0x10: [mem 0xd0000000-0xd0ffffff]
[    9.737368] pci 0000:a0:1c.0: reg 0x14: [mem 0x3fe000000000-0x3fefffffffff 64bit pref]
[    9.745372] pci 0000:a0:1c.0: reg 0x1c: [mem 0x3ff420000000-0x3ff421ffffff 64bit pref]
[    9.757491] pci 0000:a0:1c.0: Enabling HDA controller
[    9.760053] pci 0000:a0:1c.0: PME# supported from D0 D3hot
[    9.764254] pci 0000:a0:1d.0: [10de:20b0] type 00 class 0x030200
[    9.781298] pci 0000:a0:1d.0: reg 0x10: [mem 0xd1000000-0xd1ffffff]
[    9.789299] pci 0000:a0:1d.0: reg 0x14: [mem 0x3fc000000000-0x3fcfffffffff 64bit pref]
[    9.797310] pci 0000:a0:1d.0: reg 0x1c: [mem 0x3fd420000000-0x3fd421ffffff 64bit pref]
[    9.809400] pci 0000:a0:1d.0: Enabling HDA controller
[    9.813224] pci 0000:a0:1d.0: PME# supported from D0 D3hot
[    9.816184] pci 0000:a0:1e.0: [1d0f:cd01] type 00 class 0x010802
[    9.820746] pci 0000:a0:1e.0: reg 0x10: [mem 0xd2808000-0xd280bfff]
[    9.825540] pci 0000:a0:1e.0: reg 0x18: [mem 0x3fd417bfe000-0x3fd417bfffff 64bit pref]
[    9.831302] pci 0000:a0:1f.0: [1d0f:cd01] type 00 class 0x010802
[    9.832549] pci 0000:a0:1f.0: reg 0x10: [mem 0xd280c000-0xd280ffff]
[    9.837573] pci 0000:a0:1f.0: reg 0x18: [mem 0x3fd417bfc000-0x3fd417bfdfff 64bit pref]
[    9.843274] pci_bus 0000:a0: on NUMA node 1
[    9.843472] ACPI: PCI: Interrupt link LNKA configured for IRQ 10
[    9.847674] ACPI: PCI: Interrupt link LNKB configured for IRQ 10
[    9.851734] ACPI: PCI: Interrupt link LNKC configured for IRQ 11
[    9.859667] ACPI: PCI: Interrupt link LNKD configured for IRQ 11
[    9.863753] ACPI: PCI: Interrupt link LNKS configured for IRQ 9
[    9.876443] iommu: Default domain type: Translated 
[    9.879831] SCSI subsystem initialized
[    9.882939] libata version 3.00 loaded.
[    9.882939] pci 0000:00:03.0: vgaarb: setting as boot VGA device
[    9.883551] pci 0000:00:03.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    9.883602] pci 0000:00:03.0: vgaarb: bridge control possible
[    9.887442] vgaarb: loaded
[    9.887611] ACPI: bus type USB registered
[    9.890803] usbcore: registered new interface driver usbfs
[    9.891601] usbcore: registered new interface driver hub
[    9.895310] usbcore: registered new device driver usb
[    9.895611] pps_core: LinuxPPS API ver. 1 registered
[    9.899154] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    9.899598] PTP clock support registered
[    9.902770] EDAC MC: Ver: 3.0.0
[    9.905738] NetLabel: Initializing
[    9.907597] NetLabel:  domain hash size = 128
[    9.910877] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    9.911607] NetLabel:  unlabeled traffic allowed by default
[    9.915360] PCI: Using ACPI for IRQ routing
[    9.915597] PCI: pci_cache_line_size set to 64 bytes
[    9.916073] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
[    9.916075] e820: reserve RAM buffer [mem 0x7ffe2000-0x7fffffff]
[    9.916423] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[    9.919597] hpet0: 8 comparators, 32-bit 62.500000 MHz counter
[    9.926869] clocksource: Switched to clocksource kvm-clock
[    9.938840] VFS: Disk quotas dquot_6.6.0
[    9.942145] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    9.946610] AppArmor: AppArmor Filesystem Enabled
[    9.950056] pnp: PnP ACPI init
[    9.952890] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
[    9.952914] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
[    9.952931] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
[    9.952975] pnp 00:03: Plug and Play ACPI device, IDs PNP0400 (active)
[    9.953017] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
[    9.953371] pnp: PnP ACPI: found 5 devices
[    9.962407] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    9.969094] NET: Registered protocol family 2
[    9.972856] IP idents hash table entries: 262144 (order: 9, 2097152 bytes, vmalloc)
[    9.981946] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc)
[    9.989500] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    9.996376] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc)
[   10.002390] TCP: Hash tables configured (established 524288 bind 65536)
[   10.007100] MPTCP token hash table entries: 65536 (order: 8, 1572864 bytes, vmalloc)
[   10.013809] UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
[   10.018762] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
[   10.025131] NET: Registered protocol family 1
[   10.028418] NET: Registered protocol family 44
[   10.031765] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
[   10.035757] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
[   10.039724] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[   10.043963] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xc3ffffff window]
[   10.048227] pci_bus 0000:10: resource 4 [mem 0xc4000000-0xc7ffffff window]
[   10.052469] pci_bus 0000:10: resource 5 [mem 0x39c000000000-0x39f4e80fffff window]
[   10.058538] pci_bus 0000:20: resource 4 [mem 0xc8000000-0xcbffffff window]
[   10.062764] pci_bus 0000:20: resource 5 [mem 0x3ac000000000-0x3af4e80fffff window]
[   10.068841] pci_bus 0000:80: resource 4 [mem 0xd4000000-0xdfffffff window]
[   10.073099] pci_bus 0000:90: resource 4 [mem 0xcc000000-0xcfffffff window]
[   10.077336] pci_bus 0000:90: resource 5 [mem 0x3ec000000000-0x3ef4e7ffffff window]
[   10.083370] pci_bus 0000:a0: resource 4 [mem 0xd0000000-0xd3ffffff window]
[   10.087613] pci_bus 0000:a0: resource 5 [mem 0x3fc000000000-0x3ff4e7ffffff window]
[   10.093661] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[   10.097567] pci 0000:00:01.0: Activating ISA DMA hang workarounds
[   10.103150] PCI: CLS 32 bytes, default 64
[   10.106327] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[   10.106377] Trying to unpack rootfs image as initramfs...
[   10.110420] software IO TLB: mapped [mem 0x000000007bfe2000-0x000000007ffe2000] (64MB)
[   10.110487] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2b3e43c8763, max_idle_ns: 440795360101 ns
[   10.129067] clocksource: Switched to clocksource tsc
[   10.133693] Initialise system trusted keyrings
[   10.137062] Key type blacklist registered
[   10.140299] workingset: timestamp_bits=36 max_order=29 bucket_order=0
[   10.145311] zbud: loaded
[   10.241271] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[   10.245443] fuse: init (API version 7.34)
[   10.249091] integrity: Platform Keyring initialized
[   10.258861] Freeing initrd memory: 97704K
[   10.258883] Key type asymmetric registered
[   10.265258] Asymmetric key parser 'x509' registered
[   10.268745] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243)
[   10.274832] io scheduler mq-deadline registered
[   10.279661] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[   10.288104] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[   10.293978] ACPI: button: Power Button [PWRF]
[   10.297232] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[   10.302935] ACPI: button: Sleep Button [SLPF]
[   10.308443] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[   10.338328] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[   10.344745] Linux agpgart interface v0.103
[   10.495684] loop: module loaded
[   10.499384] nvme nvme0: pci function 0000:00:04.0
[   10.502871] nvme nvme1: pci function 0000:00:1f.0
[   10.506426] nvme nvme2: pci function 0000:10:1e.0
[   10.510016] nvme nvme3: pci function 0000:10:1f.0
[   10.513058] nvme nvme0: 2/0/0 default/read/poll queues
[   10.513598] nvme nvme4: pci function 0000:20:1e.0
[   10.514017] nvme nvme1: 2/0/0 default/read/poll queues
[   10.515567]  nvme1n1: p1
[   10.522833]  nvme0n1: p1 p128
[   10.524378] nvme nvme2: 31/0/0 default/read/poll queues
[   10.524500] nvme nvme5: pci function 0000:20:1f.0
[   10.527448] nvme nvme3: 31/0/0 default/read/poll queues
[   10.530241] nvme nvme6: pci function 0000:90:1e.0
[   10.540680] nvme nvme4: 31/0/0 default/read/poll queues
[   10.544012] nvme nvme7: pci function 0000:90:1f.0
[   10.549469] nvme nvme5: 31/0/0 default/read/poll queues
[   10.551655] nvme nvme8: pci function 0000:a0:1e.0
[   10.558691] nvme nvme9: pci function 0000:a0:1f.0
[   10.562593] tun: Universal TUN/TAP device driver, 1.6
[   10.566629] PPP generic driver version 2.4.2
[   10.570060] VFIO - User Level meta-driver version: 0.3
[   10.573946] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[   10.574648] nvme nvme6: 31/0/0 default/read/poll queues
[   10.578358] ehci-pci: EHCI PCI platform driver
[   10.579555] nvme nvme7: 31/0/0 default/read/poll queues
[   10.580641] nvme nvme8: 31/0/0 default/read/poll queues
[   10.584822] nvme nvme9: 31/0/0 default/read/poll queues
[   10.585570] ehci-platform: EHCI generic platform driver
[   10.599759] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[   10.603676] ohci-pci: OHCI PCI platform driver
[   10.606945] ohci-platform: OHCI generic platform driver
[   10.610491] uhci_hcd: USB Universal Host Controller Interface driver
[   10.614415] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[   10.620620] i8042: Warning: Keylock active
[   10.625013] serio: i8042 KBD port at 0x60,0x64 irq 1
[   10.628511] serio: i8042 AUX port at 0x60,0x64 irq 12
[   10.632017] mousedev: PS/2 mouse device common for all mice
[   10.635848] rtc_cmos 00:00: RTC can wake from S4
[   10.640113] rtc_cmos 00:00: registered as rtc0
[   10.643583] rtc_cmos 00:00: setting system clock to 2022-07-28T07:10:09 UTC (1658992209)
[   10.649716] rtc_cmos 00:00: alarms up to one day, 114 bytes nvram
[   10.653517] i2c /dev entries driver
[   10.656412] device-mapper: uevent: version 1.0.3
[   10.659739] device-mapper: ioctl: 4.45.0-ioctl (2021-03-22) initialised: dm-devel@redhat.com
[   10.665940] platform eisa.0: Probing EISA bus 0
[   10.669197] platform eisa.0: EISA: Cannot allocate resource for mainboard
[   10.673259] platform eisa.0: Cannot allocate resource for EISA slot 1
[   10.677166] platform eisa.0: Cannot allocate resource for EISA slot 2
[   10.681058] platform eisa.0: Cannot allocate resource for EISA slot 3
[   10.684963] platform eisa.0: Cannot allocate resource for EISA slot 4
[   10.688876] platform eisa.0: Cannot allocate resource for EISA slot 5
[   10.692817] platform eisa.0: Cannot allocate resource for EISA slot 6
[   10.696713] platform eisa.0: Cannot allocate resource for EISA slot 7
[   10.700623] platform eisa.0: Cannot allocate resource for EISA slot 8
[   10.704507] platform eisa.0: EISA: Detected 0 cards
[   10.707836] intel_pstate: P-states controlled by the platform
[   10.716839] ledtrig-cpu: registered to indicate activity on CPUs
[   10.720867] drop_monitor: Initializing network drop monitor service
[   10.724829] NET: Registered protocol family 10
[   10.733828] Segment Routing with IPv6
[   10.736905] NET: Registered protocol family 17
[   10.740243] Key type dns_resolver registered
[   10.756768] No MBM correction factor available
[   10.760111] IPI shorthand broadcast: enabled
[   10.763266] sched_clock: Marking stable (9579648027, 1180458248)->(11937101814, -1176995539)
[   10.772450] registered taskstats version 1
[   10.775586] Loading compiled-in X.509 certificates
[   10.779634] Loaded X.509 cert 'Build time autogenerated kernel key: 1c87debd80b0db7d2d960450056c96567636ad46'
[   10.786719] Loaded X.509 cert 'Canonical Ltd. Live Patch Signing: 14df34d1a87cf37625abec039ef2bf521249b969'
[   10.793768] Loaded X.509 cert 'Canonical Ltd. Kernel Module Signing: 88f752e560a1e0737e31163a466ad7b70a850c19'
[   10.800401] blacklist: Loading compiled-in revocation X.509 certificates
[   10.804501] Loaded X.509 cert 'Canonical Ltd. Secure Boot Signing: 61482aa2830d0ab2ad5af10b7250da9033ddcef0'
[   10.819215] zswap: loaded using pool lzo/zbud
[   10.823050] Key type ._fscrypt registered
[   10.826138] Key type .fscrypt registered
[   10.829151] Key type fscrypt-provisioning registered
[   10.836462] Key type encrypted registered
[   10.839627] AppArmor: AppArmor sha1 policy hashing enabled
[   10.843239] ima: No TPM chip found, activating TPM-bypass!
[   10.846890] Loading compiled-in module X.509 certificates
[   10.850821] Loaded X.509 cert 'Build time autogenerated kernel key: 1c87debd80b0db7d2d960450056c96567636ad46'
[   10.857591] ima: Allocated hash algorithm: sha1
[   10.860848] ima: No architecture policies found
[   10.864083] evm: Initialising EVM extended attributes:
[   10.867572] evm: security.selinux
[   10.870411] evm: security.SMACK64
[   10.873222] evm: security.SMACK64EXEC
[   10.876090] evm: security.SMACK64TRANSMUTE
[   10.879199] evm: security.SMACK64MMAP
[   10.882150] evm: security.apparmor
[   10.885011] evm: security.ima
[   10.887665] evm: security.capability
[   10.890563] evm: HMAC attrs: 0x1
[   10.893697] PM:   Magic number: 14:467:168
[   10.897090] acpi device:31: hash matches
[   10.900174] memory memory8796: hash matches
[   10.903414] memory memory8115: hash matches
[   10.906723] memory memory7024: hash matches
[   10.909859] memory memory6869: hash matches
[   10.913213] memory memory5583: hash matches
[   10.916471] memory memory4747: hash matches
[   10.919749] memory memory3461: hash matches
[   10.923054] memory memory2370: hash matches
[   10.926332] memory memory1534: hash matches
[   10.929486] memory memory1088: hash matches
[   10.932613] memory memory747: hash matches
[   10.951521] RAS: Correctable Errors collector initialized.
[   10.957517] Freeing unused decrypted memory: 2036K
[   10.961744] Freeing unused kernel image (initmem) memory: 2896K
[   10.999638] Write protecting the kernel read-only data: 30720k
[   11.004401] Freeing unused kernel image (text/rodata gap) memory: 2036K
[   11.009183] Freeing unused kernel image (rodata/data gap) memory: 1756K
[   11.085472] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[   11.089370] x86/mm: Checking user space page tables
[   11.154485] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[   11.158361] Run /init as init process
[   11.161284]   with arguments:
[   11.161285]     /init
[   11.161286]   with environment:
[   11.161287]     HOME=/
[   11.161288]     TERM=linux
[   11.161288]     BOOT_IMAGE=/boot/vmlinuz-5.13.0-1023-aws
[   11.167986] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input2
[   11.703789] cryptd: max_cpu_qlen set to 1000
[   11.704428] ena 0000:10:00.0: ENA device version: 0.10
[   11.711004] ena 0000:10:00.0: ENA controller version: 0.0.1 implementation version 1
[   11.728938] AVX2 version of gcm_enc/dec engaged.
[   11.731808] ena 0000:10:00.0: Elastic Network Adapter (ENA) found at mem c6800000, mac addr 02:60:79:9a:33:fb
[   11.740392] AES CTR mode by8 optimization enabled
[   11.740479] ena 0000:10:01.0: ENA device version: 0.10
[   11.747563] ena 0000:10:01.0: ENA controller version: 0.0.1 implementation version 1
[   11.756358] md127: detected capacity change from 0 to 14685380608
[   11.772286] ena 0000:10:01.0: Elastic Network Adapter (ENA) found at mem c6804000, mac addr 02:ea:29:2f:53:57
[   11.780381] ena 0000:20:01.0: ENA device version: 0.10
[   11.784156] ena 0000:20:01.0: ENA controller version: 0.0.1 implementation version 1
[   11.799250] ena 0000:20:01.0: Elastic Network Adapter (ENA) found at mem ca800000, mac addr 02:d0:ac:04:56:c1
[   11.806992] ena 0000:90:01.0: ENA device version: 0.10
[   11.810686] ena 0000:90:01.0: ENA controller version: 0.0.1 implementation version 1
[   11.814895] nvidia: loading out-of-tree module taints kernel.
[   11.814895] nvidia: loading out-of-tree module taints kernel.
[   11.814895] nvidia: loading out-of-tree module taints kernel.
[   11.814898] nvidia: loading out-of-tree module taints kernel.
[   11.814907] nvidia: module license 'NVIDIA' taints kernel.
[   11.814907] nvidia: module license 'NVIDIA' taints kernel.
[   11.814908] Disabling lock debugging due to kernel taint
[   11.832937] ena 0000:90:01.0: Elastic Network Adapter (ENA) found at mem ce800000, mac addr 02:3b:22:6c:86:a3
[   11.836646] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   11.840531] ena 0000:a0:01.0: ENA device version: 0.10
[   11.953774] ena 0000:a0:01.0: ENA controller version: 0.0.1 implementation version 1
[   11.964903] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[   11.968838] ena 0000:a0:01.0: Elastic Network Adapter (ENA) found at mem d2800000, mac addr 02:7d:ff:c7:46:5d

[   11.977967] nvidia-nvswitch: Probing device 0000:80:1a.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   11.987381] nvidia-nvswitch 0000:80:1a.0: can't derive routing for PCI INT A
[   11.988891] ena 0000:10:01.0 ens33: renamed from eth1
[   11.991671] nvidia-nvswitch 0000:80:1a.0: PCI INT A: no GSI - using ISA IRQ 10
[   12.312689] nvidia-nvswitch0: using MSI
[   12.512211] nvidia-nvswitch: Probing device 0000:80:1b.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   12.519211] nvidia-nvswitch 0000:80:1b.0: can't derive routing for PCI INT A
[   12.523442] nvidia-nvswitch 0000:80:1b.0: PCI INT A: no GSI - using ISA IRQ 11
[   12.523894] ena 0000:10:00.0 ens32: renamed from eth0
[   12.585674] input: ImPS/2 Generic Wheel Mouse as /devices/platform/i8042/serio1/input/input4
[   12.803749] ena 0000:20:01.0 ens65: renamed from eth2
[   12.856924] nvidia-nvswitch1: using MSI
[   13.057462] nvidia-nvswitch: Probing device 0000:80:1c.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   13.064467] nvidia-nvswitch 0000:80:1c.0: can't derive routing for PCI INT A
[   13.068705] nvidia-nvswitch 0000:80:1c.0: PCI INT A: no GSI - using ISA IRQ 11
[   13.107884] ena 0000:90:01.0 ens129: renamed from eth3
[   13.411286] nvidia-nvswitch2: using MSI
[   13.653383] nvidia-nvswitch: Probing device 0000:80:1d.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   13.660431] nvidia-nvswitch 0000:80:1d.0: can't derive routing for PCI INT A
[   13.664682] nvidia-nvswitch 0000:80:1d.0: PCI INT A: no GSI - using ISA IRQ 10
[   13.676367] ena 0000:a0:01.0 ens161: renamed from eth4
[   14.006394] nvidia-nvswitch3: using MSI
[   14.207517] nvidia-nvswitch: Probing device 0000:80:1e.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   14.214864] nvidia-nvswitch 0000:80:1e.0: can't derive routing for PCI INT A
[   14.219108] nvidia-nvswitch 0000:80:1e.0: PCI INT A: no GSI - using ISA IRQ 10
[   14.558065] nvidia-nvswitch4: using MSI
[   14.760095] nvidia-nvswitch: Probing device 0000:80:1f.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000 
[   14.767383] nvidia-nvswitch 0000:80:1f.0: can't derive routing for PCI INT A
[   14.771684] nvidia-nvswitch 0000:80:1f.0: PCI INT A: no GSI - using ISA IRQ 11
[   15.111720] nvidia-nvswitch5: using MSI
[   15.314290] nvidia 0000:10:1c.0: can't derive routing for PCI INT A
[   15.318243] nvidia 0000:10:1c.0: PCI INT A: no GSI - using ISA IRQ 11
[   15.372373] nvidia 0000:10:1d.0: can't derive routing for PCI INT A
[   15.376381] nvidia 0000:10:1d.0: PCI INT A: no GSI - using ISA IRQ 10
[   15.426467] nvidia 0000:20:1c.0: can't derive routing for PCI INT A
[   15.430409] nvidia 0000:20:1c.0: PCI INT A: no GSI - using ISA IRQ 11
[   15.481000] nvidia 0000:20:1d.0: can't derive routing for PCI INT A
[   15.485009] nvidia 0000:20:1d.0: PCI INT A: no GSI - using ISA IRQ 10
[   15.534587] nvidia 0000:90:1c.0: can't derive routing for PCI INT A
[   15.538926] nvidia 0000:90:1c.0: PCI INT A: no GSI - using ISA IRQ 11
[   15.588587] nvidia 0000:90:1d.0: can't derive routing for PCI INT A
[   15.592611] nvidia 0000:90:1d.0: PCI INT A: no GSI - using ISA IRQ 10
[   15.646342] nvidia 0000:a0:1c.0: can't derive routing for PCI INT A
[   15.650261] nvidia 0000:a0:1c.0: PCI INT A: no GSI - using ISA IRQ 11
[   15.700342] nvidia 0000:a0:1d.0: can't derive routing for PCI INT A
[   15.704368] nvidia 0000:a0:1d.0: PCI INT A: no GSI - using ISA IRQ 10
[   15.751639] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.73.08  Wed May 18 20:34:14 UTC 2022
[   15.762982] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  510.73.08  Wed May 18 20:27:26 UTC 2022
[   15.771992] [drm] [nvidia-drm] [GPU ID 0x0000101c] Loading driver
[   15.775982] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:10:1c.0 on minor 0
[   15.782159] [drm] [nvidia-drm] [GPU ID 0x0000101d] Loading driver
[   15.786077] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:10:1d.0 on minor 1
[   15.792033] [drm] [nvidia-drm] [GPU ID 0x0000201c] Loading driver
[   15.795933] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:20:1c.0 on minor 2
[   15.802281] [drm] [nvidia-drm] [GPU ID 0x0000201d] Loading driver
[   15.806142] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:20:1d.0 on minor 3
[   15.812058] [drm] [nvidia-drm] [GPU ID 0x0000901c] Loading driver
[   15.815967] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:90:1c.0 on minor 4
[   15.822050] [drm] [nvidia-drm] [GPU ID 0x0000901d] Loading driver
[   15.825959] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:90:1d.0 on minor 5
[   15.832019] [drm] [nvidia-drm] [GPU ID 0x0000a01c] Loading driver
[   15.835927] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:a0:1c.0 on minor 6
[   15.842002] [drm] [nvidia-drm] [GPU ID 0x0000a01d] Loading driver
[   15.845768] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:a0:1d.0 on minor 7
[   17.375598] raid6: avx512x4 gen() 286620 MB/s
[   17.423598] raid6: avx512x4 xor() 132062 MB/s
[   17.471598] raid6: avx512x2 gen() 286885 MB/s
[   17.519598] raid6: avx512x2 xor() 494582 MB/s
[   17.567600] raid6: avx512x1 gen() 287266 MB/s
[   17.615598] raid6: avx512x1 xor() 448854 MB/s
[   17.663599] raid6: avx2x4   gen() 286438 MB/s
[   17.711600] raid6: avx2x4   xor() 126087 MB/s
[   17.759599] raid6: avx2x2   gen() 286869 MB/s
[   17.807597] raid6: avx2x2   xor() 363353 MB/s
[   17.855598] raid6: avx2x1   gen() 218711 MB/s
[   17.903598] raid6: avx2x1   xor() 308254 MB/s
[   17.951600] raid6: sse2x4   gen() 195529 MB/s
[   17.999598] raid6: sse2x4   xor() 122240 MB/s
[   18.047599] raid6: sse2x2   gen() 211031 MB/s
[   18.095598] raid6: sse2x2   xor() 128693 MB/s
[   18.143600] raid6: sse2x1   gen() 194295 MB/s
[   18.191599] raid6: sse2x1   xor() 102450 MB/s
[   18.194844] raid6: using algorithm avx512x1 gen() 12696 MB/s
[   18.198536] raid6: .... xor() 448854 MB/s, rmw enabled
[   18.202001] raid6: using avx512x2 recovery algorithm
[   18.206547] xor: automatically using best checksumming function   avx       
[   18.212155] async_tx: api initialized (async)
[   18.286138] Btrfs loaded, crc32c=crc32c-intel, zoned=yes
[   18.372884] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[   18.632655] systemd[1]: Inserted module 'autofs4'
[   18.658125] systemd[1]: systemd 245.4-4ubuntu3.15 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
[   18.672705] systemd[1]: Detected virtualization kvm.
[   18.676222] systemd[1]: Detected architecture x86-64.
[   18.712678] systemd[1]: Set hostname to <ip-10-216-181-207>.
[   18.903191] systemd[1]: Configuration file /etc/systemd/system/ufw.service.d/override.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   18.966349] systemd[1]: Configuration file /etc/systemd/system/sensei-tags.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   18.978798] systemd[1]: Configuration file /etc/systemd/system/sensei-init-script-setup.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   18.991924] systemd[1]: Configuration file /etc/systemd/system/sensei-init-script.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.016129] systemd[1]: Configuration file /etc/systemd/system/sensei-fs-symlink.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.032760] systemd[1]: Configuration file /etc/systemd/system/process-exporter.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.053714] systemd[1]: Configuration file /etc/systemd/system/node_exporter.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.069726] systemd[1]: Configuration file /etc/systemd/system/mpproxy.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.082181] systemd[1]: Configuration file /etc/systemd/system/miniprom.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.096725] systemd[1]: Configuration file /etc/systemd/system/jupyterlab.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[   19.108339] systemd[1]: Configuration file /etc/systemd/system/jupyterlab-setup-user.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.133870] systemd[1]: Configuration file /usr/lib/systemd/system/docker.service.d/users-permission-docker-socket.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.150268] systemd[1]: Configuration file /etc/systemd/system/dcgm_exporter.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[   19.160867] systemd[1]: Configuration file /etc/systemd/system/dcgm_exporter.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.178047] systemd[1]: Configuration file /etc/systemd/system/sensei-init-script-started.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.193757] systemd[1]: Configuration file /etc/systemd/system/aws-mount-local-ssds.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[   19.273537] systemd[1]: Created slice system-modprobe.slice.
[   19.368068] systemd[1]: Created slice system-serial\x2dgetty.slice.
[   19.373431] systemd[1]: Created slice User and Session Slice.
[   19.378188] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[   19.385378] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[   19.392985] systemd[1]: Reached target Slices.
[   19.396981] systemd[1]: Reached target Swap.
[   19.400874] systemd[1]: Reached target System Time Set.
[   19.405298] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[   19.410548] systemd[1]: Listening on LVM2 poll daemon socket.
[   19.415259] systemd[1]: Listening on multipathd control socket.
[   19.420165] systemd[1]: Listening on Syslog Socket.
[   19.424380] systemd[1]: Listening on fsck to fsckd communication Socket.
[   19.429613] systemd[1]: Listening on initctl Compatibility Named Pipe.
[   19.434849] systemd[1]: Listening on Journal Audit Socket.
[   19.439509] systemd[1]: Listening on Journal Socket (/dev/log).
[   19.444346] systemd[1]: Listening on Journal Socket.
[   19.448676] systemd[1]: Listening on Network Service Netlink Socket.
[   19.453786] systemd[1]: Listening on udev Control Socket.
[   19.458407] systemd[1]: Listening on udev Kernel Socket.
[   19.464055] systemd[1]: Mounting POSIX Message Queue File System...
[   19.470506] systemd[1]: Mounting Kernel Debug File System...
[   19.476356] systemd[1]: Mounting Kernel Trace File System...
[   19.482630] systemd[1]: Starting Journal Service...
[   19.487992] systemd[1]: Starting Elastic Fabric Adapter Configuration...
[   19.494597] systemd[1]: Starting Set the console keyboard layout...
[   19.500612] systemd[1]: Starting Create list of static device nodes for the current kernel...
[   19.509263] systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[   19.517388] systemd[1]: Condition check resulted in Load Kernel Module drm being skipped.
[   19.523705] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
[   19.531584] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
[   19.538210] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
[   19.546951] systemd[1]: Starting Load Kernel Modules...
[   19.553099] systemd[1]: Starting Remount Root and Kernel File Systems...
[   19.559352] systemd[1]: Starting udev Coldplug all Devices...
[   19.563520] EXT4-fs (nvme0n1p1): re-mounted. Opts: discard. Quota mode: none.
[   19.566662] systemd[1]: Started Journal Service.
[   19.577529] IPMI message handler: version 39.2
[   19.584597] ipmi device interface
[   19.589723] systemd-journald[1342]: Received client request to flush runtime journal.
[   19.770073] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   19.775111] nvidia-uvm: Loaded the UVM driver, major device number 509.
[   20.285935] parport_pc 00:03: reported by Plug and Play ACPI
[   20.391337] ppdev: user-space parallel port driver
[   20.411899] efa 0000:10:1b.0: Setup irq:0x0000000014e0030e vector:457 name:efa-mgmnt@pci:0000:10:1b.0
[   20.417953] efa 0000:10:1b.0 efa_0: IB device registered
[   20.533296] efa 0000:20:1b.0: Setup irq:0x00000000cd6abbb1 vector:458 name:efa-mgmnt@pci:0000:20:1b.0
[   20.552824] efa 0000:20:1b.0 efa_1: IB device registered
[   20.664401] efa 0000:90:1b.0: Setup irq:0x00000000b2317bb1 vector:459 name:efa-mgmnt@pci:0000:90:1b.0
[   20.667181] efa 0000:90:1b.0 efa_2: IB device registered
[   20.776003] efa 0000:a0:1b.0: Setup irq:0x00000000033d5165 vector:460 name:efa-mgmnt@pci:0000:a0:1b.0
[   20.778647] efa 0000:a0:1b.0 efa_3: IB device registered
[   20.821481] Loading iSCSI transport class v2.0-870.
[   20.859415] iscsi: registered transport (iser)
[   22.065010] alua: device handler registered
[   22.066305] emc: device handler registered
[   22.068331] rdac: device handler registered
[   22.116965] loop0: detected capacity change from 0 to 51152
[   22.272003] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[   22.275133] XFS (nvme1n1p1): Mounting V5 Filesystem
[   22.368650] XFS (nvme1n1p1): Ending clean mount
[   22.388749] xfs filesystem being mounted at /var/tmp supports timestamps until 2038 (0x7fffffff)
[   22.423946] loop1: detected capacity change from 0 to 113792
[   22.579808] loop2: detected capacity change from 0 to 137712
[   22.651800] loop3: detected capacity change from 0 to 138880
[   22.707800] loop4: detected capacity change from 0 to 51416
[   22.755789] loop5: detected capacity change from 0 to 96176
[   22.907796] loop6: detected capacity change from 0 to 113736
[   22.939773] loop7: detected capacity change from 0 to 91496
[   23.139816] loop8: detected capacity change from 0 to 126824
[   23.527831] loop9: detected capacity change from 0 to 126888
[   23.694332] bpfilter: Loaded bpfilter_umh pid 2006
[   23.694557] Started bpfilter
[   23.696898] audit: type=1400 audit(1658992222.551:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1997 comm="apparmor_parser"
[   23.697214] audit: type=1400 audit(1658992222.551:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1999 comm="apparmor_parser"
[   23.697221] audit: type=1400 audit(1658992222.551:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1999 comm="apparmor_parser"
[   23.697718] audit: type=1400 audit(1658992222.551:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=2001 comm="apparmor_parser"
[   23.697725] audit: type=1400 audit(1658992222.551:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=2001 comm="apparmor_parser"
[   23.697731] audit: type=1400 audit(1658992222.551:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=2001 comm="apparmor_parser"
[   23.700007] audit: type=1400 audit(1658992222.555:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=2003 comm="apparmor_parser"
[   23.700492] audit: type=1400 audit(1658992222.555:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1995 comm="apparmor_parser"
[   23.700499] audit: type=1400 audit(1658992222.555:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1995 comm="apparmor_parser"
[   23.702866] audit: type=1400 audit(1658992222.555:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=2000 comm="apparmor_parser"
[   31.532159] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[   31.644359] kauditd_printk_skb: 33 callbacks suppressed
[   31.644362] audit: type=1400 audit(1658992230.499:45): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/usr/share/sssd/cfg_rules.ini" pid=2317 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   31.789641] audit: type=1400 audit(1658992230.643:46): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/passwd" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.790121] audit: type=1400 audit(1658992230.643:47): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/passwd" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.797319] audit: type=1400 audit(1658992230.651:48): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/group" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.797374] audit: type=1400 audit(1658992230.651:49): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/group" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.802890] audit: type=1400 audit(1658992230.655:50): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/initgroups" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.802895] audit: type=1400 audit(1658992230.655:51): apparmor="ALLOWED" operation="file_lock" profile="/usr/sbin/sssd" name="/var/lib/sss/mc/initgroups" pid=2494 comm="sssd_nss" requested_mask="k" denied_mask="k" fsuid=0 ouid=0
[   31.900127] aufs 5.x-rcN-20210809
[   32.159729] audit: type=1400 audit(1658992231.015:52): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/2579/cmdline" pid=2494 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   32.426004] Lustre: Lustre: Build Version: 2.10.8
[   32.566500] LNet: 2732:0:(config.c:1637:lnet_inet_enumerate()) lnet: Ignoring interface ens33: it's down
[   32.566738] LNet: Added LNI 10.216.181.207@tcp [8/256/0/180]
[   32.566783] LNet: Accept secure, port 988
[   32.607391] audit: type=1400 audit(1658992231.459:53): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=2946 comm="apparmor_parser"
[   32.836335] Lustre: ggazbbmv: root_squash is set to 65534:65534
[   32.992725] Lustre: ggazbbmv: nosquash_nids set to 10.216.139.147@tcp 10.216.139.89@tcp 10.216.139.62@tcp 10.216.138.45@tcp 10.216.139.166@tcp 10.216.139.161@tcp 10.216.143.143@tcp 10.216.140.188@tcp 10.216.139.205@tcp *@tcp1 0@lo
[   33.096625] audit: type=1400 audit(1658992231.951:54): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/3044/cmdline" pid=2494 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   33.378933] EXT4-fs (md127): mounted filesystem without journal. Opts: (null). Quota mode: none.
[   33.411942] Lustre: Mounted ggazbbmv-client
[   33.530008] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   33.533601] Bridge firewalling registered
[   33.585962] Initializing XFRM netlink socket
[   35.323901] loop10: detected capacity change from 0 to 8
[   35.378685] docker0: port 1(veth8012af0) entered blocking state
[   35.378688] docker0: port 1(veth8012af0) entered disabled state
[   35.378749] device veth8012af0 entered promiscuous mode
[   35.419367] nvidia-nvswitch0: open (major=511)
[   35.419642] nvidia-nvswitch1: open (major=511)
[   35.419833] nvidia-nvswitch2: open (major=511)
[   35.420018] nvidia-nvswitch3: open (major=511)
[   35.420198] nvidia-nvswitch4: open (major=511)
[   35.420391] nvidia-nvswitch5: open (major=511)
[   35.635327] docker0: port 1(veth8012af0) entered disabled state
[   35.636652] device veth8012af0 left promiscuous mode
[   35.636656] docker0: port 1(veth8012af0) entered disabled state
[   37.137797] kauditd_printk_skb: 30 callbacks suppressed
[   37.137802] audit: type=1400 audit(1658992235.991:85): apparmor="DENIED" operation="ptrace" profile="/snap/snapd/16292/usr/lib/snapd/snap-confine" pid=3681 comm="ps" requested_mask="readby" denied_mask="readby" peer="snap.amazon-ssm-agent.amazon-ssm-agent"
[   37.146049] audit: type=1400 audit(1658992235.999:86): apparmor="DENIED" operation="ptrace" profile="/snap/snapd/16292/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3681 comm="ps" requested_mask="readby" denied_mask="readby" peer="snap.amazon-ssm-agent.amazon-ssm-agent"
[   37.146059] audit: type=1400 audit(1658992235.999:87): apparmor="DENIED" operation="ptrace" profile="snap-update-ns.lxd" pid=3681 comm="ps" requested_mask="readby" denied_mask="readby" peer="snap.amazon-ssm-agent.amazon-ssm-agent"
[   42.786983] audit: type=1400 audit(1658992243.003:88): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/3818/cmdline" pid=2494 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=91223
[   47.655162] nvidia-modeset: ERROR: Failed to find GPU ID
[   58.797258] nvidia-nvlink: nvlink driver open
[   58.797263] nvidia-nvlink: nvlink driver close
[   58.797265] nvidia-nvlink: nvlink driver open
[   70.100775] nvidia-nvswitch0: open (major=511)
[   70.100804] nvidia-nvswitch0: open (major=511)
[   70.100814] nvidia-nvswitch0: open (major=511)
[   70.119720] nvidia-nvswitch0: open (major=511)
[   70.119748] nvidia-nvswitch1: open (major=511)
[   70.119766] nvidia-nvswitch1: open (major=511)
[   70.119775] nvidia-nvswitch1: open (major=511)
[   70.119784] nvidia-nvswitch1: open (major=511)
[   70.119794] nvidia-nvswitch2: open (major=511)
[   70.119803] nvidia-nvswitch2: open (major=511)
[   70.119812] nvidia-nvswitch2: open (major=511)
[   70.119822] nvidia-nvswitch2: open (major=511)
[   70.119832] nvidia-nvswitch3: open (major=511)
[   70.119842] nvidia-nvswitch3: open (major=511)
[   70.119851] nvidia-nvswitch3: open (major=511)
[   70.119860] nvidia-nvswitch3: open (major=511)
[   70.119881] nvidia-nvswitch4: open (major=511)
[   70.119884] nvidia-nvswitch4: open (major=511)
[   70.119888] nvidia-nvswitch4: open (major=511)
[   70.119891] nvidia-nvswitch4: open (major=511)
[   70.119895] nvidia-nvswitch5: open (major=511)
[   70.119898] nvidia-nvswitch5: open (major=511)
[   70.119901] nvidia-nvswitch5: open (major=511)
[   70.119904] nvidia-nvswitch5: open (major=511)
[   88.547686] python[4034]: segfault at 9 ip 00007fb57226fa24 sp 00007fb483ffede0 error 4 in libc-2.31.so[7fb5721fc000+178000]
[   88.547697] Code: c9 0f 11 4b 20 48 89 ee 66 48 0f 6e c0 48 83 ce 01 0f 16 44 24 08 48 89 73 08 0f 11 43 10 49 89 2c 24 48 85 d2 74 8f 48 89 d3 <48> 8b 43 08 89 c2 c1 ea 04 83 ea 02 49 8d 54 d7 10 49 39 d5 0f 85
[   88.561612] python[4041]: segfault at 9 ip 00007f00a78d7a24 sp 00007effb8268de0 error 4 in libc-2.31.so[7f00a7864000+178000]
[   88.561623] Code: c9 0f 11 4b 20 48 89 ee 66 48 0f 6e c0 48 83 ce 01 0f 16 44 24 08 48 89 73 08 0f 11 43 10 49 89 2c 24 48 85 d2 74 8f 48 89 d3 <48> 8b 43 08 89 c2 c1 ea 04 83 ea 02 49 8d 54 d7 10 49 39 d5 0f 85
[   97.583987] docker0: port 1(vetha1e8df7) entered blocking state
[   97.583993] docker0: port 1(vetha1e8df7) entered disabled state
[   97.584054] device vetha1e8df7 entered promiscuous mode
[   97.637097] docker0: port 1(vetha1e8df7) entered disabled state
[   97.638089] device vetha1e8df7 left promiscuous mode
[   97.638092] docker0: port 1(vetha1e8df7) entered disabled state
[  139.870984] audit: type=1400 audit(1658992340.091:89): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4331/cmdline" pid=2498 comm="sssd_sudo" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  139.875658] audit: type=1400 audit(1658992340.095:90): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4331/cmdline" pid=2494 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  139.900535] audit: type=1400 audit(1658992340.123:91): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4331/cmdline" pid=2495 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  155.035263] audit: type=1400 audit(1658992355.255:92): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4340/cmdline" pid=2498 comm="sssd_sudo" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  155.038672] audit: type=1400 audit(1658992355.259:93): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4340/cmdline" pid=2494 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[  155.045388] audit: type=1400 audit(1658992355.267:94): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/4340/cmdline" pid=2495 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
taruntandon88 commented 1 year ago

Reading through the libfabric code for this error, it looks to me like the max locked memory limit set by the installer is not being honored. The EFA installation created a file, /etc/security/limits.d/efa.conf, but whenever I run ulimit -l, I still get the system default value of 64.

I went ahead and modified /etc/systemd/system.conf and /etc/systemd/user.conf to add DefaultLimitMEMLOCK=107374182400. After a reboot, ulimit -l showed 104857600, and I was subsequently able to run some of these tests.

Is there a reason why the configuration in /etc/security/limits.d/efa.conf is not getting honored?
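
For anyone who wants to confirm the limit from inside the training process rather than from a login shell, here is a minimal sketch using Python's standard `resource` module. The helper names `format_limit` and `memlock_limits` are made up for illustration. It assumes Linux, where `RLIMIT_MEMLOCK` is reported in bytes while bash's `ulimit -l` prints units of 1024 bytes.

```python
import resource


def format_limit(value):
    """Render an RLIMIT_MEMLOCK value (bytes) in the units `ulimit -l` prints (KiB)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # A soft limit of 64 here would match the default mentioned above.
    print(f"max locked memory: soft={soft} hard={hard}")
```

With the same conversion, `format_limit(107374182400)` gives `104857600`, which matches the `ulimit -l` value seen after setting `DefaultLimitMEMLOCK`. Running the script under the same launcher as the training job (ssh, systemd service, container) shows which limit that process actually inherits.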

wzamazon commented 1 year ago

Is there any other file under /etc/security/limits.d that has higher priority than efa.conf?

taruntandon88 commented 1 year ago

No. This is the only file under /etc/security/limits.d

rashikakheria commented 1 year ago

Did you log out and log back in after installing the EFA installer?

taruntandon88 commented 1 year ago

Yes. We reboot the machine after the installation is completed.

wzamazon commented 1 year ago

Hi,

Can you let us know what OS you are using?

taruntandon88 commented 1 year ago

Ubuntu 20.04 with CIS benchmarks applied for security hardening.

wzamazon commented 1 year ago

Can you check whether you have a line like

session required pam_limits.so

in /etc/pam.d/system-auth ?
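
A rough way to run that check is sketched below. The helper names and the regular expression are illustrative, not part of any EFA tooling. It looks for an uncommented `session ... pam_limits.so` entry in a PAM config file. The list of paths is an assumption: `/etc/pam.d/system-auth` comes from the question above, while Debian-family systems such as Ubuntu usually keep the session stack in files like `/etc/pam.d/common-session`.

```python
import re
from pathlib import Path

# Matches an active (uncommented) session entry that loads pam_limits.so.
# The control field may be a single word ("required") or a bracketed list.
PAM_LIMITS_RE = re.compile(
    r"^\s*-?session\s+(?:\[[^\]]*\]|\S+)\s+(?:\S*/)?pam_limits\.so\b"
)


def has_pam_limits(text):
    """Return True if any line of a PAM config enables pam_limits.so for sessions."""
    return any(PAM_LIMITS_RE.match(line) for line in text.splitlines())


def check_file(path):
    """Return True/False for an existing file, or None if the file is absent."""
    p = Path(path)
    if not p.is_file():
        return None
    return has_pam_limits(p.read_text())


if __name__ == "__main__":
    # Candidate paths are assumptions; adjust them for the distribution in use.
    for candidate in ("/etc/pam.d/system-auth", "/etc/pam.d/common-session"):
        print(candidate, check_file(candidate))
```

A result of `None` means the file does not exist on that machine, which is different from the file existing without a `pam_limits.so` entry.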

rashikakheria commented 1 year ago

@taruntandon88 Any updates on Wei's question?