NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Cannot enable P2P communication in RTX3090 server #1127

Open leewxgit opened 9 months ago

leewxgit commented 9 months ago

Hello, I am benchmarking the training performance of various DNN models on a single server with 8 RTX 3090 GPUs, interconnected over PCIe 3.0 x16. During my training experiments I noticed significant communication overhead. After investigating, I suspect the reason is that P2P communication between the GPUs cannot be enabled.

I am wondering: Does the RTX 3090 support P2P communication? Can you provide a definitive answer? I attempted to search online, and it seems that the RTX 4090 does not support P2P. However, I’m uncertain about the RTX 3090.

If the RTX 3090 does support P2P, how can I enable it?

The output of ./p2pBandwidthLatencyTest:

Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 1b, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 1c, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 1d, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 3090, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA GeForce RTX 3090, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA GeForce RTX 3090, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA GeForce RTX 3090, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA GeForce RTX 3090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=0 CANNOT Access Peer Device=4
Device=0 CANNOT Access Peer Device=5
Device=0 CANNOT Access Peer Device=6
Device=0 CANNOT Access Peer Device=7
Device=1 CANNOT Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=1 CANNOT Access Peer Device=4
Device=1 CANNOT Access Peer Device=5
Device=1 CANNOT Access Peer Device=6
Device=1 CANNOT Access Peer Device=7
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=4
Device=2 CANNOT Access Peer Device=5
Device=2 CANNOT Access Peer Device=6
Device=2 CANNOT Access Peer Device=7
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CANNOT Access Peer Device=2
Device=3 CANNOT Access Peer Device=4
Device=3 CANNOT Access Peer Device=5
Device=3 CANNOT Access Peer Device=6
Device=3 CANNOT Access Peer Device=7
Device=4 CANNOT Access Peer Device=0
Device=4 CANNOT Access Peer Device=1
Device=4 CANNOT Access Peer Device=2
Device=4 CANNOT Access Peer Device=3
Device=4 CANNOT Access Peer Device=5
Device=4 CANNOT Access Peer Device=6
Device=4 CANNOT Access Peer Device=7
Device=5 CANNOT Access Peer Device=0
Device=5 CANNOT Access Peer Device=1
Device=5 CANNOT Access Peer Device=2
Device=5 CANNOT Access Peer Device=3
Device=5 CANNOT Access Peer Device=4
Device=5 CANNOT Access Peer Device=6
Device=5 CANNOT Access Peer Device=7
Device=6 CANNOT Access Peer Device=0
Device=6 CANNOT Access Peer Device=1
Device=6 CANNOT Access Peer Device=2
Device=6 CANNOT Access Peer Device=3
Device=6 CANNOT Access Peer Device=4
Device=6 CANNOT Access Peer Device=5
Device=6 CANNOT Access Peer Device=7
Device=7 CANNOT Access Peer Device=0
Device=7 CANNOT Access Peer Device=1
Device=7 CANNOT Access Peer Device=2
Device=7 CANNOT Access Peer Device=3
Device=7 CANNOT Access Peer Device=4
Device=7 CANNOT Access Peer Device=5
Device=7 CANNOT Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     0     0     0     0     0     0     0
     1       0     1     0     0     0     0     0     0
     2       0     0     1     0     0     0     0     0
     3       0     0     0     1     0     0     0     0
     4       0     0     0     0     1     0     0     0
     5       0     0     0     0     0     1     0     0
     6       0     0     0     0     0     0     1     0
     7       0     0     0     0     0     0     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 831.56   5.76   5.74   5.76   5.83   5.81   5.81   5.79
     1   5.77 830.23   5.76   5.77   5.84   5.83   5.84   5.82
     2   5.78   5.76 832.45   5.78   5.85   5.83   5.84   5.81
     3   5.75   5.75   5.78 832.00   5.84   5.80   5.84   5.80
     4   5.82   5.82   5.84   5.85 833.78   5.76   5.76   5.76
     5   5.83   5.83   5.83   5.85   5.78 832.45   5.77   5.75
     6   5.84   5.83   5.81   5.85   5.78   5.77 832.89   5.76
     7   5.83   5.83   5.83   5.86   5.77   5.76   5.78 830.68
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 829.79   5.75   5.77   5.77   5.83   5.81   5.85   5.84
     1   5.78 832.45   5.75   5.76   5.84   5.83   5.82   5.84
     2   5.78   5.76 830.68   5.77   5.85   5.83   5.85   5.82
     3   5.76   5.77   5.77 832.00   5.84   5.83   5.83   5.83
     4   5.82   5.83   5.85   5.84 832.89   5.76   5.75   5.77
     5   5.83   5.82   5.83   5.84   5.78 833.78   5.76   5.76
     6   5.82   5.82   5.82   5.86   5.78   5.78 832.89   5.77
     7   5.82   5.83   5.84   5.84   5.77   5.77   5.78 832.00
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 837.58   6.13   6.12   6.13   8.57   8.58   8.60   8.59
     1   6.11 839.15   6.12   6.11   8.53   8.57   8.60   8.52
     2   6.13   6.12 837.35   6.12   8.63   8.59   8.58   8.50
     3   6.11   6.11   6.11 837.58   8.57   8.60   8.54   8.55
     4   8.57   8.57   8.61   8.52 838.70   6.12   6.12   6.12
     5   8.60   8.56   8.57   8.53   6.12 838.48   6.12   6.13
     6   8.57   8.60   8.62   8.52   6.13   6.13 838.70   6.13
     7   8.56   8.51   8.52   8.52   6.12   6.13   6.11 838.70
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 838.25   6.11   6.13   6.11   8.59   8.57   8.63   8.59
     1   6.11 838.25   6.12   6.11   8.58   8.56   8.51   8.58
     2   6.13   6.12 839.15   6.12   8.57   8.53   8.59   8.57
     3   6.11   6.12   6.11 838.48   8.57   8.56   8.53   8.54
     4   8.58   8.61   8.61   8.58 837.58   6.12   6.13   6.13
     5   8.58   8.55   8.56   8.51   6.13 838.70   6.12   6.13
     6   8.59   8.57   8.60   8.53   6.12   6.12 838.70   6.14
     7   8.53   8.55   8.52   8.51   6.11   6.13   6.11 838.70
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   1.59  15.33  12.53  14.32  11.99  14.87  13.17  15.05
     1  20.72   1.82  12.84  17.12  16.02  12.11  15.88  15.51
     2  11.73  17.40   1.66  15.80  12.91  12.62  13.04  17.11
     3  13.79  15.30  15.33   1.63  13.27  11.61  16.18  17.27
     4  13.87  16.41  17.14  12.52   1.56  12.54  14.50  17.29
     5  11.83  15.83  16.80  14.42  14.19   1.50  15.26  13.29
     6  12.10  17.59  15.76  16.73  12.95  13.73   1.56  16.53
     7  14.00  14.00  16.33  12.67  14.35  16.49  16.05   1.57

   CPU     0      1      2      3      4      5      6      7
     0   2.34   7.26   7.02   7.00   6.99   6.94   7.38   6.78
     1   7.19   2.34   7.11   6.83   7.04   6.83   6.81   6.69
     2   7.13   6.95   2.39   7.08   6.94   6.85   6.91   6.82
     3   7.31   6.84   7.41   2.29   7.13   7.13   6.94   6.84
     4   7.06   6.91   7.35   6.91   2.30   6.89   7.14   6.87
     5   7.63   6.82   6.83   6.79   6.88   2.21   6.92   6.76
     6   7.19   6.76   6.97   6.83   6.91   6.94   2.21   6.88
     7   7.03   6.81   6.89   6.85   7.16   7.04   6.93   2.24
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   1.59  16.72  17.86  15.76  11.82  17.38  17.40  16.32
     1  14.06   1.86  14.09  12.25  11.67  16.63  13.04  12.84
     2  11.61  16.80   1.66  16.81  12.15  12.35  16.14  16.98
     3  12.28  13.40  12.21   1.63  11.73  13.88  12.32  12.23
     4  11.70  16.70  17.26  15.54   1.62  17.68  14.96  15.60
     5  11.53  16.89  16.58  15.75  16.24   1.51  17.04  16.83
     6  13.69  12.18  12.55  13.43  13.74  17.89   1.55  13.49
     7  13.49  16.30  16.34  13.33  13.20  13.25  16.56   1.57

   CPU     0      1      2      3      4      5      6      7
     0   2.30   7.10   7.01   7.00   6.92   6.87   6.90   6.75
     1   7.11   2.27   7.12   6.87   6.88   6.79   6.80   6.99
     2   7.15   6.92   2.24   6.89   6.94   6.91   7.20   7.08
     3   7.26   6.89   6.90   2.22   7.27   6.84   6.91   6.82
     4   7.46   6.86   6.88   6.86   2.28   6.90   6.92   6.79
     5   6.91   6.77   6.86   7.00   6.92   2.19   7.28   6.81
     6   7.03   6.86   6.86   6.87   7.14   6.85   2.24   6.87
     7   6.95   6.80   6.84   6.79   6.86   7.02   7.17   2.23

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
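
One way to read the matrices above: if P2P were actually in effect, the off-diagonal entries of the P2P=Enabled bandwidth matrix would normally be expected to differ noticeably from the P2P=Disabled ones. The following is a minimal Python sketch, not part of the CUDA samples; the helper names `parse_matrix`, `off_diagonal_mean`, and `peer_pairs` are made up for illustration. It parses the printed matrices and compares them, using a 4x4 excerpt (devices 0 to 3) of the connectivity and unidirectional bandwidth numbers above.

```python
# Hypothetical helpers for reading p2pBandwidthLatencyTest text output.
# The sample data is a 4x4 excerpt (devices 0-3) copied from the output above.

CONNECTIVITY = r"""
     D\D     0     1     2     3
     0       1     0     0     0
     1       0     1     0     0
     2       0     0     1     0
     3       0     0     0     1
"""

DISABLED = r"""
   D\D     0      1      2      3
     0 831.56   5.76   5.74   5.76
     1   5.77 830.23   5.76   5.77
     2   5.78   5.76 832.45   5.78
     3   5.75   5.75   5.78 832.00
"""

ENABLED = r"""
   D\D     0      1      2      3
     0 829.79   5.75   5.77   5.77
     1   5.78 832.45   5.75   5.76
     2   5.78   5.76 830.68   5.77
     3   5.76   5.77   5.77 832.00
"""


def parse_matrix(text):
    """Parse a printed matrix into a list of rows of floats.

    The header row (first token 'D\\D', 'GPU' or 'CPU') is skipped, and the
    leading row index on each data row is dropped.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue
        rows.append([float(v) for v in parts[1:]])
    return rows


def off_diagonal_mean(matrix):
    """Mean of all device-to-other-device entries (diagonal excluded)."""
    vals = [v for i, row in enumerate(matrix) for j, v in enumerate(row) if i != j]
    return sum(vals) / len(vals)


def peer_pairs(connectivity):
    """List (src, dst) pairs that the connectivity matrix marks with 1."""
    return [
        (i, j)
        for i, row in enumerate(connectivity)
        for j, v in enumerate(row)
        if i != j and v == 1
    ]


ratio = off_diagonal_mean(parse_matrix(ENABLED)) / off_diagonal_mean(parse_matrix(DISABLED))
print(f"enabled/disabled off-diagonal bandwidth ratio: {ratio:.3f}")
print("pairs with peer access:", peer_pairs(parse_matrix(CONNECTIVITY)))
```

On this excerpt the ratio is about 1.00 and no pair has peer access, which is consistent with the "CANNOT Access Peer" lines: the "P2P=Enabled" runs are falling back to the normal memcopy path.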

I have disabled all ACS control flags (ACSCtl). The output of sudo lspci -vvv | grep ACSCtl:

        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
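
Checking 25 lines of flags by eye is error-prone. In lspci output a trailing `-` means the flag is cleared and a trailing `+` means it is set, so the check can be scripted. The sketch below is a hypothetical helper (the name `acs_enabled_flags` is made up for illustration) that reports any flag still set. The first sample line is copied from the output above; the second, with `SrcValid+`, is an invented counterexample used only to show what a non-clean result looks like.

```python
def acs_enabled_flags(lspci_output):
    """For each ACSCtl line, return the list of flags that are set ('+')."""
    result = []
    for line in lspci_output.splitlines():
        line = line.strip()
        if not line.startswith("ACSCtl:"):
            continue
        flags = line[len("ACSCtl:"):].split()
        # 'SrcValid+' -> 'SrcValid'; flags ending in '-' are cleared and skipped.
        result.append([f[:-1] for f in flags if f.endswith("+")])
    return result


# Line as posted above: every flag is cleared.
POSTED = "        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-"
# Invented counterexample (NOT from the original post): SrcValid still set.
HYPOTHETICAL = "        ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-"

report = acs_enabled_flags(POSTED + "\n" + HYPOTHETICAL)
for idx, enabled in enumerate(report):
    status = "clean" if not enabled else "still set: " + ", ".join(enabled)
    print(f"ACSCtl line {idx}: {status}")
```

To use it on a real machine, the text passed in would be the output of the `sudo lspci -vvv | grep ACSCtl` command shown above. A clean result only rules out ACS as the cause; it does not by itself show that the GPUs or the driver allow peer access.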

samsja commented 7 months ago

Hey, did you find a solution to your problem?

leewxgit commented 7 months ago

Hi, I think the answer is that the RTX 3090 does not support P2P communication. Because it is a hardware limitation, there is no way to enable P2P on the RTX 3090.

samsja commented 7 months ago

This is very weird. I have seen a lot of setups where the 3090 supports P2P, for example: https://forums.developer.nvidia.com/t/parallel-training-with-4-cards-4090-cannot-be-performed-on-amd-5975wx-stuck-at-the-beginning/237813/10

It seems that only the 4090 does not support it.

leewxgit commented 7 months ago

Thanks for the information. The 4090 certainly does not support it. I am not sure whether the 3090 does, which is why I raised this issue. In my case, I cannot find any way to enable P2P on the 3090.

samsja commented 7 months ago

I am almost sure that the 3090 supports it, though I am also having trouble getting it to work 😅

leewxgit commented 7 months ago

😂

samsja commented 7 months ago

Okay, I think I found a way to fix the problem.

I downgraded the driver from 545 to 535 on my Ubuntu machine, and now I no longer have the issue.
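
For anyone comparing against this report, the installed driver branch can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The sketch below wraps that call in Python; the helper names `driver_major` and `installed_driver_versions` are made up for illustration, and the version string in the last line is only an example of the format. Whether a given branch affects P2P on other machines is an assumption based on this single report, not something the script can confirm.

```python
import shutil
import subprocess


def driver_major(version):
    """Return the major branch of a driver version string, e.g. '535.x.y' -> 535."""
    return int(version.strip().split(".")[0])


def installed_driver_versions():
    """Return one driver version string per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]


for version in installed_driver_versions():
    print(f"driver {version} -> branch {driver_major(version)}")

# Example of the parsing step on an illustrative version string.
print(driver_major("535.129.03"))
```

On a machine without `nvidia-smi` on the PATH, the loop prints nothing and only the final parsing example runs.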

leewxgit commented 7 months ago

Excellent! I think I already tried something like this before, but I will try it again. 😂