ComputationalRadiationPhysics / jungfrau-photoncounter

Conversion of Jungfrau pixel detector data to photon count rate
GNU General Public License v3.0

GPU Bandwidth #18

Closed: TheFl0w closed this issue 6 years ago

TheFl0w commented 7 years ago

When we ran the memory bandwidth test on your NVIDIA TITAN Black at PSI, we got some unexpected results. If I remember correctly, we measured about 6000 MiB/s for data transfer between host and GPU. PCIe 3.0 should actually give us twice that bandwidth. I ran the same tests on the GPU I use at home (GTX 780) to find out whether consumer-grade GPUs are more limited when it comes to data transfer rates. It turned out that data transfer for my card is as fast as for the Tesla K80x cards we use in our HPC cluster. Can you post the results for your TITAN Black, please?

Here is the output of the bandwidth test:

Pinned memory (page-locked)

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 780
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12172.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12454.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     213145.1

Pageable memory (regular allocation)

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 780
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6657.6

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6474.5

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     212451.2
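
For comparison, here is a minimal sketch (not the official CUDA `bandwidthTest` sample, just an illustrative reimplementation with a hypothetical `timedCopy` helper) of how the pageable-vs-pinned host-to-device comparison above can be reproduced with the CUDA runtime API; the 32 MiB transfer size matches the runs above:

```cpp
// Sketch: compare host-to-device throughput for pageable vs. pinned buffers.
// Compile with: nvcc bandwidth_sketch.cu -o bandwidth_sketch
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times a single cudaMemcpy with CUDA events and returns the duration in ms.
// (A real benchmark would average over many repetitions.)
static float timedCopy(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 32u << 20; // 32 MiB, same as the runs above

    void* dDst = nullptr;
    cudaMalloc(&dDst, bytes);

    // Pageable host buffer (plain malloc)
    void* hPageable = std::malloc(bytes);
    float msPageable = timedCopy(dDst, hPageable, bytes, cudaMemcpyHostToDevice);

    // Pinned host buffer (page-locked, eligible for DMA transfers)
    void* hPinned = nullptr;
    cudaMallocHost(&hPinned, bytes);
    float msPinned = timedCopy(dDst, hPinned, bytes, cudaMemcpyHostToDevice);

    const double mib = bytes / (1024.0 * 1024.0);
    std::printf("pageable H2D: %.1f MiB/s\n", mib / (msPageable / 1000.0));
    std::printf("pinned   H2D: %.1f MiB/s\n", mib / (msPinned / 1000.0));

    cudaFree(dDst);
    cudaFreeHost(hPinned);
    std::free(hPageable);
    return 0;
}
```
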
mbrueckner-psi commented 7 years ago

What's the CPU clock frequency? The server's CPUs (where the TITAN is mounted) are Intel(R) Xeon(R) CPU E5-2680 0 at only 2.70GHz.

This is our output:

[l_brueckner_m@pc-jungfrau-test bandwidthTest]$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX TITAN Black
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5836.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6533.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     230384.2

[l_brueckner_m@pc-jungfrau-test bandwidthTest]$ ./bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX TITAN Black
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     4026.1

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3891.8

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     231035.7


TheFl0w commented 7 years ago

/proc/cpuinfo says the GPUs on our HPC cluster run alongside 32 cores of type Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz.

The test I did with my GPU at home was done on an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz.

I get the same results in both cases.

TheFl0w commented 7 years ago

As far as I know, CPU clock frequency only matters for pageable memory anyway. If we transfer data from pinned memory, the copy is usually done with DMA, so the CPU is not involved in copying the data.
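
To illustrate that point, a minimal sketch of a pinned-memory transfer issued asynchronously, so the copy is handled by the GPU's copy/DMA engine while the CPU stays free (illustrative only, not code from this project):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32u << 20;

    void* hPinned = nullptr;
    cudaMallocHost(&hPinned, bytes);   // page-locked host allocation

    void* dBuf = nullptr;
    cudaMalloc(&dBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy engine moves the data via DMA.
    cudaMemcpyAsync(dBuf, hPinned, bytes, cudaMemcpyHostToDevice, stream);

    // ... the CPU could do unrelated work here ...

    cudaStreamSynchronize(stream);     // wait for the DMA transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
    return 0;
}
```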

lopez-c commented 7 years ago

Hi, We need to find out the reason why the transfers are so slow. We will keep you up to date.

mbrueckner-psi commented 7 years ago

Hi

lspci -nvvs 27:00.0
[...]
        LnkCap: Port #0, Speed 5GT/s, Width x16, [...]
        LnkSta: Speed 2.5GT/s, Width x16, [...]

dmidecode
[...]
Handle 0x0908, DMI type 9, 17 bytes
System Slot Information
        Designation: PCI-E Slot 8
        Type: x16 PCI Express 3
        Current Usage: In Use
        Length: Long
        ID: 8
        Characteristics:
                3.3 V is provided
                PME signal is supported
        Bus Address: 0000:27:00.0

lspci shows that the card can handle 5 GT/s (LnkCap) but is currently running at only 2.5 GT/s (LnkSta). This is strange, since even 5 GT/s is only PCIe 2.0 (per Wikipedia), while NVIDIA claims the TITAN can do PCIe 3.0.

dmidecode shows that the slot can handle PCIe 3.0 with 16 lanes.

Martin


TheFl0w commented 7 years ago

@mbrueckner-psi To be honest, I have no idea how to fix this. Off the top of my head, I would say:

If you have physical access to the system, maybe check the PSU connectors and reseat the GPU: take it out of the PCIe slot and put it back in.

TheFl0w commented 7 years ago

I would like to gather some additional information about your GPU. Can you run the program I attached and post the results?

benchmark.zip
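
For reference, a minimal sketch of the kind of device query such a benchmark could perform (the actual contents of benchmark.zip are not reproduced here):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Report driver and runtime versions, as in the output below.
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    std::printf("CUDA Driver version: %d\nCUDA Runtime version: %d\n",
                driverVersion, runtimeVersion);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("%s\n", prop.name);
        std::printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory: %.2f MiB\n",
                    prop.totalGlobalMem / (1024.0 * 1024.0));
        std::printf("  DMA engines: %d\n", prop.asyncEngineCount);
        std::printf("  Multi processors: %d\n", prop.multiProcessorCount);
        std::printf("  Warp size: %d\n", prop.warpSize);
        std::printf("  Max concurrent kernels: %d\n", prop.concurrentKernels);
    }
    return 0;
}
```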

lopez-c commented 7 years ago

Hi, This is the result of the benchmark:

CUDA Driver version: 8000
CUDA Runtime version: 8000

Devices:
GeForce GTX TITAN Black
  Compute capability: 3.5
  Global memory: 6082.31 MiB
  DMA engines: 1
  Multi processors: 15
  Warp size: 32
  Max concurrent kernels: 1
  Max grid size: 2147483647, 65535, 65535
  Max block size: 1024, 1024, 64
  Max threads per block: 1024

For some reason that we are still trying to understand, it looks like the link between the CPU and the GPU is PCIe v2.0 instead of PCIe v3.0.

In theory both the GPU and the slot it is connected to are compatible with PCIe 3.0. So yes, a bit strange.
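
If it helps with the investigation, the negotiated link can also be cross-checked from software via NVML (the management library that ships with the driver); a minimal sketch, assuming nvml.h and -lnvidia-ml are available on the machine:

```cpp
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int curGen = 0, maxGen = 0, curWidth = 0, maxWidth = 0;
        // Current vs. maximum supported PCIe generation and lane width.
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);
        std::printf("PCIe link: gen %u of %u, width x%u of x%u\n",
                    curGen, maxGen, curWidth, maxWidth);
    }

    nvmlShutdown();
    return 0;
}
```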

mbrueckner-psi commented 7 years ago

I've tried this:

Edit /etc/modprobe.d/local.conf or create a new file like /etc/modprobe.d/nvidia.conf

and add this

options nvidia NVreg_EnablePCIeGen3=1

but it did not work.

Cheers Aldo


TheFl0w commented 7 years ago

Okay, I have been looking for more possible explanations for our GPU bandwidth problem. A processor supports a certain number of PCIe lanes; in your case the maximum should be 40. However, different motherboard chipsets support different numbers of PCIe lanes, and sometimes the number of lanes available per PCIe slot depends on how many devices are connected. For example: if there are devices plugged into slots 1 and 3, only 8 lanes each are available. I can look into this, but I need to know which motherboard is used, which PCIe slots are occupied, and how many lanes are (theoretically) taken by those devices.

TL;DR: Max # of PCIe lanes could be the issue. For now I would like to know the model of the motherboard.

mbrueckner-psi commented 7 years ago

Hi,

It's the HP ML350P Gen8 server.

The GPU sits in a suitable slot; see the output of lspci and dmidecode above. There are 16 lanes connected to the GPU. As said before: card and mainboard support PCIe 3 (8 GT/s), but the link is only PCIe 2 (5 GT/s). This limits the bandwidth to max. 8 GB/s.
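
(For reference: PCIe 2.0 uses 8b/10b encoding, so an x16 link at 5 GT/s carries 5 GT/s × 16 lanes × 8/10 = 64 Gbit/s = 8 GB/s of payload per direction; the ~6 GB/s measured earlier is roughly what such a link achieves in practice, while a PCIe 3.0 x16 link would offer about 15.75 GB/s.)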


TheFl0w commented 7 years ago

This is what the manual says about the expansion slots.

| Expansion Slot # | Technology | Bus Width | Connector Width | Bus Number | Form Factor | Notes |
|---|---|---|---|---|---|---|
| 9 | PCIe 3.0 | x4 | x8 | 32 | Full Length / Height | For processor 2 |
| 8 | PCIe 3.0 | x16 | x16 | 32 | Full Length / Height | For processor 2 |
| 7 | PCIe 3.0 | x4 | x8 | 32 | Full Length / Height | For processor 2 |
| 6 | PCIe 3.0 | x16 | x16 | 32 | Full Length / Height | For processor 2 |
| 5 | PCIe 2.0 | x4 | x8 | 0 | Full Length / Height | For processor 2 |
| 4 | PCIe 3.0 | x4 | x8 | 0 | Full Length / Height | For processor 1 |
| 3 | PCIe 3.0 | x16 | x16 | 0 | Full Length / Height | For processor 1 |
| 2 | PCIe 3.0 | x4 | x8 | 0 | Full Length / Height | For processor 1 |
| 1 | PCIe 3.0 | x8 | x16 | 0 | Full Length / Height | For processor 1 |

dmidecode reported:

Designation: PCI-E Slot 8 
Type: x16 PCI Express 3

Expansion slot 1 has a connector width of x16 while its bus is only x8. Please make sure the card is not plugged into slot 1. If it is, consider placing it in slot 3 instead.
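
One way to double-check which slot the card actually occupies is to compare the PCI address that CUDA reports for the device with the "Bus Address" field in the dmidecode slot entries; a minimal sketch (illustrative only):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Print domain:bus:device (function assumed to be 0), to be matched
        // against e.g. "Bus Address: 0000:27:00.0" in the dmidecode output.
        std::printf("Device %d (%s): %04x:%02x:%02x.0\n",
                    i, prop.name,
                    (unsigned)prop.pciDomainID,
                    (unsigned)prop.pciBusID,
                    (unsigned)prop.pciDeviceID);
    }
    return 0;
}
```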

TheFl0w commented 7 years ago

If Linux labels the PCIe slots correctly, I am out of ideas for now. I will ask around at work tomorrow, maybe this is a common problem.