NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License
898 stars 144 forks

Why is D2H relatively slower? #269

Closed yuxuanliuuu closed 1 year ago

yuxuanliuuu commented 1 year ago

Hi, 

I am trying to understand this line in the README.

Slow D-H, because the GPU BAR, which backs the mappings, can't be prefetched and so burst reads transactions are not generated through PCIE

I have 2 questions.

Why can't the GPU BAR be prefetched? As far as I know, a BAR can be an I/O resource or a memory resource. BARs should be marked non-prefetchable when they expose device control registers. However, since the GPU BAR is backed by GPU memory, why can't it be marked prefetchable?

In the README, no prefetching => no PCIe burst read transactions. Why does the former lead to the latter?

I am curious about the reasons and would be more than happy if you could answer my questions. Thank you.

drossetti commented 1 year ago

That text is quite old and is incorrect for coherent platforms, e.g. GH200 and POWER9+V100, where the GPU memory can be cached by the CPU. In that case, entire cache lines are exchanged and good performance is obtained.

Why can't the GPU BAR be prefetched?

On non-coherent platforms, the CPU can create MMIO mappings of the GPU BAR. Prefetching per se applies to cache-coherent platforms, not to MMIO (UC) mappings. Current CPU ISAs do not support reads or writes in granules larger than 8B/16B when targeting MMIO mappings, even when using vector extensions. On x86 platforms, Write-Combining (WC) mappings can improve write bandwidth, but they do not help read performance.

Note that here I refer to CPU loads and stores, i.e. reads and writes generated by CPU instructions, which are interesting because of their low-latency properties. Besides that, there can be DMA engines, similar to the GPU Copy Engines, which provide full bandwidth in exchange for higher latency.

In the README, no prefetching => no PCIe burst read transactions. Why does the former lead to the latter?

That is an experimental statement. With cached mappings, loads and stores trigger read/write transactions of entire cache lines, leading to large PCIe transactions. As of today, PCIe does not support a cache coherency protocol, so on non-coherent platforms the CPU mappings of the GPU BAR cannot be cached (C). The best you can do there is to use MMIO (UC) or WC mappings.

I hope that helps.

yuxuanliuuu commented 1 year ago

You have already solved my problem. Thanks a lot for your response!!!