Open drossetti opened 6 years ago
Thank you for taking a look. Which CPU, GPU and PCIe topology did you test? Can you report copy_to_mapping perf?
Thanks for your response!
CPU - Intel Xeon Silver 4114 (Skylake)

GPU - Tesla P100-PCIE-12GB

CUDA version - 11.4
Here are the `gdr_copy_to_mapping` numbers for AVX512:

gdr_copy_to_mapping num iters for each size: 10000

Test | Size(B) | Avg.Time(us) |
---|---|---|
gdr_copy_to_mapping | 1 | 0.1250 |
gdr_copy_to_mapping | 2 | 0.1245 |
gdr_copy_to_mapping | 4 | 0.1245 |
gdr_copy_to_mapping | 8 | 0.1222 |
gdr_copy_to_mapping | 16 | 0.1263 |
gdr_copy_to_mapping | 32 | 0.1252 |
gdr_copy_to_mapping | 64 | 0.1280 |
gdr_copy_to_mapping | 128 | 0.1376 |
gdr_copy_to_mapping | 256 | 0.1439 |
gdr_copy_to_mapping | 512 | 0.1550 |
gdr_copy_to_mapping | 1024 | 0.1927 |
gdr_copy_to_mapping | 2048 | 0.2631 |
gdr_copy_to_mapping | 4096 | 0.4262 |
gdr_copy_to_mapping | 8192 | 0.8239 |
gdr_copy_to_mapping | 16384 | 1.6179 |
gdr_copy_to_mapping | 32768 | 3.2132 |
gdr_copy_to_mapping | 65536 | 6.4094 |
gdr_copy_to_mapping | 131072 | 12.7935 |
gdr_copy_to_mapping | 262144 | 25.5790 |
gdr_copy_to_mapping | 524288 | 51.1738 |
gdr_copy_to_mapping | 1048576 | 102.2248 |
gdr_copy_to_mapping | 2097152 | 204.4293 |
gdr_copy_to_mapping | 4194304 | 409.7942 |
gdr_copy_to_mapping | 8388608 | 822.7885 |
gdr_copy_to_mapping | 16777216 | 1683.7191 |
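
To read these latencies as bandwidth, here is a minimal sketch that converts a few rows of the table into throughput. It assumes Avg.Time(us) is the time for one copy of the given size and that MB means 10^6 bytes; the helper name `throughput_mb_per_s` is made up for illustration and is not part of gdrcopy.

```python
# A few (size in bytes, average time in microseconds) rows copied from the table above.
SAMPLES = [
    (64, 0.1280),
    (4096, 0.4262),
    (1048576, 102.2248),
    (16777216, 1683.7191),
]


def throughput_mb_per_s(size_bytes, avg_time_us):
    """Return throughput in MB/s (1 byte per microsecond equals 1 MB/s, decimal)."""
    return size_bytes / avg_time_us


if __name__ == "__main__":
    for size, avg_us in SAMPLES:
        rate = throughput_mb_per_s(size, avg_us)
        print(f"{size:>9} B  {avg_us:>10.4f} us  {rate:>10.1f} MB/s")
```

Under those assumptions the small copies are latency-bound (64 B works out to about 500 MB/s), while the multi-megabyte copies settle at roughly 10 GB/s.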
As for the PCIe topology, I'm not sure, but I ran `lspci -tv`:
-+-[0000:d7]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-0e.0 Intel Corporation Device 2058
| +-0e.1 Intel Corporation Device 2059
| +-0f.0 Intel Corporation Device 2058
| +-0f.1 Intel Corporation Device 2059
| +-12.0 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.1 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.2 Intel Corporation Sky Lake-E M3KTI Registers
| +-15.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.0 Intel Corporation Sky Lake-E M2PCI Registers
| \-16.4 Intel Corporation Sky Lake-E M2PCI Registers
+-[0000:ae]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-08.0 Intel Corporation Device 2066
| +-09.0 Intel Corporation Device 2066
| +-0a.0 Intel Corporation Device 2040
| +-0a.1 Intel Corporation Device 2041
| +-0a.2 Intel Corporation Device 2042
| +-0a.3 Intel Corporation Device 2043
| +-0a.4 Intel Corporation Device 2044
| +-0a.5 Intel Corporation Device 2045
| +-0a.6 Intel Corporation Device 2046
| +-0a.7 Intel Corporation Device 2047
| +-0b.0 Intel Corporation Device 2048
| +-0b.1 Intel Corporation Device 2049
| +-0b.2 Intel Corporation Device 204a
| +-0b.3 Intel Corporation Device 204b
| +-0c.0 Intel Corporation Device 2040
| +-0c.1 Intel Corporation Device 2041
| +-0c.2 Intel Corporation Device 2042
| +-0c.3 Intel Corporation Device 2043
| +-0c.4 Intel Corporation Device 2044
| +-0c.5 Intel Corporation Device 2045
| +-0c.6 Intel Corporation Device 2046
| +-0c.7 Intel Corporation Device 2047
| +-0d.0 Intel Corporation Device 2048
| +-0d.1 Intel Corporation Device 2049
| +-0d.2 Intel Corporation Device 204a
| \-0d.3 Intel Corporation Device 204b
+-[0000:85]-+-00.0-[86]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]
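
For anyone reading the tree above, the last line is the relevant one: the GPU sits on bus 86, directly behind the bridge at 00.0 on root bus 85. A rough sketch of pulling that out programmatically follows; the function name `parse_bridge_path` is hypothetical, and the pattern assumes the simple one-bridge layout shown here with a single secondary bus number (not a range like `[86-87]`).

```python
import re

# The GPU line from the `lspci -tv` output above.
GPU_LINE = "+-[0000:85]-+-00.0-[86]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]"

# [domain:root_bus] -+- bridge_devfn -[secondary_bus]- ... device_devfn name
PATTERN = re.compile(
    r"\[(?P<domain>[0-9a-f]{4}):(?P<root_bus>[0-9a-f]{2})\]-\+-"
    r"(?P<bridge>[0-9a-f]{2}\.[0-7])-\[(?P<sec_bus>[0-9a-f]{2})\]-+"
    r"(?P<dev>[0-9a-f]{2}\.[0-7]) (?P<name>.+)"
)


def parse_bridge_path(tree_line):
    """Return the bridge and device addresses from a one-bridge lspci -tv line, or None."""
    m = PATTERN.search(tree_line)
    if m is None:
        return None
    return {
        "bridge": f"{m['domain']}:{m['root_bus']}:{m['bridge']}",
        "device": f"{m['domain']}:{m['sec_bus']}:{m['dev']}",
        "name": m["name"],
    }


if __name__ == "__main__":
    print(parse_bridge_path(GPU_LINE))
```

The tree alone does not show which CPU socket owns root bus 85 or the negotiated link width and speed, so it only partly answers the topology question.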
One caveat is that I probably could've used the `-mavx512vl` compilation flag to use up to 32 ymm registers for both AVX & AVX2, but I didn't. I wonder whether the loop unrolling in the source code should be tweaked if 32 registers are to be leveraged instead of the default 16.
Using AVX-512 based memcpy is a bad idea, in general.
This is how `gdr_copy_from_mapping` performs with AVX512 (in fact, its SSE4.1 version is faster than its AVX version, and the source code prefers it over the AVX version).