moazin opened this issue 2 years ago
@chienwei-lan / @maxzhen, any idea why the P2P test is getting better bandwidth without `CL_MEM_EXT_PTR_XILINX`? This is a potential bug.
Don't you need `CL_MEM_EXT_PTR_XILINX` to pass in `min`, where you can specify the P2P flag to allocate a P2P BO? Otherwise, you'll be allocating a normal BO and performing normal DMA. 3.5 GB/s sounds reasonable for P2P and 10 GB/s sounds reasonable for normal DMA.
Without that EXT_PTR flag there is no P2P, so the data transfer will happen via the host. So let's not consider that.
With the EXT_PTR flag it is P2P; do you have a case where you saw higher bandwidth for P2P before and now see lower bandwidth?
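For context, here is a minimal sketch of the allocation being described above: passing a `cl_mem_ext_ptr_t` (the "min") with `XCL_MEM_EXT_P2P_BUFFER` through the `CL_MEM_EXT_PTR_XILINX` flag is what makes the buffer a P2P BO, while a plain `clCreateBuffer` gives a normal BO. The function names and the `CL_MEM_READ_WRITE` flag are illustrative, not taken from the example.

```cpp
// Hedged sketch: P2P vs. normal buffer allocation with XRT's OpenCL extensions.
// Struct/flag names come from CL/cl_ext_xilinx.h; everything else is illustrative.
#include <CL/cl.h>
#include <CL/cl_ext_xilinx.h>

cl_mem alloc_p2p_buffer(cl_context ctx, size_t bytes, cl_int* err) {
    cl_mem_ext_ptr_t min;                // the ext pointer ("min") mentioned above
    min.flags = XCL_MEM_EXT_P2P_BUFFER;  // ask the runtime for a P2P BO
    min.obj = nullptr;                   // a P2P buffer has no host backing pointer
    min.param = nullptr;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                          bytes, &min, err);
}

cl_mem alloc_normal_buffer(cl_context ctx, size_t bytes, cl_int* err) {
    // Without CL_MEM_EXT_PTR_XILINX this is an ordinary BO, so transfers go
    // through host memory via XDMA rather than peer-to-peer over PCIe.
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, err);
}
```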
@maxzhen So here I've missed `CL_MEM_EXT_PTR_XILINX` and didn't pass in the `min` either. But the code to copy the data remains the same, via P2P, just like it was in the original example I've linked above. Something like:
```cpp
int fd = -1;
OCL_CHECK(err, err = xcl::P2P::getMemObjectFd(madd_in, &fd)); // Export p2p buffer to file descriptor (fd)
if (fd > 0) {
    std::cout << "Import FD:" << fd << std::endl;
}

cl_mem exported_buf;
OCL_CHECK(err, err = xcl::P2P::getMemObjectFromFd(context[0], device_id[0], 0, fd, &exported_buf)); // Import

cl_event event;
OCL_CHECK(err,
          err = clEnqueueCopyBuffer(queue[0], mmult_out, exported_buf, 0, 0, sizeof(data_t) * LENGTH, 0, nullptr,
                                    &event)); // transfer
clWaitForEvents(1, &event);
```
Except that now all these buffers are regular buffers instead of P2P buffers. I'd expect this code to throw an error saying that P2P transfer can't happen if the buffers are not P2P but instead it runs fine.
I think three things are possible:
I think it's (2) that's happening here, but I'm wondering if this code is supposed to work with non-P2P transfers, because it uses `xcl::P2P::getMemObjectFd` and `xcl::P2P::getMemObjectFromFd`, which are supposed to be used for P2P transfers only, I guess?
Also, are there ways to confirm whether the transfer is happening via normal DMA under the hood instead of P2P?
> I'm wondering if this code is supposed to work with non-P2P transfers, because it uses `xcl::P2P::getMemObjectFd` and `xcl::P2P::getMemObjectFromFd`, which are supposed to be used for P2P transfers only, I guess?
This `xcl::P2P::getMemObjectFd` has nothing to do with P2P specifically; the `xcl::P2P` namespace is added inside the host code of the example (https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/common/includes/xcl2/xcl2.hpp#L99). Underneath, it is calling `xclGetMemObjectFd`, which works for any buffer object. I understand that the `xcl::P2P` namespace is misleading, but those APIs have nothing to do with P2P.
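To illustrate the point, here is a sketch of the kind of lookup such a wrapper boils down to (my own illustration of the standard OpenCL extension-function pattern, not the exact xcl2.hpp code): `xclGetMemObjectFd` / `xclGetMemObjectFromFd` are Xilinx extension entry points declared in `CL/cl_ext_xilinx.h`, resolved by name from the platform, and nothing in their signatures is specific to P2P buffers.

```cpp
// Hedged sketch of the extension-function lookup behind xcl::P2P::getMemObjectFd.
// The real xcl2.hpp helper may differ in detail; the point is that the resolved
// entry points take and return plain cl_mem objects, so they work for any buffer.
#include <CL/cl.h>
#include <CL/cl_ext_xilinx.h> // declares xclGetMemObjectFd / xclGetMemObjectFromFd

using getMemObjectFd_fn = decltype(&xclGetMemObjectFd);
using getMemObjectFromFd_fn = decltype(&xclGetMemObjectFromFd);

inline getMemObjectFd_fn resolve_getMemObjectFd(cl_platform_id platform) {
    return reinterpret_cast<getMemObjectFd_fn>(
        clGetExtensionFunctionAddressForPlatform(platform, "xclGetMemObjectFd"));
}

inline getMemObjectFromFd_fn resolve_getMemObjectFromFd(cl_platform_id platform) {
    return reinterpret_cast<getMemObjectFromFd_fn>(
        clGetExtensionFunctionAddressForPlatform(platform, "xclGetMemObjectFromFd"));
}
```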
Thanks a lot @uday610. So I guess using these APIs to do regular DMA transfers is a valid way to do things since the underlying function call has nothing to do with P2P specifically.
Yes, those APIs just get the FD of the buffer, which is not related to the buffer type. I think you can try one thing to prove this. You have enabled P2P on the devices with an xbutil command, right? Something like `xbutil p2p --enable` (I don't know which version of XRT you use, so I don't know whether you have the old xbutil commands or the new ones). Now you can disable P2P on the devices with xbutil, make sure it is disabled with `xbutil query`/`xbutil examine`, and run your test case again.
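For reference, a sketch of that check using the legacy `xbutil p2p` syntax mentioned above (exact sub-commands and report fields vary between XRT releases, so treat this as approximate):

```sh
xbutil query            # look for the P2P enable status in the report
xbutil p2p --disable    # counterpart of the `xbutil p2p --enable` used earlier
                        # (a reboot may be needed for the BAR change to take effect)
xbutil query            # confirm P2P now shows as disabled, then re-run the test
# Newer XRT releases report the same state through `xbutil examine`.
```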
@uday610 You're right. After disabling P2P from xbutil, everything still works and I get the ~10 GB/s bandwidth.
@maxzhen I've one question.
> Don't you need `CL_MEM_EXT_PTR_XILINX` to pass in `min`, where you can specify the P2P flag to allocate a P2P BO? Otherwise, you'll be allocating a normal BO and performing normal DMA. 3.5 GB/s sounds reasonable for P2P and 10 GB/s sounds reasonable for normal DMA.
You say that 10 GB/s is reasonable for normal DMA. If we measure host-to-FPGA bandwidth, it does indeed turn out to be around ~10 GB/s. Here, though, the transfer is fpga1 --> host --> fpga2, so we should expect to get half the bandwidth, since two transfers need to happen. I'm not sure why I'm getting 10 GB/s. Note that this bandwidth is obtained by measuring the time the transfer takes.
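For what it's worth, here is a sketch of how such a timing can be taken around the copy from the snippet earlier in the thread (the helper function and its name are my own framing; one could equally read the OpenCL profiling timestamps from the event instead of using the host clock):

```cpp
// Hedged sketch: time a clEnqueueCopyBuffer with a host-side clock and report GB/s.
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

double measure_copy_bandwidth(cl_command_queue queue, cl_mem src, cl_mem dst,
                              size_t bytes) {
    cl_event event;
    auto t0 = std::chrono::high_resolution_clock::now();
    cl_int err = clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes, 0, nullptr, &event);
    if (err != CL_SUCCESS) return -1.0;
    clWaitForEvents(1, &event);               // the copy is done once the event completes
    auto t1 = std::chrono::high_resolution_clock::now();
    clReleaseEvent(event);
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gbps = (static_cast<double>(bytes) / seconds) / 1e9;
    std::printf("copied %zu bytes in %.4f s -> %.2f GB/s\n", bytes, seconds, gbps);
    return gbps;
}
// e.g. measure_copy_bandwidth(queue[0], mmult_out, exported_buf, sizeof(data_t) * LENGTH);
```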
@uday610 @maxzhen Following up on the previous question.
I've been thinking about this and two things don't make sense to me.
P2P transfer speed really depends on the PCIe topology. If the two devices are under the same switch, it might be faster, if the switch vendor implemented it that way. The more switches you need to hop between the two devices, the slower it gets. If the P2P data transfer needs to go through the root complex, it will be even slower, since host CPU vendors optimize the device-to-memory path; they are not motivated to optimize data transfers that bypass the CPU. The same is probably true for all PCIe switch chip vendors.

If it is not P2P, it is the XDMA that moves data between device and host memory, and this path is normally optimized by all chip vendors down the road. PCIe is a bi-directional bus: both directions operate at the same time, and both can run at the 10 GB/s rate. So you don't see half of the speed.
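To make the full-duplex point concrete, a toy back-of-the-envelope model (my own illustration with assumed numbers, not a measurement): if the two hops had to run one after the other you would see roughly half the link rate, but because the device-to-host read and the host-to-device write can overlap on the bi-directional bus, the effective rate approaches the single-link rate.

```cpp
// Toy model of fpga1 -> host -> fpga2 bandwidth; all numbers are illustrative.
#include <cstdio>

int main() {
    const double link_rate = 10.0; // assumed per-direction DMA rate, GB/s
    const double size_gb   = 1.0;  // assumed transfer size, GB

    // Store-and-forward: the second hop starts only after the first finishes.
    double t_serial  = size_gb / link_rate + size_gb / link_rate; // 0.2 s
    double bw_serial = size_gb / t_serial;                        // 5 GB/s

    // Pipelined: the read from fpga1 and the write to fpga2 overlap in chunks;
    // ignoring pipeline fill/drain, the total time is roughly one hop.
    double t_pipe  = size_gb / link_rate;                         // 0.1 s
    double bw_pipe = size_gb / t_pipe;                            // 10 GB/s

    std::printf("serialized: %.1f GB/s, pipelined: %.1f GB/s\n", bw_serial, bw_pipe);
    return 0;
}
```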
> Without that EXT_PTR flag there is no P2P, so the data transfer will happen via the host. So let's not consider that.
>
> With the EXT_PTR flag it is P2P; do you have a case where you saw higher bandwidth for P2P before and now see lower bandwidth?
@uday610 @maxzhen I was running the example p2p_bandwidth from the Vitis_Accel_Examples repository.
The issue I found is that the program uses a lot of host memory whether the EXT_PTR flag is there or not (for example, when I read 1 GB of data from the SSD to the FPGA, the program uses about 1 GB of host memory at peak).
I'm confused: if it's P2P, the DMA transfer should not go via host memory, so why does it need so much host memory to complete the data transfer?
Thanks.
System: Ubuntu 20.04, XRT 2021.2, Vitis 2021.2, Platform: xilinx_u2_gen3x4_xdma_gc_2_202110_1
@HitMeFirst How were you measuring the memory usage?
> @HitMeFirst How were you measuring the memory usage?
@moazin By using the Linux `top` command and reading the 'RES' column.
**Background**
I was trying to measure the highest P2P bandwidth that can be achieved between two Alveo U200 boards in my setup when I accidentally discovered a problem. I'm running the example p2p_fpga2fpga from the Vitis_Accel_Examples repository, with a few modifications so that I can measure the P2P bandwidth when larger buffers are transferred:

- Changed `LENGTH` to a high value like `65536*1024` (the original was 65536).
- Allocated `in1`, `in2`, etc. on the heap.

After doing these changes, I tested the bandwidth with `LENGTH` set to 65536 and got it to be around 3.5 GB/s. Typical PCIe bandwidth is usually around 9 to 10 GB/s, so this was a bit surprising.

**The Issue**
I accidentally changed one line in the code and removed the `CL_MEM_EXT_PTR_XILINX` flag. Before/after: the only difference is that the flag (and the corresponding ext pointer) is no longer passed to `clCreateBuffer`. Doing this modification and running the code, I get a bandwidth of ~10 GB/s. The rest of the code is exactly the same, and yet I get this bandwidth and the results are totally correct. Ideally, I'd expect the latter code not to work at all, because that flag is missing and I'm doing a P2P transfer, yet it works and provides better bandwidth.
**What do I want?** I'm clueless about why this is happening and what's going on under the hood. Is this a bug? Shouldn't the code throw errors? Is the transfer still P2P with this modification at all, or is it copying to host RAM and then transferring to the other board?
**System** Ubuntu 18.04, XRT 2021.2, Vitis/Vivado 2021.2, Platform: xilinx_u200_gen3x16_xdma_1_202110_1