Xilinx / XRT

Run Time for AIE and FPGA based platforms
https://xilinx.github.io/XRT

Unexpected behavior of CL_MEM_EXT_PTR_XILINX and XCL_MEM_EXT_P2P_BUFFER #6613

Open moazin opened 2 years ago

moazin commented 2 years ago

Background: I was trying to measure the highest P2P bandwidth achievable between two Alveo U200 boards in my setup when I accidentally discovered a problem. I'm running the example p2p_fpga2fpga from the Vitis_Accel_Examples repository, with a few modifications that let me measure the P2P bandwidth when larger buffers are transferred. The modifications are as follows:

  1. I modified both kernels so they no longer use an internal buffer, which lets them handle buffers of arbitrary size passed via an argument, as is done now.
  2. In the host code, I set LENGTH to a large value such as (65536*1024); the original was 65536.
  3. To avoid segmentation faults, I modified the code to allocate the buffers in1, in2, etc. on the heap (a sketch of these host-side changes follows the list).
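
For reference, a minimal sketch of what these host-side changes might look like, assuming data_t is int and the buffers are switched to std::vector; the exact layout of my modified host code may differ:

    // Sketch only: heap-allocate the test vectors so a large LENGTH does not
    // overflow the stack. data_t and the variable names follow the example.
    #include <cstddef>
    #include <vector>

    typedef int data_t;
    constexpr std::size_t LENGTH = 65536UL * 1024;   // the original example used 65536

    int main() {
        std::vector<data_t> in1(LENGTH), in2(LENGTH);    // heap-backed inputs
        // ...and similarly for the other buffers the example allocates.

        // in1.data(), in2.data() and sizeof(data_t) * LENGTH are then used
        // wherever the original code referenced the stack arrays.
        return 0;
    }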

After making these changes, I tested the bandwidth with LENGTH set to 65536 and measured around 3.5 GB/s. Typical PCIe bandwidth is usually around 9 to 10 GB/s, so this was a bit surprising.

The Issue: I accidentally changed one line in the code and removed the CL_MEM_EXT_PTR_XILINX flag. The before and after are shown below. Before:

    cl_mem madd_in;
    cl_mem_ext_ptr_t min = {XCL_MEM_EXT_P2P_BUFFER, nullptr, 0};
    OCL_CHECK(err,
              madd_in = clCreateBuffer(context[1], CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX, buffersize, &min, &err));

After:

    cl_mem madd_in;
    cl_mem_ext_ptr_t min = {XCL_MEM_EXT_P2P_BUFFER, nullptr, 0};
    OCL_CHECK(err,
              madd_in = clCreateBuffer(context[1], CL_MEM_READ_ONLY, buffersize, nullptr, &err));

With this modification, running the code gives a bandwidth of ~10 GB/s. The rest of the code is exactly the same, yet I get this bandwidth and the results are completely correct. I would expect the latter code not to work at all, since the flag is missing and I'm doing a P2P transfer, yet it works and provides better bandwidth.

What do I want? I don't understand why this is happening or what's going on under the hood. Is this a bug? Shouldn't the code throw errors? Is the transfer still P2P with this modification, or is the data copied to host RAM and then transferred to the other board?

System: Ubuntu 18.04, XRT 2021.2, Vitis/Vivado 2021.2, Platform: xilinx_u200_gen3x16_xdma_1_202110_1

mamin506 commented 2 years ago

@chienwei-lan / @maxzhen, any idea why P2P gets better bandwidth without "CL_MEM_EXT_PTR_XILINX"? This is a potential bug.

maxzhen commented 2 years ago

Don't you need "CL_MEM_EXT_PTR_XILINX" to pass in "min", where you specify the P2P flag to allocate a P2P BO? Otherwise you'll be allocating a normal BO and performing normal DMA. 3.5 GB/s sounds reasonable for P2P and 10 GB/s sounds reasonable for normal DMA.

uday610 commented 2 years ago

Without that EXT_PTR flag there is no P2P, so the data transfer happens via the host. So let's not consider that case.

With the EXT_PTR flag it is P2P. Do you have a case where you previously saw higher bandwidth for P2P and now see lower bandwidth?

moazin commented 2 years ago

@maxzhen So here I dropped CL_MEM_EXT_PTR_XILINX and didn't pass in min either, but the code that copies the data remains the same, via P2P, just like in the original example linked above. Something like:

    int fd = -1;
    OCL_CHECK(err, err = xcl::P2P::getMemObjectFd(madd_in, &fd)); // Export the buffer to a file descriptor (fd)
    if (fd > 0) {
        std::cout << "Import FD:" << fd << std::endl;
    }

    cl_mem exported_buf;
    OCL_CHECK(err, err = xcl::P2P::getMemObjectFromFd(context[0], device_id[0], 0, fd, &exported_buf)); // Import the fd as a buffer in the other context
    cl_event event;
    OCL_CHECK(err,
              err = clEnqueueCopyBuffer(queue[0], mmult_out, exported_buf, 0, 0, sizeof(data_t) * LENGTH, 0, nullptr,
                                        &event)); // transfer
    clWaitForEvents(1, &event);

Except that now all of these buffers are regular buffers instead of P2P buffers. I would expect this code to throw an error saying that a P2P transfer can't happen when the buffers are not P2P buffers, but instead it runs fine.

I think two things are possible:

  1. The P2P transfer is still happening, at 10 GB/s bandwidth.
  2. While the copy code is the same, under the hood the transfer happens via normal DMA, giving 10 GB/s.

I think it's (2) that's happening here, but I'm wondering whether this code is even supposed to work with non-P2P transfers, because it uses xcl::P2P::getMemObjectFd and xcl::P2P::getMemObjectFromFd, which I assumed are meant for P2P transfers only.

Also, is there a way to confirm whether the transfer is happening via normal DMA under the hood instead of P2P?

uday610 commented 2 years ago

I'm wondering whether this code is even supposed to work with non-P2P transfers, because it uses xcl::P2P::getMemObjectFd and xcl::P2P::getMemObjectFromFd, which I assumed are meant for P2P transfers only.

xcl::P2P::getMemObjectFd has nothing to do with P2P specifically; the xcl::P2P namespace is just added inside the host code of the example, https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/common/includes/xcl2/xcl2.hpp#L99 . Underneath, it calls xclGetMemObjectFd, which works for any buffer object. I understand that the xcl::P2P namespace is misleading, but those APIs have nothing to do with P2P.
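
For illustration, a minimal sketch of how such a thin wrapper could be structured; the extension-function signatures below are assumptions modelled on CL/cl_ext_xilinx.h, not a copy of the actual xcl2.hpp:

    // Sketch only: the xcl::P2P helpers are assumed to forward straight to the
    // XRT OpenCL extension calls, which accept any cl_mem, P2P or not.
    #include <CL/cl_ext_xilinx.h>   // assumed to declare xclGetMemObjectFd / xclGetMemObjectFromFd

    namespace xcl {
    namespace P2P {

    // Export a cl_mem to a file descriptor that another context can import.
    inline cl_int getMemObjectFd(cl_mem mem, int* fd) {
        return xclGetMemObjectFd(mem, fd);
    }

    // Re-import that file descriptor as a cl_mem on another context/device.
    inline cl_int getMemObjectFromFd(cl_context context, cl_device_id device,
                                     cl_mem_flags flags, int fd, cl_mem* mem) {
        return xclGetMemObjectFromFd(context, device, flags, fd, mem);
    }

    } // namespace P2P
    } // namespace xcl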

moazin commented 2 years ago

Thanks a lot @uday610. So I guess using these APIs for regular DMA transfers is a valid approach, since the underlying function call has nothing to do with P2P specifically.

uday610 commented 2 years ago

Yes, those APIs just get the FD of the buffer, which is not related to the buffer type. You can try one thing to prove this. You have enabled P2P on the devices with an xbutil command, right? Something like xbutil p2p --enable (I don't know which version of XRT you use, so I don't know whether you use the old xbutil commands or the new ones). Now disable P2P on the devices with xbutil, make sure it is disabled with xbutil query/examine, and run your test case again.

moazin commented 2 years ago

@uday610 You're right. After disabling P2P via xbutil, everything still works and I get the ~10 GB/s bandwidth.

@maxzhen I've one question.

Don't you need "CL_MEM_EXT_PTR_XILINX" to pass in "min", where you specify the P2P flag to allocate a P2P BO? Otherwise you'll be allocating a normal BO and performing normal DMA. 3.5 GB/s sounds reasonable for P2P and 10 GB/s sounds reasonable for normal DMA.

You say that 10 GB/s is reasonable for normal DMA. If we measure host-to-FPGA bandwidth, it does indeed turn out to be around ~10 GB/s. Here, however, the transfer is fpga1 --> host --> fpga2, so I would expect to get half the bandwidth, since two transfers have to happen. I'm not sure why I'm getting 10 GB/s. Note that this bandwidth is computed from the time the transfer takes.
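
For context, a minimal sketch of how that number could be obtained, assuming the measurement is simply wall-clock time around the clEnqueueCopyBuffer / clWaitForEvents pair shown earlier:

    // Sketch only: time the blocking copy and convert bytes/second to GB/s.
    #include <CL/cl.h>
    #include <chrono>

    double measure_copy_gbps(cl_command_queue queue, cl_mem src, cl_mem dst, size_t bytes) {
        cl_event event;
        auto t0 = std::chrono::high_resolution_clock::now();
        clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes, 0, nullptr, &event);
        clWaitForEvents(1, &event);                      // block until the copy has finished
        auto t1 = std::chrono::high_resolution_clock::now();
        clReleaseEvent(event);
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        return (bytes / seconds) / 1e9;                  // GB/s
    }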

moazin commented 2 years ago

@uday610 @maxzhen Following up on the previous question.

I've been thinking about this and two things don't make sense to me.

  1. I'm not sure why the P2P bandwidth would be slower than a typical DMA transfer through the host. Where does this 3.5 GB/s figure come from? Is it a limitation of the Xilinx FPGA platforms and drivers, or a limit of the PCIe Gen3 standard?
  2. When the transfer is not P2P, I looked at the XRT source code and it seems there are three possibilities: something called M2M, KDMA, or going via the host. I'm not sure which one is happening here, or why I get 10 GB/s when it's not P2P. If the two transfers (1. FPGA 1 --> host, 2. host --> FPGA 2) happen one after the other, the bandwidth I measure should be half of 10 GB/s.

maxzhen commented 2 years ago

P2P transfer speed really depends on the PCIe topology. If the two devices are under the same switch, it can be faster if the switch vendor implements it that way. The more switches you need to hop between the two devices, the slower it gets. If the P2P data transfer has to go through the root complex, it will be even slower, because host CPU vendors optimize the device-to-memory path; they are not motivated to optimize data transfers that bypass the CPU. The same is probably true for all PCIe switch chip vendors. If it is not P2P, the XDMA moves the data between device and host memory, and that path is normally well optimized by all chip vendors. PCIe is a bi-directional bus: both directions operate at the same time and can each run at the 10 GB/s rate, so you don't see half the speed.
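
To illustrate the last point with some back-of-the-envelope arithmetic (the ~10 GB/s per-direction figure is taken from the numbers above; the fully pipelined model is an assumption):

    // Sketch only: if the device->host and host->device hops run one after the
    // other, the end-to-end rate halves; if they overlap on the two PCIe
    // directions, it approaches the per-direction rate.
    #include <algorithm>
    #include <cstdio>

    int main() {
        const double bytes   = 4.0 * 1024 * 1024 * 1024;  // example payload: 4 GiB
        const double per_dir = 10e9;                      // ~10 GB/s in each direction

        double t_serial    = bytes / per_dir + bytes / per_dir;          // hop 1 then hop 2
        double t_pipelined = std::max(bytes / per_dir, bytes / per_dir); // hops overlapped

        std::printf("serialized: %.1f GB/s\n", bytes / t_serial / 1e9);    // ~5 GB/s
        std::printf("pipelined : %.1f GB/s\n", bytes / t_pipelined / 1e9); // ~10 GB/s
        return 0;
    }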

HitMeFirst commented 2 years ago

Without that EXT_PTR flag there is no P2P, so the data transfer happens via the host. So let's not consider that case.

With the EXT_PTR flag it is P2P. Do you have a case where you previously saw higher bandwidth for P2P and now see lower bandwidth?

@uday610 @maxzhen I was running the example p2p_bandwidth from the Vitis_Accel_Examples repository.

The Issue: I found that the program uses a lot of host memory both with and without the EXT_PTR flag. (For example, when I read 1 GB of data from the SSD to the FPGA, the program uses a maximum of about 1 GB of host memory.)

I'm confused: if it's P2P, the DMA transfer should not go via host memory, so why does it need so much host memory to complete the data transfer?
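
For comparison, a minimal sketch of how a P2P read from an NVMe SSD is typically structured (map the P2P buffer to get a pointer into device memory, then pread() straight into it); the flags, sizes, and file handling here are assumptions, not the exact p2p_bandwidth code:

    // Sketch only: with a real P2P buffer, the mapped pointer refers to device
    // memory exposed over the PCIe BAR, so pread() moves data SSD -> FPGA
    // without staging the whole payload in host DRAM.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                                  // O_DIRECT is a GNU extension
    #endif
    #include <CL/cl_ext_xilinx.h>
    #include <fcntl.h>
    #include <unistd.h>

    void p2p_read_from_ssd(cl_context ctx, cl_command_queue q, const char* nvme_path, size_t bytes) {
        cl_int err = CL_SUCCESS;
        cl_mem_ext_ptr_t ext = {XCL_MEM_EXT_P2P_BUFFER, nullptr, 0};
        cl_mem p2p_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX, bytes, &ext, &err);

        // Map the P2P buffer: the returned pointer is a window into device memory.
        void* p2p_ptr = clEnqueueMapBuffer(q, p2p_buf, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ,
                                           0, bytes, 0, nullptr, nullptr, &err);

        int fd = open(nvme_path, O_RDONLY | O_DIRECT);   // O_DIRECT bypasses the page cache
        pread(fd, p2p_ptr, bytes, 0);                    // SSD -> FPGA, peer to peer
        close(fd);

        clEnqueueUnmapMemObject(q, p2p_buf, p2p_ptr, 0, nullptr, nullptr);
        clFinish(q);
        clReleaseMemObject(p2p_buf);
    }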

Thanks.

System: Ubuntu 20.04, XRT 2021.2, Vitis 2021.2, Platform: xilinx_u2_gen3x4_xdma_gc_2_202110_1

moazin commented 2 years ago

@HitMeFirst How were you measuring the memory usage?

HitMeFirst commented 2 years ago

@HitMeFirst How were you measuring the memory usage?

@moazin By using the Linux top command and reading the 'RES' (resident memory) value.
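
As a cross-check, the same quantity can be read programmatically; a minimal Linux-only sketch that parses VmRSS from /proc/self/status (this is an assumption about the setup, not part of the example code):

    // Sketch only: report this process's resident set size, the value top
    // shows in the RES column, by parsing /proc/self/status on Linux.
    #include <fstream>
    #include <iostream>
    #include <string>

    long resident_kib() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0)        // line looks like "VmRSS:   123456 kB"
                return std::stol(line.substr(6));
        }
        return -1;                                   // not found (non-Linux system)
    }

    int main() {
        std::cout << "RSS: " << resident_kib() << " kB\n";
        return 0;
    }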