alexforencich / verilog-pcie

Verilog PCI express components
MIT License

Example host code on top of the kernel module? #6

Open FPGA-Bot-Yang opened 4 years ago

FPGA-Bot-Yang commented 4 years ago

Hi Alex,

After I've managed to compile the kernel module and load it using 'insmod', is there a recommended host C program that can make use of the kernel module (example.ko) and perform some simple read and write tests?

Thank you for your time!

alexforencich commented 4 years ago

example.ko is standalone; it does not provide any userspace interfaces of any kind. All of the functionality is in the probe method; once that returns, the driver does nothing until it is unloaded. The example design's top-level Verilog code is also just very basic "sanity check" code; you would probably want a more extensive set of control logic, including descriptor handling logic, for a full DMA engine.

If you want to see a more extensive use case for the PCIe DMA modules in this repo, take a look at https://github.com/ucsdsysnet/corundum .

I am planning on extending the example design code in this repo to include at least some sort of basic benchmarking code, but I have not had the time to do so yet. A full PCIe DMA engine with userspace driver to rival, say, the XDMA core would take more time, and I have not had a use case that would facilitate development of that yet.

FPGA-Bot-Yang commented 4 years ago

Thanks Alex!

I know that Xilinx also provides its own host-side DMA driver for XDMA (https://github.com/Xilinx/dma_ip_drivers), which is working fine.

What I'm looking for is a driver that can write data directly from the host side to the FPGA, without using the DMA protocol. While DMA is quite popular, it requires at least 6 transactions between host and FPGA for a simple data write operation. From my understanding and simulation, I think it requires the following steps:

1. Host sends the descriptor address to the FPGA (single trip, host to card).
2. Host sends the 'run' signal to the FPGA (single trip, host to card).
3. FPGA fetches the descriptor from host memory (two trips: card to host, then host to card).
4. FPGA requests the data from the address extracted from the previously fetched descriptor (two trips: card to host, then host to card).

These steps create a lot of overhead, which is not ideal for low-latency designs. What I have in mind is to consolidate the four steps into a single one: the host CPU directly writes the payload data (as well as the destination address) to the FPGA in a single TLP with the payload attached.

I'm not 100% sure if this exists, but all I want is a single-trip operation that can write a small amount of data (hundreds of bytes) from the CPU to the FPGA directly, without the need for descriptors, etc. I'm wondering if I can achieve that with the kernel driver you have here, maybe with some modifications. Do you think this is a viable approach?

Again, thank you so much for your help!

alexforencich commented 4 years ago

I have heard a number of proposals for doing something similar for super low latency packet processing. Presumably there is a serious latency vs. bandwidth trade-off with that, but for sufficiently latency sensitive applications that will be acceptable.

For card to host transfers, you need DMA if the card is going to initiate the operation. But for host to card, you should be able to just write into the PCIe BAR that corresponds to the PCIe AXI master (not the PCIe AXI lite master...that can only sink 32 bit operations, everything else is rejected). For that, you don't even need a driver, all you need to do is unload example.ko, mmap /sys/bus/pci/devices/.../resource1 (I think), and start writing. You can then use an ILA instance to look at what you're getting on the card.
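For illustration, a minimal userspace sketch of that approach could look like the following (the 0000:af:00.0 device address and the resource1 index are placeholders; substitute your own device and BAR, and adjust the mapping length to your design):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* placeholder path: BAR 1 of the device at 0000:af:00.0 */
    const char *path = "/sys/bus/pci/devices/0000:af:00.0/resource1";
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096; /* must fit within the BAR; adjust to your design */
    volatile uint32_t *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    bar[0] = 0xdeadbeef;               /* single 32-bit write into the BAR */
    printf("read back 0x%08x\n", (unsigned)bar[0]);

    munmap((void *)bar, len);
    close(fd);
    return 0;
}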

Actually, you may need to make some adjustments to the example design to get write combining working...I think the BAR needs to be set as 'prefetchable', though I am not 100% certain about that. And it may need to be switched to a 64-bit BAR. So you may need to change BAR 1 to BAR 2, which will also require updating some of the TLP routing code, driver code, and a couple of other things.

FPGA-Bot-Yang commented 4 years ago

Thanks, Alex! What you suggest is exactly what I have in mind. :-) Thanks for pointing out the steps.

"For card to host transfers, you need DMA if the card is going to initiate the operation." Actually in my design, I might also need to return a small volume of data (1-2 bytes) from card to host after the execution on FPGA is done. Would there be a similar workaround to eliminate the descriptor for C2H operations? If there is no such workaround for removing descriptors for C2H, I think I can have the CPU keep reading a certain address on FPGA until a valid data is appear. Do you think that is also feasible (supposed I don't care about host CPU efficiency for now)?

alexforencich commented 4 years ago

You don't necessarily need descriptors; the FPGA just needs to get the host address somehow. You could write the host address directly to the card and avoid using descriptors, if it makes sense to do so. Or you could use descriptors and have the FPGA read them in advance. But both of these require knowing the physical address of some buffer in host memory, which cannot be done from userspace. Polling a register on the card does not require this, but it will generate a lot of PCIe traffic, which could possibly interfere with other operations on the card. If you have a single thread talking to the card, this probably won't be an issue.
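As a sketch only, the register-polling option in the context of the example driver might look roughly like this (STATUS_REG and STATUS_DONE are made-up names; the example design has no such register, so this is just the shape of the idea):

/* spin until the card reports completion in a (hypothetical) status register */
#define STATUS_REG  0x000200   /* placeholder offset */
#define STATUS_DONE 0x1        /* placeholder "done" bit */

u32 status;
do {
    status = ioread32(edev->bar[0] + STATUS_REG);   /* one PCIe read per iteration */
    cpu_relax();
} while (!(status & STATUS_DONE));                  /* a real driver would add a timeout */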

FPGA-Bot-Yang commented 4 years ago

Hi Alex, I used your kernel code with my custom FPGA image (basically an integrated PCIe block IP core with some user application logic). I'm able to read and write 32-bit data (using iowrite32 and ioread32) now. But when I try to write a larger chunk of data, which from your code I believe is this part:

dev_info(dev, "$$$$$ Start Copy Data to Card ... $$$$$"); iowrite32((edev->dma_region_addr+0x0000)&0xffffffff, edev->bar[0]+0x000100); iowrite32(((edev->dma_region_addr+0x0000) >> 32)&0xffffffff, edev->bar[0]+0x000104); iowrite32(0x100, edev->bar[0]+0x000108); iowrite32(0, edev->bar[0]+0x00010C); iowrite32(0x100, edev->bar[0]+0x000110); iowrite32(0xAA, edev->bar[0]+0x000114);

The data transfer doesn't seem to be happening. Using ILA, I can confirm that data like '0x100', '0', '0x100', '0xAA' arrives and is stored on the card (which I believe is some sort of descriptor data), but nothing happens after that. I think this might be related to not having any DMA-like processing logic in my HDL. Could you point out where I can find the HDL code for that part in your example designs? (I tried running make under verilog-pcie/example/VCU118/fpga_axi_x8, but I got an error stating: No rule to make target 'fpga.bit', needed by 'fpga'.)

Thank you so much for your help!

alexforencich commented 4 years ago

Those operations get passed to the PCIe DMA module that the example design pulls in:

https://github.com/alexforencich/verilog-pcie/blob/master/example/ADM_PCIE_9V3/fpga_axi_x8/rtl/fpga_core.v#L862

Not sure why you would be getting 'No rule to make target' from make. That means it can't find one of the source files. I just tested on a clean copy of the repo and it seems to be working fine. Are you developing on Windows? Windows has an aversion to symlinks, so you'll either have to edit the makefile with the true paths, copy the files to where the makefile expects them to be, or use Linux, where symlinks are supported.

FPGA-Bot-Yang commented 4 years ago

Thanks Alex! I re-downloaded the repository and everything works fine for me now. :-)

Following up on my previous question about writing data from the host side to the FPGA without using DMA: using ILA, I can confirm that "iowrite32" works perfectly. But I notice that between two consecutive "iowrite32" calls there is roughly a 25 clock cycle delay (running at 250 MHz). I think what "iowrite32" does (correct me if I'm wrong) is generate a single TLP with a 32-bit (1 DW) payload every time it is called. I'm wondering if there is a more efficient way to write more than 32 bits of data in a single TLP. Say I would like to send 1 KByte (256 DW) of data; is there a way to pack it into a single TLP and finish the write with a single function call, as with "iowrite32"?

Thank you so much for helping me out!

alexforencich commented 4 years ago

So, AFAIK, you are limited by the CPU word size. However, if you change the BAR configuration to use prefetchable BARs, then you might be able to take advantage of write combining. In this case, you would not use iowrite32, you would simply write to the memory addresses in question, and let the CPU caching infrastructure combine the writes. I have never done this before, so I'm not sure exactly how to do it or how effective it would be.
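A rough sketch of what that could look like from userspace, assuming the BAR has been made prefetchable and the kernel exposes a write-combining mapping for it (the resource1_wc path is an assumption; sysfs only creates the _wc file when a WC mapping is possible on your platform):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xmmintrin.h>   /* _mm_sfence (x86) */

int main(void)
{
    /* placeholder path: write-combining variant of the BAR resource file */
    int fd = open("/sys/bus/pci/devices/0000:af:00.0/resource1_wc", O_RDWR);
    if (fd < 0) return 1;

    uint8_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) return 1;

    uint8_t buf[1024];
    memset(buf, 0xA5, sizeof(buf));

    /* plain stores into the WC mapping; the CPU may merge them into larger TLPs */
    memcpy(bar, buf, sizeof(buf));
    _mm_sfence();            /* flush the write-combining buffers */

    munmap(bar, 4096);
    close(fd);
    return 0;
}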

FPGA-Bot-Yang commented 4 years ago

Hi Alex,

I have a quick follow-up question. With your code, I'm able to write/read from the CPU to the FPGA. All of those operations are initiated from the CPU side. Now, if I want to initiate a write operation from the FPGA side and write to a certain address in the CPU's host memory, are any modifications needed in the kernel code? From my understanding, the Linux kernel should be able to handle those FPGA write requests automatically; all I need to do is constantly check the target address and see if the data is there. Am I understanding this correctly?

Thank you so much for the help!

alexforencich commented 4 years ago

Well, you'll need to get the physical address of the target memory location onto the FPGA somehow, then you can issue write request TLPs from the FPGA. A polling loop where the host checks for a DMA write from the card is probably going to be the absolute lowest latency method for signalling from the card back to the host. But I'm not sure if this can be done nicely in kernel code; generally drivers are interrupt-driven. You could also communicate that via a register value that the host can read over PCIe, but the latency would be higher due to round trips over the PCIe bus.

Btw, the kernel does not handle anything here, the writes are handled in hardware by the CPU uncore.
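As a sketch only: if the card DMA-writes a flag word into the existing coherent DMA region, the host-side polling could look something like this (the edev->dma_region field name and the flag offset are assumptions; adjust to the actual driver):

/* wait for the card to DMA a non-zero flag word into host memory */
u32 *flag = (u32 *)edev->dma_region;     /* kernel-virtual pointer to the coherent region */

WRITE_ONCE(*flag, 0);                    /* clear before triggering the card */
/* ... trigger the FPGA here, e.g. an iowrite32 into a BAR register ... */

while (READ_ONCE(*flag) == 0)            /* forces a fresh read from memory each time */
    cpu_relax();                         /* a real driver would add a timeout */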

FPGA-Bot-Yang commented 4 years ago

Thanks for the quick reply. Regarding the physical address, below is the address mapping output printed by the kernel module:

[  320.871680] pci_fpga 0000:af:00.0: Allocated DMA region virt ffff8800aaf88000, phys 00000000aaf88000

[  320.882020] pci_fpga 0000:af:00.0: BAR[0] 0xdbffff00000-0xdbffff007ff flags 0x0014220c

[  320.891012] pci_fpga 0000:af:00.0: BAR[2] 0xf3800000-0xf38007ff flags 0x00040200

[  320.899458] pci_fpga 0000:af:00.0: BAR[0] mapped at 0xffffc900194fc000 with length 2048

[  320.908548] pci_fpga 0000:af:00.0: BAR[2] mapped at 0xffffc900194fe000 with length 2048

When I initiate a write operation from the CPU to the FPGA (BAR0), using ILA I can see that the address in CQ_tdata is 0x00000DBFFFF00000, which matches the BAR0 starting address.

Now, when I try to initiate a write operation from the FPGA to the host, should I set the rq_tdata[63:2] address field to 0x00000DBFFFF00000 (suppose I want to write back to the same location)? Or should I set the address field in rq_tdata to the host address (0xffffc900194fc000)?

alexforencich commented 4 years ago

The BARs are for host->FPGA communication. Going the other way, you need to allocate DMA-accessible memory with something like dma_alloc_coherent. The call to do that will give you both a pointer to the memory that the kernel can use, and the physical address that you can provide to the FPGA. Or you can allocate pages, and then DMA map the pages to get the physical address. So the answer to your direct question is "neither". You don't want the FPGA to write to its own BAR, that would be silly. And the host address is a virtual address and has no meaning on the FPGA side of the MMU.
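For the second option (allocating pages and DMA-mapping them), the standard kernel DMA API calls would look roughly like this (a sketch, uses <linux/dma-mapping.h>):

struct page *page;
dma_addr_t dma;

page = alloc_page(GFP_KERNEL);                        /* one page of host memory */
if (!page)
    return -ENOMEM;

dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, dma)) {
    __free_page(page);
    return -ENOMEM;
}

/* 'dma' is the bus address to give to the FPGA for its write TLPs */

/* later, once the FPGA is done writing: */
dma_unmap_page(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
__free_page(page);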

FPGA-Bot-Yang commented 4 years ago

Thanks for the clarification, Alex!

To make sure I understand you correctly: what you suggest is to allocate another segment of memory that is dedicated to FPGA-to-CPU transfers (independent of the CPU-to-FPGA transfers), am I right?

Also, when I generate the TLP on the RQ port of the FPGA, the address should always be a physical address on the host side, i.e. neither the FPGA BAR address (which indicates an FPGA memory address) nor the host virtual address (the memory-mapped address for the FPGA). And the physical address should point to the known address allocated for FPGA-to-CPU transfers, right?

alexforencich commented 4 years ago

That's correct.
This is how the Corundum NIC works: transmit data is passed to the driver in the form of data structures called SKBs. The driver maps these SKBs, and passes the physical address of the memory to the card. After the packet is sent, the driver unmaps and frees the SKBs. In the receive direction, the driver allocates memory pages, maps those, and passes the physical addresses to the card. After the packet is received by the NIC, the driver unmaps the pages, attaches them to SKBs, and hands them off to the network stack.
Now, how you get the physical address to the card is something that could require some thought. Corundum has a whole descriptor handling subsystem so that the host doesn't actually write the addresses to the card, it only writes them out into descriptor queues in host memory, and then the card can read them via DMA. You may want to implement something similar. Or maybe that doesn't make sense for your application.
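Not Corundum's actual code, but the generic kernel calls that transmit-path description boils down to are roughly:

/* map an SKB for transmit and recover its bus address (sketch) */
dma_addr_t dma = dma_map_single(dev, skb->data, skb->len, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma)) {
    dev_kfree_skb_any(skb);              /* drop on mapping failure */
    return NETDEV_TX_OK;
}

/* write 'dma' and skb->len into a descriptor the card can fetch ... */

/* on transmit completion: */
dma_unmap_single(dev, dma, skb->len, DMA_TO_DEVICE);
dev_kfree_skb_any(skb);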

FPGA-Bot-Yang commented 4 years ago

Thanks for the details! What I'm thinking is to attach the physical address at the end of the payload data when sending data from host to FPGA (assuming a fixed amount of data is transferred each time), and parse it out on the FPGA side.

As for allocating a dedicated memory region for FPGA-to-CPU transfers using dma_alloc_coherent, I notice that you already use that function in your kernel code. I assume I can just call that function one more time, with some tweaks like below (of course I will also add the new attributes to the example_dev struct definition):

edev->dma_wb_region_len = 1024;
edev->dma_wb_region = dma_alloc_coherent(dev, edev->dma_wb_region_len, &edev->dma_wb_region_addr, GFP_KERNEL | __GFP_ZERO);

Would this make any sense to you?
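For completeness, the matching pieces would be something like this (field names follow the naming above; the free belongs in the remove/error path):

/* new fields in struct example_dev: */
size_t dma_wb_region_len;
void *dma_wb_region;
dma_addr_t dma_wb_region_addr;

/* and the matching cleanup in the remove path / probe error path: */
dma_free_coherent(dev, edev->dma_wb_region_len, edev->dma_wb_region, edev->dma_wb_region_addr);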

alexforencich commented 4 years ago

The example driver is just a super stripped down example, and that region is used to test the DMA engine. So if you don't need it you can remove that allocation or use it for something else. I'm just not sure if one allocation is going to be sufficient for your application, or if you need to do something more complex. If one is sufficient for everything, then no problem, you're probably good to go reusing the existing allocation.

FPGA-Bot-Yang commented 4 years ago

I only need a small memory space to receive the FPGA write-back data, so I find your example driver pretty useful for my case. Thanks a lot!

FPGA-Bot-Yang commented 4 years ago

Hi Alex,

I followed your suggestions and successfully implemented and verified the FPGA-initiated data write-back functionality. Thank you so much for your help!

One thing I notice that is not what I expected: the round-trip latency is much longer than anticipated. In my test design, I have the FPGA directly write back a frame of data as soon as it receives a trigger pattern from the host CPU. In this way, I aim to measure the round-trip latency by recording the starting time before the CPU sends the trigger pattern and the ending time when the FPGA write-back data can be read out on the host side. For 64 bits of data, the round-trip time reaches the millisecond level. In comparison, when I let the host CPU initiate an FPGA write and then a read, the round-trip time is merely 600 ns. This is not what I expected. Intuitively, since the FPGA-initiated write-back only takes a single trip, while a host-initiated FPGA read takes two trips (from host to FPGA, then from FPGA to host), the former should take less time, right?
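Roughly, the measurement I have in mind looks like this in the driver (TRIGGER_REG, TRIGGER_VALUE and the write-back flag location are placeholders for my design):

u64 t0, t1;
u32 *flag = (u32 *)edev->dma_wb_region;                 /* FPGA writes its frame here */

WRITE_ONCE(*flag, 0);
t0 = ktime_get_ns();
iowrite32(TRIGGER_VALUE, edev->bar[0] + TRIGGER_REG);   /* send the trigger pattern */
while (READ_ONCE(*flag) == 0)                           /* wait for the write-back */
    cpu_relax();
t1 = ktime_get_ns();

dev_info(dev, "round trip: %llu ns\n", (unsigned long long)(t1 - t0));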

One thing worth mentioning is that in my current kernel code I commented out the MSI IRQ related code (due to a compatibility issue). The general functionality seems to work fine even without the IRQ (at least for host-initiated operations). In the case of the FPGA-initiated write-back, I think this might be an issue, since the host CPU is not expecting the FPGA write-back data and thus cannot pick it up as soon as it arrives. Am I understanding this right?

Thanks a lot!

alexforencich commented 4 years ago

This is likely related to caching. You may need to flush the cache or something if you're expecting the FPGA to write, otherwise the CPU might be reading stale data for some time.

FPGA-Bot-Yang commented 4 years ago

That's a very good point! Thanks for the suggestions!

BTW, do you think it's necessary to set up the IRQ? Even without the IRQ, the host-initiated read operation seems to provide decent latency (about 300um). Does this mean the IRQ is not essential here?

alexforencich commented 4 years ago

If you're polling, then the IRQ isn't really doing much of anything.

FPGA-Bot-Yang commented 4 years ago

What about an FPGA-initiated write to the host CPU?