corundum / corundum

Open source FPGA-based NIC and platform for in-network compute
https://corundum.io/

Tools for AXI Lite register reads/writes and AXI streaming read/writes #163

Open AdamI75 opened 10 months ago

AdamI75 commented 10 months ago

@alexforencich I am testing Corundum's register reads and writes, as well as its streaming reads and writes over PCIe DMA. Could you point me in the right direction for a suitable tool, similar to how Xilinx uses XDMA, for writing and reading control/status registers and for streaming data?

I see you specify the full register map here - https://docs.corundum.io/en/latest/rb/index.html. Could you please give a few examples of how to read and write registers via your PCIe interface, and how to send and receive data via the streaming interface? If possible, please use the user application for the example, thanks.

I also wanted to know whether there is spare register space that the user application can use. I am not really seeing this in your documentation.

I will investigate this more next week.

AdamI75 commented 10 months ago

@alexforencich Okay, so I can use setpci for PCI configuration - https://docs.xilinx.com/r/en-US/pg213-pcie4-ultrascale-plus/Configuration-Space. These registers are not on the card itself, but rather created by the BIOS, I think? I see that you have created corundum/utils/mqnic-config etc. to set up the configuration from userspace on the Corundum card itself. You seem to have developed C applications to write and read the registers, but I don't think you have documented this in detail. I am looking to use a utility like pcimem or devmem to write and read the registers, and to stream TX and RX data to/from the host via the PCIe DMA interface. Do you have an application (e.g. in Python) for that, or do you know which Ubuntu utilities I should use, just to get a basic understanding of the low-level operations? An example use case would be very useful, thanks.

alexforencich commented 10 months ago

That's actually not a register map, that's just a list of the register blocks and their associated type and version. The layout varies depending on the target board and various design parameters, so the registers do not have defined addresses; instead, the register block tree must be enumerated to figure out where things are located.
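
To illustrate the enumeration described above, here is a minimal Python sketch that walks a chain of register blocks held in a byte buffer. It assumes each block starts with a three-word little-endian header (type at offset 0x00, version at 0x04, next pointer at 0x08) and that a next pointer of 0 ends the chain; both points are assumptions that should be checked against the register block documentation and the userspace library. The function name `walk_register_blocks` and the type/version values in the demo buffer are hypothetical.

```python
import struct

HEADER_FMT = "<III"  # assumed header: type, version, next pointer
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def walk_register_blocks(mem, start=0, max_blocks=64):
    """Return a list of (offset, type, version) for each block in the chain."""
    blocks = []
    seen = set()
    offset = start
    while len(blocks) < max_blocks and offset not in seen:
        if offset + HEADER_SIZE > len(mem):
            break  # pointer runs past the end of the region
        seen.add(offset)  # guard against loops in a corrupted chain
        rb_type, rb_version, next_ptr = struct.unpack_from(HEADER_FMT, mem, offset)
        blocks.append((offset, rb_type, rb_version))
        if next_ptr == 0:
            break  # assumed end-of-chain marker
        offset = next_ptr
    return blocks


# Demo on a fake in-memory region with two made-up blocks.
demo = bytearray(0x100)
struct.pack_into(HEADER_FMT, demo, 0x00, 0xFFFFFFFF, 0x00000100, 0x40)
struct.pack_into(HEADER_FMT, demo, 0x40, 0x0000C000, 0x00000100, 0x00)
demo_blocks = walk_register_blocks(demo)
for off, rb_type, rb_version in demo_blocks:
    print(f"offset=0x{off:04x} type=0x{rb_type:08x} version=0x{rb_version:08x}")
```

The same function accepts an `mmap` object in place of the `bytearray`, so the printed offsets could in principle be passed to a tool such as pcimem.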

Currently, the only really supported way of accessing the registers is via the userspace library and userspace tools. Take a look at the template example application for a very simple example of how to do this. The DMA benchmark application currently only has a kernel module as there isn't currently a good way to handle DMA from userspace. And streaming stuff is really only set up for network traffic via the device driver.

The application section gets BAR2 on its own dedicated AXI lite interface. So you can connect whatever you want to that and organize it however you like. The BAR size is set in config.tcl, so you can adjust this as needed. Eventually BAR4 will be connected to on-card RAM, but this hasn't been implemented yet.

setpci is certainly a useful tool for poking at config space, but I don't think this is going to help much for accessing application section logic. Either pcimem, using mmap directly, or using the userspace library is probably going to be the way to go for accessing the register space. Unfortunately, there also isn't a good way to stream data to/from the card, aside from transferring network traffic via the kernel driver/DPDK PMD. I have been thinking about ways of implementing this in a clean way, but so far haven't cooked up something sufficiently flexible, and it's also not high on my priority list as I'm mainly doing bump-in-the-wire style packet processing, time sync, and time-based scheduling. Honestly, the application block is a bit of a placeholder at this point and still needs quite a bit of fleshing out.
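
As a rough illustration of the "using mmap directly" option, the sketch below maps a file and performs 32-bit little-endian reads and writes at a byte offset. It is demonstrated on a temporary file so it can run anywhere; on a real system the path would be a BAR resource file or device node, which is an assumption not exercised here. The helper names `open_region`, `read32`, and `write32` are hypothetical, and `struct` does not guarantee a single 32-bit bus access, which may matter on real hardware.

```python
import mmap
import os
import struct
import tempfile


def open_region(path):
    """mmap the whole file read/write and return the mapping."""
    fd = os.open(path, os.O_RDWR | getattr(os, "O_SYNC", 0))
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file


def read32(region, offset):
    """Read a 32-bit little-endian word at the given byte offset."""
    return struct.unpack_from("<I", region, offset)[0]


def write32(region, offset, value):
    """Write a 32-bit little-endian word at the given byte offset."""
    struct.pack_into("<I", region, offset, value & 0xFFFFFFFF)


# Demo against a 4 KiB temporary file standing in for a register space.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(bytes(4096))
    tmp.flush()
    region = open_region(tmp.name)
    write32(region, 0x24, 0x1234ABCD)
    demo_value = read32(region, 0x24)
    region.close()

print(hex(demo_value))
```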

alexforencich commented 10 months ago

I should also mention that Corundum isn't really intended as a drop-in replacement for the Alveo shell and associated framework. Corundum is specifically intended to be a NIC, and to support networking-related applications that can take advantage of custom hardware integrated with a high-performance NIC. On the other hand, the Alveo framework isn't designed to support networking, it's designed more for compute offload applications where userspace software will interact with a self-contained accelerator card via PCIe, similar to how you might use a GPU to handle certain types of computation. So, doing that kind of offloading with Corundum is going to be a bit rough at the moment.

AdamI75 commented 10 months ago

@alexforencich Thanks for all the useful info on the registers. Ahhh, it makes sense that it is not a register map. I was looking for base addresses and wondering how this was meant to work, as I am used to documentation that specifies registers with base addresses and offsets. Okay, I will look at your examples and take it from here. I was able to access the git hash by using ./pcimem /dev/mqnic1 0x24 w. However, I was only able to access this block - https://docs.corundum.io/en/latest/rb/fw_id.html#rb-fw-id. I suspect this is because its address starts at 0x0 and the others don't, so I don't know where the other registers have been placed in memory. I will try the example and see how you have done it, and I will get back to you if I still can't figure it out.

Okay, so streaming may be an issue. I will investigate this. When you stated "there also isn't a good way to stream data to/from the card, aside from transferring network traffic via the kernel driver/DPDK PMD", do you mean that you can only stream to the NIC and not over the PCIe interface? You can receive Ethernet data and send that via PCIe to the host, right? As I understand it, the kernel driver communicates via the PCIe interface but only has access to system DMA memory, right? Therefore, I should be able to send data via Ethernet, do some processing on it, and send it to the kernel driver, which will store it in system DMA memory. Please correct me if I am wrong in any of this, thanks.

AdamI75 commented 10 months ago

@alexforencich Looking at your template example, I see that Corundum organizes its register blocks as a linked list, with each register block being a node in the list. Therefore, I should be able to create my own C file that uses your functions to access register blocks other than the FW ID block. I think it would be great if the kernel driver could publish the register names somewhere so that we could easily access registers by name. That is, the driver opens the device, works out the positions of all the registers, and writes these into a list of dedicated names. Then all I would need to do is use pcimem with these names, or use another utility to map the names to memory addresses and pass those addresses to pcimem, so I don't need to run applications all the time. I could just use a script if I wanted. As you said, this is not a high priority for you, but maybe we can assist here if we decide to go with this framework for our prototype. Hopefully I am understanding this correctly. I will write my own C code to read and write the register blocks as a test. I will keep you posted.

AdamI75 commented 10 months ago

@alexforencich Okay, I am now able to read back the register blocks using a modified version of your C template example. Just out of interest, is it possible to print out the PCI memory addresses of the register blocks using your existing software functions, so that I can use pcimem as an alternative? For now, using a C file similar to your example will work, but I think we will eventually want to automate this address map so that other applications can use it.

alexforencich commented 10 months ago

I mean I guess we could, but IMO it probably makes more sense just to expose the parameters you would want to poke at as sysfs files or something. What registers do you want to access?

For streaming data, all I'm getting at is there isn't any nice copy_to_card method somewhere that you can call. For streaming data, the driver only supports sending/receiving packet data to/from the network stack (which is all transferred via PCIe DMA). Currently, there is no easy way to, say, set up a data buffer in userspace and point the DMA engine at the card without involving the kernel networking stack.

AdamI75 commented 10 months ago

@alexforencich I think the way that you are reading and writing registers using a C application and functions, is perfectly fine. I need to be able to access the same Register Blocks that are listed in your documents. I think what is needed for Corundum is an official comms C platform/GUI that can read all the Register Blocks and write to all the relevant ones. The user can then integrate this with Control and Monitor software, which can decide what to do if there is a failure etc. This is what we have for our MeerKAT Radio Astronomy telescope, although we use Python. I would then generate some more documentation and tutorials for the open source community. This is something SARAO could look at if we decide to use the Corundum framework.

Just to clarify: I am hearing that you support sending/receiving data to/from the network stack, and that this uses PCIe DMA to host system memory, but that there is no way to bypass the kernel network stack with your kernel driver. Therefore, I should still be able to have an application running in host system memory that sends/receives data to/from the Corundum card, say to the user application, via the PCIe DMA streaming interface, but it will still go through the kernel? Put another way, what if I don't want to stream network packets, but rather my own custom data via PCIe DMA to the user application and then, say, back to host DMA memory? Would this be possible, or would I need to update the Corundum firmware and/or driver for this? In the end, I want to be able to receive data via the 100G NIC, do some unpacking, process this data using the user app, and finally stream it via PCIe DMA to host system memory, initially without bypassing the kernel. I should be able to simulate this at the very least using cocotb and Icarus Verilog, correct? I know I have asked this before, but I just need this to be clarified. If it is not possible, then this is effort that I must factor in for Corundum. SARAO would look at this if we decide to go with your framework.

I think userspace would be where we wanted to end up eventually to bypass the kernel networking stack e.g. explore the benefits of DPDK in improving the PCIe DMA throughput. I need to test the MLE DPDK driver in more detail.

AdamI75 commented 10 months ago

@alexforencich I thought I would let you know that, after our evaluation process, SARAO has decided to use the Corundum framework for our own SmartNIC development. This means that we (SARAO) would be able to support and contribute to the open source Corundum community. We just need to decide what to contribute, e.g. a comms utility, DPDK integration, docs/tutorials etc.

One thing that we would like to add is a standard comms utility package for Corundum that any user could use as is to read and write the register blocks. The idea is that we would write this in Python using your existing C function API calls, but use Python's libraries for analysing and displaying data. This could be integrated into the Corundum repo with suitable documentation explaining how to use the comms utility with Corundum. The utility would be able to read and write all existing register blocks, so that users do not have to create their own comms utility each time. It could live in its own repo and be referenced by your repo as a submodule, similar to how it would eventually work with DPDK, which we could also help contribute towards.

Let me know your thoughts, thanks? Looking forward to being part of the community that you have created :).

AdamI75 commented 8 months ago

@alexforencich I have compiled in a user application, but I am having trouble in reading back the registers via the C application. It keeps saying that the application section is not present when I know it is. This could be an issue with how I am coding the C-Application. I use the following code for now:

[Screenshots of the C application code: "screen shot user app reg 1", continued in "user_app_screen_shot_2"]

How do I get dev->app_regs to populate with the actual size of app_regs? Its size is always zero unless I give it a known size. I am able to read back the other register blocks fine, provided they exist; it is just the user application registers that cannot be read back. Some example code would help. I used the DMA test bench as a baseline.

Thanks for your help.

alexforencich commented 8 months ago

The sizes come from the PCIe BAR sizes; those sizes are not exposed in registers in the BARs themselves. The userspace tools have two ways of getting this information. When connecting to /dev/mqnicN, this information is made available via ioctls and then used to mmap the BARs. When connecting without the driver loaded, each BAR appears as an mmapable file in sysfs, with the file size corresponding to the BAR size.
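
The sysfs route described above can be sketched as follows: the size of a BAR is taken from the size of its resource file. This is only a sketch; the helper names are hypothetical, the `/sys/bus/pci/devices/<BDF>/resourceN` path layout is the usual Linux convention rather than something stated in this thread, and the demo uses a sparse temporary file in place of a real BAR.

```python
import os
import tempfile


def sysfs_bar_path(bdf, bar):
    """Build the conventional Linux sysfs path for a BAR resource file."""
    return f"/sys/bus/pci/devices/{bdf}/resource{bar}"


def bar_size_from_file(path):
    """Return the file size in bytes, or 0 if the file does not exist."""
    try:
        return os.stat(path).st_size
    except FileNotFoundError:
        return 0


# Demo: a sparse 16 MiB temporary file stands in for a BAR resource file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.truncate(16 * 1024 * 1024)
    demo_size = bar_size_from_file(tmp.name)

demo_missing = bar_size_from_file("/nonexistent/resource2")
print(demo_size, demo_missing)
```

A result of 0 for the application BAR would be a hint to check whether that BAR was enumerated at all.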

alexforencich commented 8 months ago

I suppose I should add that the OS gets the sizes during enumeration via a BAR sizing process - it writes all 1s to the BARs, and then checks how many LSBs read as 0, as that corresponds to the BAR size. Then there will be some way to get this info from the OS later on, as you don't want to write to the BARs in config space after the device is active.
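
The arithmetic behind the sizing process can be shown with a short Python sketch. It only decodes a value that has already been read back after writing all 1s; it never touches config space. It covers the lower 32 bits of a memory BAR only (for a 64-bit BAR the upper dword would also have to be included), and the function name is hypothetical.

```python
def bar_size_from_readback(readback):
    """Decode a 32-bit memory BAR size from the value read after writing all 1s.

    The low 4 bits of a memory BAR hold type/prefetch flags, not address bits,
    so they are masked off before the size is computed.
    """
    addr_bits = readback & 0xFFFFFFF0
    if addr_bits == 0:
        return 0  # no writable address bits: BAR not implemented
    # The bits that stayed 0 give the size: invert and add one.
    return ((~addr_bits) & 0xFFFFFFFF) + 1


# 0xFF000004: the low 24 address bits read as 0, flags mark a 64-bit memory BAR.
print(bar_size_from_readback(0xFF000004) // (1024 * 1024), "MiB")
```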

AdamI75 commented 8 months ago

Therefore, I assume that if I change the PCIe BAR2 size in config.tcl from 0 to the actual size, then I should be able to read back the values. Is it set to zero by default, i.e. (0, 0)? The issue is that this value is set to (0, 0) in config.tcl. I am assuming that your C application takes care of the ioctls and mmaps the BARs, so I don't need to do anything? Is there any other parameter in config.tcl or anywhere else that I need to set? Maybe you can give me a few examples of how to configure config.tcl so that dev->app_regs will be populated? Thanks!

alexforencich commented 8 months ago

The config.tcl script should set the BAR size automatically; all you need to do is set AXIL_APP_CTRL_ADDR_WIDTH and APP_ENABLE. The 0 0 and 0 2 refer to the PF index and BAR index; those should not be changed.

alexforencich commented 8 months ago

Also, you can check the BAR list in lspci and the IP core config in Vivado. One other potential gotcha: if you use the hot reset script, the app BAR may not enumerate properly if the config that was enumerated at boot (probably the one in the flash) has a smaller app BAR (or no app BAR). So, you may need to do a warm reboot instead of a hot reset so the FPGA gets enumerated properly.
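
One way to script the lspci check is to scan its verbose output for regions that were not given an address. The sketch below is a plain text filter; the sample text is hand-written for illustration and is not a real log, the exact wording lspci prints may differ between versions, and the function names are hypothetical.

```python
import subprocess


def lspci_verbose(bdf):
    """Return `lspci -vv -s <bdf>` output as text (not called in the demo)."""
    result = subprocess.run(["lspci", "-vv", "-s", bdf],
                            capture_output=True, text=True, check=True)
    return result.stdout


def find_unassigned_regions(lspci_text):
    """Return the stripped lines that mention an unassigned region."""
    return [line.strip() for line in lspci_text.splitlines()
            if "unassigned" in line.lower()]


# Hand-written sample text, for illustration only.
SAMPLE = """\
01:00.0 Ethernet controller: Example device
\tRegion 0: Memory at f0000000 (64-bit, prefetchable) [size=16M]
\tRegion 2: Memory at <unassigned> (64-bit, prefetchable) [size=16M]
"""

demo_hits = find_unassigned_regions(SAMPLE)
for hit in demo_hits:
    print("warning:", hit)
```

A non-empty result for the application BAR would suggest that the BAR was not mapped at boot, as in the hot reset case described above.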

AdamI75 commented 8 months ago

Thanks, @alexforencich. Do you perhaps have some examples of C application code that can read back the user application registers? I just need a few lines to make sure that I am doing this correctly. Your help is most appreciated.

alexforencich commented 8 months ago

https://github.com/corundum/corundum/blob/master/fpga/app/template/utils/app-template-test.c basically just does a single test write and read of the app BAR

AdamI75 commented 8 months ago

Haha, thanks @alexforencich. I am using that, but no luck. Maybe the BAR is not being properly enumerated. Okay, so if I use this code, then why is the application reporting "Application section is not present" when I have specified it in the user application Verilog file? It seems that this should be fully automated, so I shouldn't have an issue? Sorry for all these questions, but from what you say it seems like I am doing the right thing.

AdamI75 commented 8 months ago

@alexforencich I did the lspci and sure enough it says that it is "unassigned", so I think this is the issue I am having. I may need to warm reboot the card, which I thought I had, but I see someone else has configured the card. I will keep you posted, but this looks to be the source of my issues, thanks.

[Screenshot: lspci_screen_shot_unassigned]

Status update: okay, I think the image in flash does not use BAR2, but the image I configure does, and I think it is not enumerating properly. Therefore, either the 16M BAR2 size is too big, or BAR2 needs to be configured in flash with this size; otherwise it will not enumerate. I think this is my issue. I have seen this with the Alveo before, when the BAR settings in flash differ from those of the image you program, like you said. The dmesg command shows that BAR2 is too big and can't be assigned. At least I now know the source of my frustrations, thanks. I will keep you posted.

alexforencich commented 8 months ago

FYI your image link is bad. Anyway, I'm thinking I should probably add some code somewhere to complain about BARs not being properly mapped to avoid this kind of confusion. Presumably this could be implemented in both the userspace tools and the driver without too much difficulty.

AdamI75 commented 8 months ago

@alexforencich Thanks, I fixed the image. It just shows that BAR2 is "unassigned". Yes, some additional documentation or code checking will be useful for users, thanks. I will keep you posted once I can read back the user application. Thanks, Alex!

Status update: I am now able to read and write the user application registers. Thanks for the support. By the way, the warm reboot didn't work. I had to flash the card with an image that includes the BAR2 user application registers. I find that if the BAR size changes at all, then the flash needs to be reprogrammed. Once this is done, the BAR2 memory region is assigned on the PCIe memory bus and the user can access the user application registers. I think this should be documented for other users.