alexforencich / xfcp

Extensible FPGA control platform

xfcp + 10G on VCU118 #2

Open · np84 opened this issue 3 years ago

np84 commented 3 years ago

Hi Alex,

I have organized some 10G hardware (QSFP28->SFP+ converters, 10G SFP+ modules, fibre cables). Your verilog-ethernet VCU118 10G loopback example runs like a charm. However, there is no 10G version of xfcp. Is there any technical reason for that, or should it work to simply combine the VCU118 10G loopback example with the VCU118 1G xfcp example? Best, Nico

alexforencich commented 3 years ago

So, the XFCP codebase is really designed around a fixed 8 bit datapath, as the design goals are more about flexibility than high performance. Therefore, running the whole thing at 10G would require a pretty significant overhaul. However, that's probably not necessary if you don't actually need all that bandwidth: what you can do instead is simply connect width converters between the 10G MAC and the XFCP UDP interface module, and it should work just fine. Incidentally, I may do just that on a board I need for some transceiver characterization, which only has QSFP and PCIe.

alexforencich commented 3 years ago

Actually, the 10G MAC should already contain width converters, as it should be using the axis_fifo_adapter modules. So, you should be able to simply set the interface width on the 10G MAC to 8 bits. (This feature is intended for when you want to run with a wider interface and a slower clock, as is done in Corundum when operating at 25 Gbps: the MAC is 64 bits at 390 MHz, but the core logic runs at 250 MHz, so it uses a 128 bit internal interface. But you can certainly go narrower too if you want.)

np84 commented 3 years ago

I think that is exactly what I need. I will try this out soon. You are right that I do not need 10 Gbps but having it slightly faster is not bad at all. Am I correct that using the 8 bit 10G MAC interface and connecting the MAC clock (390 MHz) to xfcp should result in ~3.1 Gbps then? Is that possible or should I use another clock for xfcp?

alexforencich commented 3 years ago

At 10G, the clock is 156.25 MHz. It's 390 MHz at 25 Gbps. I like to run the XFCP logic at 125 MHz so I know all of the cycle counters and such increment at a consistent rate, but TBH it doesn't really make that much difference and running everything at 156 should be fine. However, running at 390 MHz could cause issues with timing closure.

np84 commented 3 years ago

xfcp over QSFP/SFP+ actually works perfectly fine at 156.25 MHz. I'm not sure if this is exactly what you suggested, but I am using the "axis_adapter" module to convert between 8 bit and 64 bit. If you do not have any further suggestions, we can close this issue. Thanks!

alexforencich commented 3 years ago

Nice. I just added a Stratix 10 example that does this. Since the MAC uses the FIFO adapter modules, all I had to do was set the data width to 8 bits. Take a look at: https://github.com/alexforencich/xfcp/blob/master/example/S10MX_DK/fpga/rtl/fpga_core.v#L151 . No additional modules required. That one is set up to use a core clock of 125 MHz to be consistent with the other designs, but if you want to run at 156.25 MHz, that works too.

np84 commented 3 years ago

I did that and also clocked it at 125 MHz, and everything works as expected. However, it turns out that the bandwidth is very low. I do not understand what is going on, but I cannot achieve more than 700 kbps (yes, kbps) write speed. I measured this by writing a 32 bit int (x) via interface.write(0,x.to_bytes(4,'little')) with 10^6 repetitions to your "xfcp_mod_wb" MemoryNode. It is clear that I will not achieve the full 1 Gbps (8 bits at 125 MHz), but ~500 Mbps would be nice. Do you have any idea? Can you verify this on your side?
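For reference, a minimal sketch of that measurement (how `interface`, the handle to the xfcp_mod_wb MemoryNode, is obtained from the XFCP Python tooling is elided here):

```python
# Benchmark as described above: 10**6 single-word writes, each of which is a
# full XFCP request/response round trip over UDP.
import time

interface = ...  # xfcp_mod_wb MemoryNode handle, set up via the XFCP Python tools (elided)

N = 10**6
x = 0xDEADBEEF

t0 = time.monotonic()
for _ in range(N):
    interface.write(0, x.to_bytes(4, 'little'))   # one 32-bit word per request
t1 = time.monotonic()

print(f"effective write speed: {N * 32 / (t1 - t0) / 1e3:.0f} kbps")
```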

mpkopec commented 3 years ago

Hi, since this discussion is also interesting for the XFCP implementation in one of the projects I am working on, I have made some calculations and would appreciate it if Alex could confirm, deny, or correct them.

First of all, I assumed you used write() from MemoryNode. If so, note that writing a single 32-bit value takes a request and a response from the XFCP node: you send a packet with the value, wait for the response, and only then do you know the value has been written. Consider the framing (see my attached notes for the calculations): you have an Ethernet frame, an IPv4 packet inside it, then UDP, and lastly XFCP. From my rough calculations (I don't know the structure of the memory packet since I don't use it, but I assumed the best case, a short packet with only XFCP routing and 32 bits of your data), all of that takes up 73 bytes plus your data, which already limits you to about 50 Mbps, because only 4 bytes out of the total 77 are useful payload.

If you then add a 10 us delay between request and response and a response packet of the same length, you effectively transfer only 4 useful bytes per two packets plus 10 us of idle time. Since 10 us corresponds to 10000 bits on a 1 Gbps link, this brings the effective bandwidth down to only about 3.15 Mbps, which is closer to your value but not quite there; we are still off by a factor of around five. The calculations do not, however, account for Python code execution time, etc.
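Written out as a quick script, the estimate looks roughly like this (all numbers are the rough estimates above, not measurements; working in bits throughout gives a slightly lower figure than the 3.15 Mbps quoted):

```python
# Back-of-the-envelope effective bandwidth for one 32-bit write per round trip.
LINK_BPS   = 1e9     # 1 Gbps link
OVERHEAD_B = 73      # estimated Ethernet + IPv4 + UDP + XFCP header bytes
PAYLOAD_B  = 4       # one 32-bit word of useful data per request
GAP_S      = 10e-6   # assumed request-to-response delay

frame_bits = (OVERHEAD_B + PAYLOAD_B) * 8
# one request frame + one response frame + the idle gap, all per 32 useful bits
time_per_word = 2 * frame_bits / LINK_BPS + GAP_S
print(f"~{PAYLOAD_B * 8 / time_per_word / 1e6:.2f} Mbps")   # ≈ 2.85 Mbps
```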

To inspect the delay between request and response you can use Wireshark. Depending on the connection (point-to-point, through a single switch, or through more networking equipment), the time delta can be as low as under 1 us, as far as I remember.

The conclusion is: it all depends on how you design your node and how much data you send at once. For raw Ethernet frame payloads you can achieve 960 Mbps of real data bandwidth if you fill up the whole frame and keep the link busy all the time. For measuring bandwidth you could also use a simple node that sends some number of full UDP-over-Ethernet packets using XFCP (with the payload length calculated from the protocol stack overhead). On reception of the first packet you would save a timestamp, and after the last packet save another. The number of packets should, imho, be large enough for at least a few seconds of constant traffic, and the connection should be point-to-point to take the network equipment out of the equation.
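A rough host-side sketch of that measurement (the port number and packet count are placeholders, not values from the XFCP design):

```python
# An FPGA test node streams N full-size UDP frames; the host timestamps the
# first and last arrivals and computes the payload throughput.
import socket
import time

LISTEN_PORT = 14000      # placeholder: port the test node transmits to
N_PACKETS = 1_000_000    # enough for several seconds of constant traffic

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", LISTEN_PORT))

sock.recvfrom(2048)                    # first packet starts the clock
t_first = time.monotonic()
total_bytes = 0
for _ in range(N_PACKETS - 1):
    data, _addr = sock.recvfrom(2048)
    total_bytes += len(data)
t_last = time.monotonic()

print(f"payload throughput: {total_bytes * 8 / (t_last - t_first) / 1e6:.1f} Mbps")
```

In practice the default socket buffer will drop packets well below line rate, so enlarging it or timing the capture in Wireshark/tcpdump is more reliable.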

[Attachment: xfcp_pkt_19 (packet structure notes for the calculations above)]

alexforencich commented 3 years ago

Sounds about right. The memory packet header carries the address and transfer length, so it will be a handful of bytes. But, that overhead will be dwarfed by the other headers. It's going to be a lot more efficient if you transfer a block of data in one shot, writing to adjacent addresses.
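For example, something along these lines (a sketch; it assumes a single write() call with a multi-word payload turns into one memory write to adjacent addresses, and it ignores any packet-size/MTU limits):

```python
# Block transfer: pack many 32-bit words into one buffer and issue a single
# write to the base address instead of one request per word.
import struct

interface = ...  # xfcp_mod_wb MemoryNode handle, as before (setup elided)

values = list(range(256))                               # 256 words = 1 KiB block
block = b''.join(struct.pack('<I', v) for v in values)
interface.write(0, block)   # one request covering 256 adjacent 32-bit words
```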

However, the biggest issue is how the Python code always waits for the response packet, at least for memory operations. It may make sense to rework the Python code to mitigate this by supporting multiple in-flight operations at the same time, though I'm not sure what the best approach to that would be.

mpkopec commented 3 years ago

Thanks for the confirmation.

Just to clarify, I made a mistake in my last sum (I added bytes to bits), but this has little impact on the final number, which is already an estimate.

np84 commented 3 years ago

Ok, combining multiple blocks into one bigger transfer raised the throughput to ~37 Mbps for my own xfcp_mod (xfcp_mod_wb should be even faster). This is "ok" for now, but might be a problem for me in the future... :) Maybe some day we can test how much this can be improved by optimizing the Python code.

alexforencich commented 3 years ago

Another thing that you can do is open multiple connections. The setup is smart enough to handle that correctly so you can have multiple scripts or even multiple hosts talking to the same design at the same time. Obviously you need to make sure that the different programs do not interfere with each other, but it definitely works and I have done this before several times, for instance with one control script and one passive monitoring script.

I will also say that one of the design goals of XFCP was to keep things as simple and as easy to use as possible. So, if it's possible to make changes to improve throughput without vastly increasing the code size and interface complexity, then it's worth looking into.

np84 commented 3 years ago

I appreciate the simplicity of XFCP and I'm very happy with it. My downstream task is to "offload" computation to the FPGA (the actual computation can take minutes), so it does not really matter if offloading takes a couple of seconds (which is what I have now with the ~40 Mbps). There is a chance that the offloaded tasks will become 4-16 times larger in the future, but the time horizon for this is >1 year, so I am good for now. The following question is thus just out of curiosity: is there a conceptual issue with a 64-bit XFCP? I.e., is there more to do beyond expanding all relevant ports from 8 to 64 bits?

mpkopec commented 3 years ago

Actually, I am currently working on a data upstream node for XFCP and plan to do some throughput testing, I can report on my findings if you like.

The simplicity of XFCP is really great and I think you can achieve a lot with proper design of the nodes and thinking ahead about the Python interface.

alexforencich commented 3 years ago

You know, now that I think about it, doing 64-bit ports is probably less of an issue than I originally thought. The reason for this is that the switches always have to add or remove a single byte. Yes, this is slightly more annoying with a 64-bit interface than with an 8-bit interface, but since it's always 1 byte then you don't need a full barrel shifter or something. I will put parametrizing the interface width on my to-do list. The harder module to parametrize is the COBS encoder/decoder, but that's only used for the serial port, so I don't think that will be a problem.

np84 commented 3 years ago

Having a parameterizable interface would be perfect :) In fact, I wrote an xfcp_mod_ifconfig to configure the IP address etc. after flashing the bitstream. It would hence be nice not to lose the xfcp UART interface. However, if the COBS encoder/decoder is a problem, I can also configure the IP settings via DIP switches.

alexforencich commented 3 years ago

Nah, the UART interface won't go away; it just requires the use of width converters on that path. And yes, having a way to configure the IP address, MAC address, and maybe the UART baud rate could be useful. Even better if that could be stored in an on-card EEPROM, which many FPGA boards have, or perhaps even in the config flash.

alexforencich commented 3 years ago

Also, I should mention a couple of other things:

1. I usually "wrap" xfcp_mod_wb to provide a configuration interface for my modules. TBH, I don't think there is a single module I have made (aside from the I2C master module) that isn't built like that. (Does your ifconfig module use custom packet formats, or do you also "wrap" xfcp_mod_wb for the configuration interface?)
2. I have a handful of modules, like a trigger generator module, PRBS generators and checkers, and a few others, that I need to clean up and push into this repo at some point.

np84 commented 3 years ago

Yes, I also use xfcp_mod_wb to store IP, MAC, etc. Storing it in some persistent location would be great, but I have never played with this; I will check the docs. One other thing that I might contribute in the future is an xfcp_interface_xdma.v for Alveo cards. It is not working yet, but the plan is to 1) use the accelerated design flow, 2) encapsulate everything (xfcp and all modules) in a "free running RTL kernel", 3) write a C++ module to communicate with the kernel, and 4) integrate the C++ module via ctypes into interface.py. I know that we can also use the "classic" design flow and that the accelerated design flow comes with some overhead, but it allows us to rely on the XDMA platform shell, which simplifies the PCIe communication a lot (at least, that is my hope).

mpkopec commented 3 years ago

As for external storage of the configuration, I was thinking about extending the i2c_init module. On the board I am currently working on we have an EEPROM that has an official MAC address stored inside, so we need a module that sends and receives a handful of packets on the I2C bus and presents the retrieved data on parallel outputs (e.g. for the MAC and IP).

As for configuration through Wishbone, I have done something different, i.e. created an XFCP module that has parallel inputs and outputs (RW registers and RO registers). I plan to structure the Python implementation so that the registers are accessible using the [] operator. This is not yet release-ready, though.
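A minimal sketch of what such a wrapper might look like (the class name, register width, and the read()/write() calls on the underlying node are assumptions, not the actual implementation):

```python
# Hypothetical []-style access to a register-file XFCP node. Assumes the
# underlying node exposes read(addr, count) and write(addr, data) and that
# registers are 32 bits wide at word-aligned addresses.
class RegisterNode:
    WORD = 4  # bytes per register

    def __init__(self, node):
        self.node = node  # underlying XFCP node handle

    def __getitem__(self, index):
        data = self.node.read(index * self.WORD, self.WORD)
        return int.from_bytes(data, 'little')

    def __setitem__(self, index, value):
        self.node.write(index * self.WORD, int(value).to_bytes(self.WORD, 'little'))

# usage: regs = RegisterNode(node); regs[3] = 1; status = regs[0]
```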

mpkopec commented 2 years ago

Sorry for reopening the issue...

> Another thing that you can do is open multiple connections. The setup is smart enough to handle that correctly so you can have multiple scripts or even multiple hosts talking to the same design at the same time. Obviously you need to make sure that the different programs do not interfere with each other, but it definitely works and I have done this before several times, for instance with one control script and one passive monitoring script.

As far as I understand the code in the xfcp_interface, the source port of a packet coming into the FPGA is rewritten to be the destination port of the next outgoing XFCP packet, am I right? If so, when connecting from more than one port (more than one socket), the response will be sent to the port of the last incoming packet. This means that multiple scripts would have to be somehow synchronised to avoid such a situation. This is even worse if one has a node that can send data without a request. @alexforencich Could you comment on that?

alexforencich commented 2 years ago

TBH, I think it may be a bit less of an issue than it might appear. Basically, there is very little buffering that takes place inside of the interconnect, so when one request is being processed the next one is effectively blocked inside of the RX FIFO before it can even be fully processed by the UDP stack. At least the way most of the current XFCP components work, they start generating the response while processing the request packet, effectively blocking both paths through the on-chip network until the request has been handled.

For things sending data without a request, I don't really have a good method for handling this right now. I would actually need to check what would happen in that case, since the node would have to include the correct reverse path for it to hit one port or the other if you have both the serial and UDP interfaces. This isn't really a priority though, as none of the modules I use send any unprompted responses.

Anyway, one idea I had to alleviate any potential issues with this sort of thing is to store some transient connection state and use the reverse-path capability to associate the response with the connection. Potentially this could also be used to send unprompted transmissions to the appropriate client, but I don't have a nice way of configuring that at the moment.

mpkopec commented 2 years ago

Agreed, for the modules you have in the repo it all checks out. I have written two more XFCP modules, one of which is capable of data transfer without a request. Since we don't plan to use UART and UDP together, or more precisely, more than one rpath entry, I simply save the rpath in a register and then send the data back to the same location that was saved. I wanted to do that explicitly so as not to block the TX path, as the data-sending module may be attempting to send the next frame.

Nonetheless, thanks a lot for the reply. I think the rpath capability as a means of routing responses back to the proper port would be very interesting, and I would love to see that in the future.

Btw, do you have any paper on XFCP that I could cite in my research?

mpkopec commented 2 years ago

To be honest, we plan to resolve the data transmission issue in software.