Koheron / koheron-sdk

SDK for FPGA / Linux Instruments
https://www.koheron.com/software-development-kit/

TCP connection occasionally hangs between koheron server and computer #569

Open chrisjbillington opened 1 year ago

chrisjbillington commented 1 year ago

We're experiencing an issue that has us tearing our hair out a little bit, in which the TCP connection between the koheron server and a computer seems to stall.

We've got a computer connected to an ALPHA250 directly via an ethernet cable, with both the ALPHA250 and the computer configured with static IP addresses. A program running on the computer is making repeated calls (every ~50ms) to the koheron server on the ALPHA250 to acquire arrays of data from the FPGA, which are 131072 bytes per call.

What we are seeing is that at random intervals, sometimes within a minute, sometimes after tens of minutes, the receiving program gets stuck in the sock.recv() call of KoheronClient.recv_all(). Inspecting local variables in recv_all() reveals that, without exception, when it gets stuck the client has received all but 760 bytes of the expected 131072 bytes, and is waiting for the remaining 760 bytes. On other occasions the final chunk is 760 bytes and is read successfully without stalling.
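For context, recv_all() is essentially the standard read-exactly-n-bytes loop; a simplified sketch (a paraphrase, not the exact koheron client code) of where it blocks:

def recv_all(sock, n_bytes):
    # Read exactly n_bytes from the socket, blocking until they have all arrived
    chunks = []
    n_rcv = 0
    while n_rcv < n_bytes:
        # During a stall this recv() blocks with n_bytes - n_rcv == 760
        chunk = sock.recv(n_bytes - n_rcv)
        if not chunk:
            raise ConnectionError("socket closed before full payload received")
        chunks.append(chunk)
        n_rcv += len(chunk)
    return b''.join(chunks)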

During such a hang, pinging the ALPHA250, or pressing a key in a running ssh session with the ALPHA250 will cause the connection to become unstuck. No data is lost, there is no error, the program receives its expected 760 remaining bytes and continues.

During a hang, we can connect to the ALPHA250 over the USB serial monitor, and doing so does not cause the stalled connection to become unstuck. Over the serial monitor things look normal: the CPU is not hanging, and the koheron-server program is consuming the expected amount of CPU. Via the serial monitor we can run code on the CPU to talk to the koheron server over a unix or localhost TCP socket and read other registers from the FPGA, so the koheron-server program itself is not stuck, and neither is the FPGA. We do see that if, over the serial interface, we run the ping command to ping the computer from the ALPHA250, that does cause the connection to become unstuck.

If we connect a USB ethernet adapter to the ALPHA250 (having rebuilt the kernel with the necessary drivers), and ssh in via that network interface whilst the other one is stalled, we see that doing so does not cause the stalled connection to become unstuck.

We only see stalls when using the onboard ethernet adapter to talk to the koheron-server. If we use the USB ethernet adapter to read data from the koheron server, we have not observed a stall that way.

We do not see stalls when the ALPHA250 and the computer are connected to each other via a switch, only when they are connected directly via an ethernet cable.

We have run wireshark on the computer as a stall is occurring, and what we see looks normal - many packets coming from the ALPHA250 as the data is transferred, each followed by an immediate ACK from the computer. The last packet is an ACK from the computer, and we don't see a packet that looks like 760 bytes plus a header; it seems that packet doesn't arrive until after the stall is over.

We have attempted to do packet capture on the ALPHA250 (using tshark) as well, to see what a stall looks like there, but have not observed a stall whilst packet capture was in progress. We hypothesise that putting the network interface in promiscuous mode for packet capture (or something else about doing packet capture) is preventing whatever is causing the stalls to occur. It's also possible we got unlucky during the test and a stall just happened to not occur.
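For reference, the sort of capture being attempted there looks something like this (the output path and client IP are placeholders; filtering on the client's address keeps the capture file small):

tshark -i eth0 -w /tmp/stall.pcap host 192.168.1.1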

We see that whether or not stalls occur, and how frequently, seems to depend on which FPGA bitstream we are using - making changes to the FPGA code for our instrument seems to matter. The specific changes that make it hang more or less often seem fairly arbitrary.

The OS image running on the ALPHA250 is a modified version of that from the Ubuntu 20.04 branch of the koheron-sdk repository. We have added some lines to the kernel defconfig to enable some drivers, have applied a patch from wireguard to backport wireguard to the kernel, have modified the device tree to enable kernel support for the on-board real time clock chip, and have replaced NTP with systemd-timesyncd.

The OS image and instrument were built on Ubuntu 20.04 using Vivado 2017.2.

We have seen this problem with the client computers being variously a macbook running OSX, a laptop running Arch Linux, and a desktop computer running Ubuntu 20.04. We have seen it on two separate ALPHA250s. We have also observed the same problem on a Red Pitaya.

This problem is fairly baffling to us.

It has the flavour of Nagle's algorithm/delayed ACK problems, but a) Nagle is disabled in the server, b) we see in wireshark that the client is not doing delayed ACKs, c) this should not cause indefinite stalls, d) one would not expect pinging the network interface to unstick the stalled connection if it were about Nagle and ACKs, and e) how could the FPGA bitstream affect this?
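Regarding point (a), "Nagle disabled" means the server sets the TCP_NODELAY option on the connection, so small writes are sent immediately rather than being coalesced while waiting for ACKs. In Python terms the equivalent is:

import socket

# Illustration only: this is what disabling Nagle's algorithm looks like on a
# TCP socket; the koheron server does the equivalent on its side in C++.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)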

We also investigated if it is an ARP cache problem, and it seems not: both computer and ALPHA250 have correct entries for each other in their ARP tables during a stall.

Given that pings and unrelated network activity on the same interface seem to "wake up" the stalled connection, we are wondering if the NIC has gone into some powersaving mode prematurely, or if there is a bug in the driver or Linux network stack. We continue to investigate (I have not yet found any way to control powersaving functionality of the NIC, if there is any), but thought it was worth a post here to see if others have seen similar problems, or if there are any known issues that may help us understand or resolve the problem.
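One candidate in that direction is Energy-Efficient Ethernet (EEE); if the driver on the ALPHA250 supports it, ethtool should be able to query and disable it with something like the following, though I haven't verified that it does:

ethtool --show-eee eth0
ethtool --set-eee eth0 eee off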

For reproducing the problem, it is possible we might be able to privately provide (to Koheron) an OS SD card image and instrument zip file that display the problem, but it is unlikely we will be able to provide source code. If this is absolutely necessary then I can look into what we may be able to share, but I am hoping there might be some ideas based on my description of the problem and what we've tried.

Any thoughts would be appreciated!

chrisjbillington commented 1 year ago

A bit more info, and possible workaround.

Firstly, the FPGA program changes that influence whether and how often the hangs occur may not be so arbitrary after all. We think it may have something to do with the amount of data being read from the FPGA and sent to the client in one go. On the Red Pitaya when we saw the issue, this was 65536 bytes. On the ALPHA250 when we saw the issue, it was 131072 bytes (well, two functions, each reading 131072 bytes from a different memory block - we haven't been able to reproduce it with an FPGA configured to transfer a single 131072-byte block of memory). When we saw no issue, it was on an ALPHA250 reading 262144 bytes at once. So it's possible that the problem occurs when reading and transferring less data at once, for some reason.

Secondly, we managed to start packet capture on the ALPHA250 during a hang, and could see the outgoing packets as the transfer resumed. We saw an ARP broadcast that triggered the hang to end, but did not see the missing 760-byte packet thereafter. Combined with what we saw with packet capture on the client, this seems to imply the 760-byte packet gets stuck in a buffer somewhere downstream of packet capture on the server, and upstream of packet capture on the client.

Thirdly, we tried sending a packet with the wrong MAC address to the ALPHA250's ethernet controller. This was a check to see if it would "wake up" the stalled connection, since such a packet should be filtered by the NIC and not reach the OS. It did not cause the connection to become unstuck.
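For anyone wanting to repeat that test, a minimal sketch of sending such a frame from the Linux client with a raw AF_PACKET socket looks like the following (run as root; the interface name and MAC addresses here are made up, and this is not necessarily exactly how we generated ours):

import socket

iface = "enp3s0"                            # hypothetical client-side interface name
wrong_dst = bytes.fromhex("02ffffffffff")   # deliberately not the ALPHA250's MAC
src_mac = bytes.fromhex("021122334455")     # hypothetical source MAC
ethertype = bytes.fromhex("88b5")           # EtherType reserved for experimental use

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.bind((iface, 0))
# 14-byte Ethernet header plus zero padding up to the 60-byte minimum frame size
s.send(wrong_dst + src_mac + ethertype + b"\x00" * 46)
s.close()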

Finally, our potential workaround. We installed ethtool on the ALPHA250 and have disabled "scatter-gather" functionality on the NIC:

ethtool -K eth0 sg off

"scatter-gather" is something about how packets get assembled by the NIC out of possibly non-contiguous blocks of memory, pointers to which are passed to the NIC by the kernel. When it's disabled, the kernel is probably copying all the data into contiguous memory first, and the NIC doesn't have to be as clever. Or something like that.

We have yet to see the connection get stuck with the NIC's scatter-gather functionality disabled. However, since the stalls are probabilistic, I'm not fully convinced yet - we might have just gotten lucky. But we'll leave it like that for now (added a "post-up" line to /etc/network/interfaces to make the setting persist) and see if we get any more hangs in the future.
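For reference, the persistence line is just the same command run when the interface comes up, i.e. something like the following added to the eth0 stanza (the ethtool path may differ depending on where it is installed):

post-up /sbin/ethtool -K eth0 sg off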

So at this point I'm suspicious of a bug either in the KSZ9031 chip firmware or a driver. Since we saw the problem on a Red Pitaya as well, I wanted to check what NIC chip it uses, and it seems to use a network controller built into the Zynq SoC, which I see at one point was subject to a kind of similar-sounding bug, since fixed.

Edit: I see the KSZ9031 is actually a physical-layer GigE transceiver (a PHY), not a full NIC, which leaves me confused. Perhaps both the ALPHA250 and Red Pitaya are using the network controller built into the SoC, which would better fit them being subject to the same bug.

rsarwar87 commented 1 year ago

I have not tried the RP yet, but I have no issues on the ALPHA250, MicroZed or UltraScale. On one design I am sending 25 MB of data, and I let the kernel do as it likes.

Edit: However, I compile our own kernel using PetaLinux rather than through the SDK, except for the ALPHA250, which I take from the SDK.

Edit2: How are you reading data from the FPGA? We use DMA. If you are using DMA, have you ensured that the RAM was correctly partitioned?

chrisjbillington commented 1 year ago

Yes, we're using DMA, and the RAM is probably partitioned however Koheron have it partitioned - I'd have to check with my colleague who does the FPGA side of things.

I think this bug I linked in the previous comment sounds like it might be what we're experiencing, and the FPGA changes may only be relevant to the extent that they modify timing to the point where the rare conditions are met to trigger the bug. Though I can't quite tell on the silicon revisions page whether this bug has supposedly been fixed or not in the silicon we have on our ALPHA250s and Red Pitayas.

Anyway, if any of the hardware Koheron's SDK works with is subject to this bug, and if the disabling of scatter-gather does in fact work around it effectively (I'll report back if it looks like it does or doesn't), then perhaps the Koheron OS images should add the ethtool -K eth0 sg off line as an interface "post-up" command to work around the problem.

tvanderbruggen commented 1 year ago

From what I understand, you only observe stalls when using the on-board Ethernet controller via a direct link. So basically the stall happens when the connection is "fastest" (and disabling "scatter-gather" might also slow the connection down somewhat).

Do you run a thread in your acquisition driver? Some race condition might occur with much higher probability when running the Ethernet faster; in that case the server won't crash but the acquisition thread may stall... What happens if you make the repetitive calls faster or slower than 50 ms?

chrisjbillington commented 1 year ago

Yes, it's occurred to me that anything that seems to make a difference could just be due to its effect on the link's latency, which indeed is at its lowest when we see the issue. I haven't yet tried using different lengths of Ethernet cable, but I considered it!

No, the acquisition driver doesn't spin up a thread. It's pretty simple, something akin to the following:

constexpr uint32_t mydata_size = mem::mydata_range/sizeof(uint32_t);

class MyDriver
{
  public:
    MyDriver(Context& ctx_)
    : ctx(ctx_)
    , ctl(ctx.mm.get<mem::control>())
    , sts(ctx.mm.get<mem::status>())
    , mydata_map(ctx.mm.get<mem::mydata>())
    //...
    {
        //...
    }

    void start_mydata_acquisition() {
        ctl.clear_bit<reg::trig, 0>();
        ctl.set_bit<reg::trig, 0>();
    }

    bool mydata_ready() {
        return sts.read_bit<reg::mydata_ready, 0>();
    }

    std::array<uint32_t, mydata_size> read_mydata() {
        return mydata_map.read_array<uint32_t, mydata_size>();
    }

 private:
    Context& ctx;
    Memory<mem::control>& ctl;
    Memory<mem::status>& sts;
    Memory<mem::mydata>& mydata_map;
    //...
};

With the client essentially doing

while True:
    client.start_mydata_acquisition()
    while not client.mydata_ready():
        time.sleep(0.01)
    data = client.read_mydata()

We did some initial testing with it configured to send data every 10 ms, and every 1 s, and in both cases we still saw hangs, but not unambiguously more or less frequently - the hangs are random so it's hard to be sure, but I would say the hangs are probably not 5× more frequent when sending data every 10 ms, and not 20× less frequent when sending data every 1 s. I can do more testing later to confirm this - it does seem pretty weird that the probability of a hang would not scale with the frequency of data transfers.

tvanderbruggen commented 1 year ago

I don't see an issue with the acquisition code.

When you use the direct link, do you always use the same cable? In that case, can you check that you have 1 GbE enabled? The command dmesg | grep eth0 | grep up must return "link up (1000/Full)". It can happen that 100 Mb or 10 Mb Ethernet is negotiated instead, which is often the sign of a faulty cable... That would explain the absence of scaling with the data rate, because cable faults would occur independently of the data transfer.

chrisjbillington commented 1 year ago

We see it with multiple different cables. Definitely at least two cables, as my colleague and I both see the problem on our own setups, with our own computer, ethernet cable and ALPHA250, and we work in different cities. But I'll check nonetheless whether we see a non-gigabit link, when I next can.

Maybe worth mentioning that my colleague's setup showed the issue much more frequently than mine, which would be consistent with the link latency hypothesis - his computer is more powerful than mine (so maybe it sends ACKs a tiny bit faster...?). Or maybe both cables were bad but one was worse.

chrisjbillington commented 1 year ago

Checking the dmesg log from a computer that has been powered on for a few days (the computer with which we saw the most hangs), we see that the link was almost always gigabit. There are two occasions where it came up at 100 Mbit, went down, and then came back up at 1 Gbit a few seconds later. I can't tell what we were doing at the time; it's possible this was whilst we were actively messing with network settings.

We now have that ALPHA250 in a different lab and connected via a different ethernet cable to the same macbook (and type-c to ethernet adapter) that we have previously seen the issue with, but less frequently. The mac generates some extra traffic on the interface which unhangs the connection, so the hangs are also harder to spot than with the Linux clients. But we haven't yet seen hangs in this context with the longer cable.

We power-cycled the ALPHA250 many times, and on one occasion (the first), we did see that the link came up at 100 Mbit and stayed there. We weren't able to reproduce this again, however.

chrisjbillington commented 1 year ago

Just a few more notes.

Was testing with a Red Pitaya (and yet another different ethernet cable), and also saw hangs, and also saw no hangs after disabling scatter-gather. Have not seen the link degrade to 100 Mbit. In the case of the Red Pitaya, as mentioned, we are attempting to read 65536 bytes at a time instead of 131072 as with the ALPHA250, and we see that when data transfer gets stuck, it is waiting for the final 384 bytes, as opposed to 760 in the case of an ALPHA250.

In case it's a silicon bug, the details of the chip we have on the Red Pitaya I'm testing on are as follows:

(photo of the chip markings)

Whilst the problem does look a lot like the bug I linked above, this is marked as only affecting engineering samples, which we don't appear to have. And I am not seeing any other known silicon bugs that it looks like it could be. So either an unknown silicon bug, or not a silicon bug.

Anyway, a mystery for now, but we'll leave scatter-gather disabled on our devices and report back if we ever see the problem again.

Edit: I see a Xilinx support thread in which a similar, unfixed bug is discussed, related to data getting stuck in a receive buffer instead of a send buffer. Someone in the thread says, however, that they are seeing TX data getting stuck in the same way. This sounds like the same issue we are seeing.