Xilinx / embeddedsw

Xilinx Embedded Software (embeddedsw) Development
Other
936 stars 1.07k forks source link

Timing hazard found in lwIP ethernet driver (SDK 2019.1 Zynq processor) #179

Open TomRosenbauer opened 2 years ago

TomRosenbauer commented 2 years ago

I discovered an apparent timing hazard in the low level Ethernet driver that does not correctly handle the scatter-gather DMA (SGDMA) ring descriptor if 2 packets are sent within 5 usec of one another (which is the case for a TCP Status message in my application). The work-around was to just add a few usec of delay which empirically was found to resolve the issue. A more correct (but much more difficult) resolution would be to take a deep dive into the handling of the SGDMA ring descriptors.

Below is the “smoking gun” evidence that shows the two Ethernet packets for each TCP status message being handed off to the low Ethernet driver through the LWIP protocol stack . The one highlighted in green is sent correctly. The very next one exhibits the timing hazard, as evidenced by only the first packet getting sent, while the second one is stuck in a DMA buffer, which does not get sent until the next message comes along 5 seconds later. As you can see from the WireShark capture, there are 3 packets sent at that time: the one from the previous message, along with 2 for the new message. The Yellow box highlights the available memory for a message before it is sent, and a value less than 65535 confirms that a previous message was not completely sent. image

The fix was made in module: ~/libsrc/lwip211_v1_0/src/contrib/ports/xilinx/netif/xemacpsif.c As you can see from the pathname, it is part of the Xilinx-specific port of the LWIP stack. The modification I made below is part of the “custom repo” of the BSP (board support package) that comprises the OFP build environment.

if (is_tx_space_available(xemacpsif)) {
    _unbuffered_low_level_output(xemacpsif, p);
    /*
     * TJR 12/8/21 Empirically determined that a couple
     * of usec of delay is needed to reliably send
     * back-to-back packets (which is the case for
     * the AOEW Tx TCP Status message). If packets
     * arrive here in less than 5usec, the second packet
     * gets delayed. The uneven 5/10 second status
     * message pacing is attributed to this.
     */
    usleep(3);
    err = ERR_OK;
} else {

However, I believe the problem actually is in the SGDMA routines, which are called in the _unbuffered_low_level_output() function that is called before my modification. And specifically, I believe that there is a race condition somewhere in the buffer descriptor handling, which is a delicate dance between software running on the A53 processor and the DMA engine that also can access and change the buffer descriptor.

Coding-xinxiangwanji2017 commented 2 years ago

Hi,您好:    我收到了您邮件,等我看完邮件以后,第一时间给您回复。

HariniKatakamX commented 2 years ago

Thanks @TomRosenbauer. We'll look into it. Could you please confirm if you ever observed a TX Used Bit Read warning shortly before this issue occurred?

TomRosenbauer commented 2 years ago

I have not observed a "TX Used Bit Read warning", but then again, I haven't looked too hard, and currently do not have any of the lwIP debug messaging turned on -- let me know if I can provide any additional trace log information (I can insert minimally intrusive trace messages into the driver code with usec resolution). Thanks for taking the taking the time to investigate this further.