EttusResearch / uhd

The USRP™ Hardware Driver Repository
http://uhd.ettus.com

X310 fails with "x300 fw poke32 - reply timed out" #611

Closed CJCombrink closed 4 months ago

CJCombrink commented 2 years ago

Issue Description

During runtime we sometimes get the following reported in the console:

SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out

Afterwards, all calls to tx_stream->send() time out and no data is transmitted (send() returns 0 after 100 ms).

Setup Details

utils/uhd_usrp_probe --args addr=192.168.40.2
[INFO] [UHD] linux; GNU C++ version 10.3.1 20210422 (Red Hat 10.3.1-1); Boost_106600; UHD_4.2.0.HEAD-0-g46a70d85
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 8000 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
  _____________________________________________________
 /
|       Device: X-Series Device
|     _____________________________________________________
|    /
|   |       Mboard: X310
|   |   revision: 11
|   |   revision_compat: 7
|   |   product: 30818
...
|   |   FW Version: 6.0
|   |   FPGA Version: 38.0
|   |   FPGA git hash: 8daa80c
|   |   RFNoC capable: Yes

Expected Behavior

X310 should not stop sending data, or should recover and start sending data again.

Actual Behaviour

The error is reported and sending data stops completely.

Steps to reproduce the problem

The issue can be reproduced using the "tx_waveforms" example and iperf sending data to the device.

  1. Run the tx_waveforms example
    ./examples/tx_waveforms  --rate 10e6 --freq 1e6 --nsamps 100000000 --args="type=x300,addr=192.168.40.2"
  2. Send iperf data to the device
    iperf -c 192.168.40.2 -u -b 1000m -t 1 -p 1234
  3. Observe that the application never exits (--nsamps is never reached since tx_stream->send() returns 0).

Additional Information

Using iperf is just a convenient way to reproduce an issue that we see sporadically during "normal" operation.

Edit: Further testing showed that send() returns 0 only once its timeout period has expired.

CJCombrink commented 2 years ago

Is there a sensible way to detect this, and then recover? I have tested with the following and it seems to work, but is it correct or is there a better option?

if(nr_send == 0)
{
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}

michaelld commented 2 years ago

You have a valid issue that I'd love to see fixed. In my experience, this error usually means something is wrong with the networking between the host computer and the X310. That said, if UHD could reset the USRP's networking as you note (and I don't know whether that is good code or not), then streaming might be able to resume. Even so, check that the networking is robust: try a direct connection if you're using a switch between the host computer and the USRP; try different cables (Ethernet, DAC, or fiber); try different adapters if you're using Ethernet or fiber. Try a different NIC on the host computer, or a different computer with a similar NIC. It's likely that one of these checks will turn up something that isn't working correctly.

michaelld commented 2 years ago

@michael-west @wordimont what do you think of this code change? Is there another way to reset the streaming to allow data to flow again when this issue happens?

wordimont commented 2 years ago

I don't know if there's a better way to detect and recover, since I'm not super familiar with what options the API provides. I'm curious whether we can reproduce this or whether it really is just an unreliable connection, as you suggested.

@CJCombrink how quickly does this occur when running tx_waveforms with iperf?

CJCombrink commented 2 years ago

@wordimont It happens immediately after I run iperf.

CJCombrink commented 2 years ago

Any update on this perhaps?

CJCombrink commented 2 years ago

More findings: If we call get_tx_stream immediately after send() returns zero we get the following exception:

Error: EnvironmentError: IOError: Timed out getting recv buff for management transaction

(as per the code in my previous comment)

For it to actually work, I need a delay between the time send() returns zero and the time I call the restart code:

if(nr_send == 0)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}

(Almost any delay shorter than the 1-second sleep above results in the exception.) Edit: It appears that either of the two delays shown can be the 1-second one, and the reset will still work.
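
The workaround in this comment can be wrapped in a small retry helper. What follows is a minimal sketch, not UHD code: send_once and rebuild_stream are hypothetical callables standing in for tx_stream->send() and for the tx_stream.reset() / usrp->get_tx_stream(stream_args) pair, and the 1-second settle delay and the retry limit are assumptions based on the observations reported here, not documented values.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical helper, not part of UHD.
// send_once:      stands in for tx_stream->send(); returns samples sent (0 on timeout).
// rebuild_stream: stands in for tx_stream.reset() + usrp->get_tx_stream(stream_args).
// settle:         wait before rebuilding; about 1 s was needed in the tests above.
// max_rebuilds:   give up after this many rebuild attempts (assumed limit).
inline std::size_t send_with_recovery(
    const std::function<std::size_t()>& send_once,
    const std::function<void()>& rebuild_stream,
    std::chrono::milliseconds settle = std::chrono::milliseconds(1000),
    int max_rebuilds = 3)
{
    for (int rebuilds = 0;; ++rebuilds) {
        const std::size_t sent = send_once();
        if (sent != 0 || rebuilds == max_rebuilds) {
            return sent; // success, or still stalled after max_rebuilds rebuilds
        }
        std::this_thread::sleep_for(settle); // let the device settle first
        rebuild_stream();
    }
}
```

A caller would pass lambdas that capture the real streamer and stream arguments. Because get_tx_stream can still throw (as with the management-transaction timeout quoted above), the rebuild callable should be wrapped in a try/catch by the caller. A return value of 0 means the link did not come back within the allowed attempts.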

mbr0wn commented 11 months ago

Running iperf the way you describe will most likely crash the ZPU (I think). That will shut down your device, and "x300 fw poke32 - reply timed out" is then the expected result.

Now I realize that you are obviously not running iperf in normal operation, but I wonder if you have a network configuration that causes a lot of spurious traffic to slam into the X310. I'm not certain this is what's happening, or what such a network setup would look like, but there may be a difference between your setup and most other people's setups.

mbr0wn commented 4 months ago

I'm closing this, as I don't think there's much we can do here. To go back to the original error:

SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out

This indicates packet loss on the Ethernet interface (the S characters before [ERROR]). If a claimer packet (communication between the X310 firmware and UHD) gets lost, the session is killed and no more streaming is possible. Attempting to recover the lost session would be futile, given that the connection itself seems compromised.
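
For readers unfamiliar with the SSSSU prefix: UHD prints single-character status indicators to the console while streaming. The sketch below tallies them from a captured log line. The function name and the assumption that the indicators sit before the first '[' are illustrative only, and the letter meanings follow the UHD manual as best understood here (S = sequence error, U = underflow, O = overflow, L = late, D = dropped).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, not part of UHD: counts the status characters
// (S, U, O, L, D) that precede the first '[' in a captured console line.
inline std::map<char, std::size_t> count_status_chars(const std::string& line)
{
    const std::string indicators = "SUOLD";
    std::map<char, std::size_t> counts;
    for (const char c : line) {
        if (c == '[') {
            break; // start of the "[ERROR] ..." log message
        }
        if (indicators.find(c) != std::string::npos) {
            ++counts[c];
        }
    }
    return counts;
}
```

Applied to the first line of the error quoted above, this would report four S indicators and one U, which fits the packet-loss reading: several sequence errors followed by an underflow once the stream stalls.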

kazim425 commented 1 month ago

This is a problem with the UHD version. The error disappears with UHD 4.7.