EttusResearch / uhd

The USRP™ Hardware Driver Repository
http://uhd.ettus.com

RFNoC/X310: send() sits in infinite loop #203

Closed · bpadalino closed this issue 4 years ago

bpadalino commented 6 years ago

Issue Description

During a send() call, RFNoC blocks may get stuck in an infinite loop and not adhere to the timeout specified.

Setup Details

UHD rfnoc-devel branch on an X310 with a C++ application running on top of UHD. Custom FPGA, but the transmit chain is not custom. Using the DRAM DMA FIFO -> DUC -> Radio - all standard blocks. All transmission bursts are sent for immediate transmission and the whole burst is supplied to send() at one time.

Expected Behavior

When performing a send() call with a timeout of 3 seconds, I expect that if the operation hasn't completed within 3 seconds, the call returns and the timeout is conveyed to me.
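
For illustration, the expected contract looks roughly like this (a minimal sketch; the streamer handle tx_stream and the sample buffer buff are placeholders, not code from the failing application):

uhd::tx_metadata_t md;
md.start_of_burst = true;
md.end_of_burst   = true;
md.has_time_spec  = false; // transmit immediately

// send() takes a timeout in seconds; if it cannot consume all samples in
// time, it should return early with a short count rather than block.
const size_t num_sent = tx_stream->send(&buff.front(), buff.size(), md, 3.0);
if (num_sent < buff.size()) {
    // Timed out: clean up or retry instead of hanging forever.
}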

Actual Behavior

It appears that an infinite loop can be entered when things don't quite work as expected. The backtrace of the infinite loop:

#2  uhd::transport::wait_for_recv_ready (timeout=0.10000000000000001, sock_fd=<optimized out>) at /src/uhd/host/lib/transport/udp_common.hpp:59
#3  udp_zero_copy_asio_mrb::get_new (index=@0x9e6b38: 48, timeout=<optimized out>, this=<optimized out>) at /src/uhd/host/lib/transport/udp_zero_copy.cpp:79
#4  udp_zero_copy_asio_impl::get_recv_buff (this=0x9e6ac0, timeout=<optimized out>) at /src/uhd/host/lib/transport/udp_zero_copy.cpp:226
#5  0x00007ffff753f4fa in tx_flow_ctrl (fc_cache=..., async_xport=..., endian_conv=0x7ffff754e076 <uhd::ntohx<unsigned int>(unsigned int)>, unpack=0x7ffff78a49db <uhd::transport::vrt::chdr::if_hdr_unpack_be(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>) at /src/uhd/host/lib/usrp/device3/device3_io_impl.cpp:345
#6  0x00007ffff754f29c in boost::_bi::list5<boost::_bi::value<boost::shared_ptr<tx_fc_cache_t> >, boost::_bi::value<boost::shared_ptr<uhd::transport::zero_copy_if> >, boost::_bi::value<unsigned int (*)(unsigned int)>, boost::_bi::value<void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>, boost::arg<1> >::operator()<bool, bool (*)(boost::shared_ptr<tx_fc_cache_t>, boost::shared_ptr<uhd::transport::zero_copy_if>, unsigned int (*)(unsigned int), void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&), boost::intrusive_ptr<uhd::transport::managed_buffer>), boost::_bi::list1<boost::intrusive_ptr<uhd::transport::managed_buffer> const&> > (a=<synthetic pointer>, f=<optimized out>, this=<optimized out>) at /usr/include/boost/bind/bind.hpp:515
#7  boost::_bi::bind_t<bool, bool (*)(boost::shared_ptr<tx_fc_cache_t>, boost::shared_ptr<uhd::transport::zero_copy_if>, unsigned int (*)(unsigned int), void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&), boost::intrusive_ptr<uhd::transport::managed_buffer>), boost::_bi::list5<boost::_bi::value<boost::shared_ptr<tx_fc_cache_t> >, boost::_bi::value<boost::shared_ptr<uhd::transport::zero_copy_if> >, boost::_bi::value<unsigned int (*)(unsigned int)>, boost::_bi::value<void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>, boost::arg<1> > >::operator()<boost::intrusive_ptr<uhd::transport::managed_buffer> >(boost::intrusive_ptr<uhd::transport::managed_buffer>&&) (a1=<optimized out>, this=<optimized out>) at /usr/include/boost/bind/bind.hpp:905
#8  boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, bool (*)(boost::shared_ptr<tx_fc_cache_t>, boost::shared_ptr<uhd::transport::zero_copy_if>, unsigned int (*)(unsigned int), void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&), boost::intrusive_ptr<uhd::transport::managed_buffer>), boost::_bi::list5<boost::_bi::value<boost::shared_ptr<tx_fc_cache_t> >, boost::_bi::value<boost::shared_ptr<uhd::transport::zero_copy_if> >, boost::_bi::value<unsigned int (*)(unsigned int)>, boost::_bi::value<void (*)(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>, boost::arg<1> > >, bool, boost::intrusive_ptr<uhd::transport::managed_buffer> >::invoke (function_obj_ptr=..., a0=...) at /usr/include/boost/function/function_template.hpp:138
#9  0x00007ffff78ab9e6 in boost::function1<bool, boost::intrusive_ptr<uhd::transport::managed_buffer> >::operator() (a0=..., this=0xfac718) at /usr/include/boost/function/function_template.hpp:773
#10 zero_copy_flow_ctrl_msb::release (this=0xfac6f0) at /src/uhd/host/lib/transport/zero_copy_flow_ctrl.cpp:46
#11 0x00007ffff7553b24 in uhd::transport::intrusive_ptr_release (p=<optimized out>) at /src/uhd/host/include/uhd/transport/zero_copy.hpp:104
#12 boost::intrusive_ptr<uhd::transport::managed_send_buffer>::~intrusive_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:97
#13 boost::intrusive_ptr<uhd::transport::managed_send_buffer>::reset (this=0x9e7488) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:141
#14 uhd::transport::sph::send_packet_handler::convert_to_in_buff (index=0, this=0xfab088) at /src/uhd/host/lib/usrp/device3/../../transport/super_send_packet_handler.hpp:430
#15 uhd::transport::sph::send_packet_handler::send_one_packet (buffer_offset_bytes=127744, timeout=0.5, if_packet_info=..., nsamps_per_buff=1996, buffs=..., this=0xfab088) at /src/uhd/host/lib/usrp/device3/../../transport/super_send_packet_handler.hpp:388
#16 uhd::transport::sph::send_packet_handler::send (timeout=0.5, metadata=..., nsamps_per_buff=<optimized out>, buffs=..., this=0xfab088) at /src/uhd/host/lib/usrp/device3/../../transport/super_send_packet_handler.hpp:263
#17 uhd::transport::sph::send_packet_streamer::send (this=0xfab080, buffs=..., nsamps_per_buff=<optimized out>, metadata=..., timeout=0.5) at /src/uhd/host/lib/usrp/device3/../../transport/super_send_packet_handler.hpp:462

The issue lies in the call to zero_copy_flow_ctrl_msb::release(), which waits for the flow control call to return true. The call to tx_flow_ctrl() can wait indefinitely if, for some reason, the available space has shrunk to zero and no more flow control credits are coming back.

Since send() never returns, the program locks up and there is no programmatic way to tear down or restart the stream to alleviate the issue.
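
To illustrate the shape of the problem (a conceptual sketch of the pattern visible in the backtrace, not the actual UHD source):

// Conceptual sketch only: the managed buffer's release spins until the
// flow control callback reports success, with no overall bound.
void release()
{
    while (not _flow_ctrl(_buff)) {
        // tx_flow_ctrl() waits internally, with its own short timeout,
        // for a flow control credit packet from the FPGA. If credits
        // never arrive, this loop never exits and send() never returns.
    }
}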

Steps to reproduce the problem

We believe there may be a bug in the underflow handling of the DMA FIFO -> DUC -> Radio chain, since an underflow is always the last async message we see before the hang. We are also running around 100 MHz worth of RF back into the host during the transmission times.

Fortunately, I think you can just look at the code and see that the infinite loop can occur if, for some reason, the FPGA doesn't send back flow control credits.

We are trying to see if changing the error policy to next_packet instead of next_burst will change the behavior, but we are not hopeful.

bpadalino commented 6 years ago

Changing the error policy didn't seem to have much of an effect, but it was only applied to the Radio block, not every block in the chain. I'm not sure if that makes a difference or not.

It appears, from the logs I'm creating, that the last async message received during the failure mode is that an underflow occurred. Then, after a few seconds (presumably while the DRAM FIFO fills up), a send() goes out and never returns.

I'm going to try to remove the DRAM FIFO from the graph, and I'll update the ticket when I have sufficient information.

michael-west commented 6 years ago

@bpadalino Thank you for reporting the issue and the detailed description. Yes, the flow control should probably have a timeout. The downside of adding a timeout is that it adds code in the critical path that checks the time, which is an expensive operation and would add latency that could, in turn, result in more underruns. But the API does imply a timeout should be enforced, so we will look into adding it.

But the root cause of why it ran out of space is probably more important to your application. The default size for the DRAM FIFO is 32 MB, so that would most likely align with your presumption that it stops after filling the DRAM FIFO. That can happen if the timestamp in the header of the packet is much later than the current time on the device, so I would suggest starting by validating the information in the tx_metadata_t and the time on the device (be sure to check the times on all devices if using multiple units).
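
A quick sanity check along these lines (a sketch, assuming a multi_usrp handle named usrp and the tx_metadata_t md that is passed to send(); the one-second threshold is arbitrary):

const uhd::time_spec_t now = usrp->get_time_now();
if (md.has_time_spec and md.time_spec > now + uhd::time_spec_t(1.0)) {
    std::cout << "TX timestamp is "
              << (md.time_spec - now).get_real_secs()
              << " s ahead of device time" << std::endl;
}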

bpadalino commented 6 years ago

@michael-west The timestamp is not being supplied, so the bursts transmit "immediately". If I remove the DRAM FIFO from the chain, I get a lot more underruns, but send() never times out.

We are only using a single unit, but the timestamp has been verified to never be anything other than "immediately".

When an underflow occurs, how should this manifest itself in a flow graph of DRAM FIFO -> DUC -> Radio? Who drops what packets?

If I put together DUC -> DRAM FIFO -> Radio, do you see any issues or concerns with this type of flowgraph or should it essentially be the same thing?

michael-west commented 6 years ago

@bpadalino An underrun is reported by the Radio. It means the Radio needed data and there was none. It doesn't necessarily mean packets were dropped. The underrun does indicate that the Radio did see at least one packet of data, but the hang in the flow control indicates that somewhere the data was stopped.

The default for X310 is DMA FIFO -> DUC -> Radio when using the multi_usrp API and it definitely works. We have tested that extensively, at least on the master branch. You did not specify the commit hash, so I'm not sure how old the code is that you are using. All the RFNoC code from the rfnoc-devel branch has been merged into the master branch. Can you try using the head of the master branch and see if that works for you? Just make sure to add "-DENABLE_RFNOC" to your cmake command.

bpadalino commented 6 years ago

@michael-west A few questions:

I'm on the latest rfnoc-devel before it got merged over to master. I can try master, but I don't have a lot of confidence it will magically work. My general experience with upgrading anything from UHD on the rfnoc-devel branch has been that something changes in a subtle way that breaks everything. The last time it was the clear_tx_seqnum behavioral change, with no additional information/guidance about bus_clk/bus_rst being added to the axi_wrapper when porting to Vivado 2017.4.

bpadalino commented 6 years ago

A request for the future while I upgrade UHD: can you please keep around implementations of blocks at different major revision numbers?

The DDC changed significantly, but if I wanted to use the old DDC, and the controller code worked fine, why rip it out and change everything? Stability is key for me, and this kind of throw-away development of working code seems very wasteful and makes it much more difficult to track down bugs and bisect code.

Now, it appears the noc_shell is at revision 2 and UHD supports revision 4 in master.

Lastly, I have to admit, it's a little disingenuous to say "it definitely works" when the code changes so much, so often. Either the code is stable and "it definitely works", or there's lots of changes all the time, and there may be an abundance of unknown bugs.

michael-west commented 6 years ago

During an underrun, packets are dropped in the Radio block depending on the error policy. The error policy is completely implemented in the Radio block, so the DMA FIFO and DUC operate normally during an underrun. The FPGA code for the TX control has not changed and all policies are supported.

Let me qualify "it definitely works" with "for every release of UHD" because we do test the releases in the DMA FIFO -> DUC -> Radio configuration.

The rfnoc-devel branch was/is a very unstable development branch that was made public to allow access to RFNoC before we merged it into mainline UHD. This also means things have changed in subtle and significant ways without warning. Now that the code has been merged into the master branch, all changes between releases will be recorded in the changelog.

For us to replicate the issue and debug it thoroughly, please send a code example that demonstrates it. The default image has all the required components, so your custom FPGA image is not needed. Please also let us know exactly which commit you are using.

bpadalino commented 6 years ago

I'm trying to use 98057752006b5c567ed331c5b14e3b8a281b83b9 and the usrp_x310_fpga_RFNOC_XG.bit with hash 73236dd2384ac7a9dba3e6c5ac1845e7 and I get the error:

[INFO] [UHD] linux; GNU C++ version 7.2.0; Boost_106200; UHD_3.14.0.0-31-g98057752
[INFO] [X300] X300 initialization sequence...
Error: RuntimeError: Expected FPGA compatibility number 35, but got 33:
The FPGA image on your device is not compatible with this host code build.
Download the appropriate FPGA images for this version of UHD.
Please run:

 "/opt/uhd-git/lib/uhd/utils/uhd_images_downloader.py"

Then burn a new image to the on-board flash storage of your
USRP X3xx device using the image loader utility. Use this command:

"/opt/uhd-git/bin/uhd_image_loader" --args="type=x300,addr=192.168.40.2"

For more information, refer to the UHD manual:

 http://files.ettus.com/manual/page_usrp_x3x0.html#x3x0_flash

My downloader thinks everything is fine. What am I doing wrong?

bpadalino commented 6 years ago

I tried building the fpga repo at c398755579f01a1d020e742fb7d129950939a5dc (uhd/fpga pinned commit) and c7015a9a57a77c0e312f0c56e461ac479cf7f1e9 (fpga/master). I can run uhd_usrp_probe and see the blocks there, but after about 1 second of running ./benchmark_rate --args="addr=192.168.40.2" --tx_rate 1e6 --duration 900 I get the following:

[00:00:04.208977] Setting device timestamp to 0...
[00:00:04.222935] Testing transmit rate 1.000000 Msps on 1 channels
U[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out
[ERROR] [X300] 192.168.40.2: x300 fw communication failure #2
EnvironmentError: IOError: x300 fw poke32 - reply timed out
[ERROR] [X300] 192.168.40.2: x300 fw communication failure #3
EnvironmentError: IOError: x300 fw poke32 - reply timed out
[ERROR] [UHD] An unexpected exception was caught in a task loop.The task loop will now exit, things may not work.EnvironmentError: IOError: 192.168.40.2: x300 fw communication failure #3
EnvironmentError: IOError: x300 fw poke32 - reply timed out

I did verify that the front panel of the X310 shows a little blip of TX for a short period of time, then everything seems to go dead. I can't ping or probe again, and it requires an FPGA reload to work again.

These are the results when working off 98057752006b5c567ed331c5b14e3b8a281b83b9, UHD's current master commit. Also note that I can't use a prebuilt FPGA image since the compatibility numbers are different.

michael-west commented 6 years ago

That is indeed odd behavior. Is it possible that the network interface is not stable?

If you check out the v3.13.0.1 tag, that will have matching pre-built FPGA images, which you can fetch and load with the uhd_images_downloader and uhd_image_loader utilities. The benchmark_rate command you used does work on that version. If you can post the original code that is failing for you (or some sample code that we can use to reproduce it), we can try to use that to reproduce the issue and debug it.

bpadalino commented 6 years ago

@michael-west The network interface is fine. I can probe multiple times in a row without an issue. Running benchmark_rate causes the immediate failure.

The original code that's failing simply calls send() with bursts where the metadata start_of_burst and end_of_burst are true.

I noticed that benchmark_rate doesn't send an immediate stream (has_time_spec is true in the program) and only sends a short end_of_burst at the end. Independent of my testing, can you change every burst to be immediate (has_time_spec set to false), set both start_of_burst and end_of_burst to true for the duration of the test, and see how things go for you? Something like the sketch below.
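
A sketch of the requested change to the TX loop (variable names are illustrative, not the exact benchmark_rate source):

uhd::tx_metadata_t md;
md.has_time_spec = false; // immediate, instead of a timed start

while (not burst_timer_elapsed) {
    md.start_of_burst = true;
    md.end_of_burst   = true; // every send() carries a complete burst
    tx_stream->send(buffs, spb, md);
}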

michael-west commented 6 years ago

The use case for this issue was doing repeated TX and RX bursts by running a shell script that looped on executing an application that would handle one burst. After some random number of iterations, the application would hang.

@bpadalino reported that the issue is not present on the head of the master branch, so we are closing it for now. The potential infinite loop in the TX flow control is still part of the code on the head of master. Since it is not the root cause of this issue, a separate internal issue has been created to track it.

Thanks for reporting @bpadalino!

michael-west commented 6 years ago

Re-opening issue. Other users reported the same issue on the mailing list along with a proposed fix:

http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/2018-August/057721.html

Need to investigate the proposed fix.

bpadalino commented 6 years ago

Adding a comment here for clarity and to update on what I have seen lately. Regardless of what may be happening in the FPGA, the host-side code can loop indefinitely, and the timeout supplied to send() is not respected. This should get fixed no matter what.

As for my other observations, I tried the proposed fix and, while what it addresses may still be a bug, I don't think it actually fixed this one. Originally I thought it had, but then I saw the hang again, and I believe the bug shows itself due to thrashing of the DMA FIFO.

My current scenario setup: I generate bursts of IQ data, in full, and send the bursts, in full, to UHD. I set both the start_of_burst and end_of_burst flags when I give UHD all of my data. This burst generation process is asynchronous, and I feed UHD whenever bursts arrive, with metadata set to transmit immediately.

If I have enough incoming data to generate bursts such that the DMA FIFO never runs dry, then this problem never appears to occur. If bursts arrive back to back such that the FIFO runs dry in between bursts, it is more likely to show up. I have not been able to confirm whether the underflow logic is hit every time the last send() call hangs.

michael-west commented 6 years ago

@bpadalino Thanks for the update. We agree that the infinite loop should be fixed. It is on the list.

Can you provide example code and a procedure so we can reproduce the issue and troubleshoot?

bpadalino commented 6 years ago

Unfortunately I can't. I just don't have the time right now. I've changed my code to avoid sending bursts, moving to a continuous model with zero stuffing in between to avoid underflows and keep the DMA FIFO sufficiently full (sketched below).
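
A minimal sketch of that continuous, zero-stuffed model (spb, running, have_burst(), burst, and tx_stream are assumed context, not code from my application):

std::vector<std::complex<float>> zeros(spb); // zero-valued samples
uhd::tx_metadata_t md;
md.start_of_burst = true; // start one continuous stream
md.has_time_spec  = false;

while (running) {
    if (have_burst()) {
        tx_stream->send(&burst.front(), burst.size(), md);
    } else {
        tx_stream->send(&zeros.front(), zeros.size(), md); // zero-stuff
    }
    md.start_of_burst = false;
}

md.end_of_burst = true;
tx_stream->send("", 0, md); // close out the stream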

I think it's best to go back to the revision that we both could reproduce the problem. I believe you could reproduce it beforehand, yes?

Brian

michael-west commented 6 years ago

@bpadalino Yes, I was able to reproduce it at one point. I found my past notes and example code, so we will continue troubleshooting. I'm glad to hear you have a workaround in place in the meantime.

zz4fap commented 5 years ago

Is there a fix for this problem?

zz4fap commented 5 years ago

Has the proposed fix: http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/2018-August/057721.html been tested? Does it work?

michael-west commented 5 years ago

@zz4fap There is no resolution on this issue yet. It is still under investigation. It is not clear why the proposed change to the FPGA would make a difference and the change looks like it could potentially cause other issues.

bpadalino commented 5 years ago

I tried the fix, but it didn't really help me out. It kept getting stuck.

My fix was to put the DMA FIFO in place and resize it to a reasonable amount of latency and just keep it fed in a continuous mode using zero stuffing.

My belief is that the DMA FIFO, when it thrashes, gets into a situation where a request for more data from the downstream radio is ignored. I haven't used the ILA to find the root of the problem since the workaround was good enough.

michael-west commented 5 years ago

@bpadalino Thanks for the additional information.

zz4fap commented 5 years ago

@bpadalino and @michael-west thanks!

michael-west commented 5 years ago

I can confirm that the issue does appear to be in the DMA FIFO. The problem does not reproduce when using the skip-dram device argument.

zz4fap commented 5 years ago

But a dumb question: what happens when we skip it? Does the USRP still work?

michael-west commented 5 years ago

It works, but there will be underruns at higher sample rates.

zz4fap commented 5 years ago

I see. Thanks. Is there any plan to fix this issue soon?

michael-west commented 5 years ago

We hope to fix it soon, but it will take as long as it takes to find the root cause. If this is a show-stopper for your application, please contact support@ettus.com and reference this issue.

michael-west commented 5 years ago

Correction: the device argument is "skip_dram" (not "skip-dram").
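
For reference, the argument can be passed with the rest of the device args (a sketch; the address is an example, and I'm assuming the usual key=value form for the flag):

#include <uhd/usrp/multi_usrp.hpp>

auto usrp = uhd::usrp::multi_usrp::make(
    std::string("addr=192.168.40.2,skip_dram=1"));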

zz4fap commented 5 years ago

OK, I will report this to Ettus support. In my case skip_dram won't work, as I use RF timed commands... thanks.

natetemple commented 5 years ago

@bpadalino Is this still an outstanding issue for you?

bpadalino commented 5 years ago

@natetemple Sorry for the late reply. It's still an outstanding issue for me, but I have a workaround in place so I haven't seen it happen in quite some time.

Are you able to reproduce this, or has this all just become stale?

michaelld commented 5 years ago

Ping on this issue ... any update?

mainland commented 5 years ago

I have a large application that can reliably trigger this bug using any version of UHD after 3.9.

Same-Writer commented 4 years ago

I'd like to leave a final comment on this issue prior to closing it out.

The send() function sitting in an infinite loop is a symptom, and it has become clear that there are multiple factors that could cause it. If you find yourself running into this, the steps discussed above are a good place to start. If skipping DRAM (i.e., synthesizing an SRAM FIFO image) doesn't do the trick, I'd recommend emailing support@ettus.com directly with a description as well as a reproducing case of your issue.

bpadalino commented 4 years ago

Sorry, but I honestly find that unacceptable. There are C++ code paths that can loop infinitely, with no timeouts, if the radio gets into a bad state. The C++ side of the issue is that send() can sit in an infinite loop and not adhere to the timeout it was given. Was this addressed at all?

Same-Writer commented 4 years ago

@bpadalino I understand. The above comment is not a dismissal of the issue(s) that result in the send() loop being discussed, rather an acknowledgement that the solution will come by root causing these on a case-by-case basis.

As I understand it, adding a check for a timeout to the send() loop would solve the problem we're discussing, but it would also have implications for the throughput rate of this loop. I believe that adding a timeout check increased the loop's execution time by a significant amount, resulting in a hobbled streaming rate.
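
For what it's worth, one way to bound the loop without paying for a clock read on every iteration is to amortize the check (a sketch; credits_available() is an assumed predicate standing in for the flow control state, not a real UHD function):

#include <chrono>

bool wait_for_credits(const double timeout)
{
    const auto deadline = std::chrono::steady_clock::now()
        + std::chrono::duration<double>(timeout);
    unsigned spins = 0;
    while (not credits_available()) {
        if ((++spins & 0x3FF) == 0 // read the clock every 1024 spins
            and std::chrono::steady_clock::now() > deadline) {
            return false; // report a timeout instead of hanging
        }
    }
    return true;
}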

I know you've been working with @michael-west in the past to get this issue reproduced. Would you be able to email me at the support email to help me reproduce the issue and put it in front of a few additional developers?

bpadalino commented 4 years ago

For the FPGA, the DRAM FIFO is the issue, and the problem is exacerbated when using timed bursts that allow the FIFO to run dry between bursts, or possibly to run dry and immediately be re-filled. In my application, I alleviated this by having a 10 ms buffer that I zero-stuffed manually.

For the C++ API side, what should an application developer do if they want to make the call but rely on a timeout so they can clean up and handle it gracefully? The issue I had when these problems would occur is that my transmit thread would get stuck and I couldn't shut down and re-start my application. It's just a robustness issue. The API states the timeout is adhered to, so do you just want to make a note saying the timeout is only a suggestion and it's possible to get stuck in send() under certain circumstances?