Asynchronous message buffer overflow due to back pressure

justham101 commented 7 years ago

Hi Bastian,

Host System Details:

OS: Ubuntu 14.04LTS 64-bit
Processor: Intel Xeon E3-1575M v5 @ 3GHz x 8
Memory: 31.1GB
Graphics: Nvidia Quadro M2000M/PCIe/SSE2
SSD: 500GB M.2 PCIe
Kernel: 3.19.0-80-generic

SDR Hardware and Software Details:

2x Ettus USRP B205mini-i (USB3)
UHD_3.11.0.git-94-g5964adcd
GNU Radio Companion v3.7.10.1-237-g81e7af7b
Freshly installed via PyBOMBS

I have modified your original "wifi_transceiver" example to work with my two Ettus USRP B205mini-i devices (see the flowgraph below) and have added Control Port for performance monitoring. For testing, my SDRs are connected through a loopback cable with 30dB of attenuation. I currently just have it set up to send from one and receive on the other.

[Image: wifi_transceiver_flowgraph]

I am running into an issue where the OFDM Carrier Allocator block in the transmitter half of the WIFI PHY Hier is acting as a bottleneck when transmitting messages. It prevents me from strobing messages with a period smaller than around 5ms before I receive the following warning:

gr::log :WARN: tpb_thread_body - asynchronous message buffer overflowing, dropping message

Looking at the Control Port Performance Monitor output below, I can see a considerable proportion (around 80%) of the computation time being spent on the ofdm_carrier_allocator_cvc block. Eventually its input buffer fills to 100%, applying back pressure on the adjoining buffers.
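To illustrate the mechanism, here is a toy model of a single consumer behind a bounded message queue: once messages arrive faster than they can be processed, the queue fills and new messages are dropped. This is only a simplified sketch, not GNU Radio's scheduler. The 5 ms service time loosely mirrors the threshold I observed, and the queue capacity of 8 is an arbitrary, hypothetical value chosen to keep the run short.

```python
from collections import deque


def count_dropped(period_ms, service_ms, n_messages, capacity):
    """Count messages dropped by a bounded FIFO feeding one slow consumer.

    A message arrives every period_ms; the consumer needs service_ms per
    message. Messages that arrive while `capacity` messages are still
    pending (queued or in service) are dropped.
    """
    pending = deque()   # completion times of messages not yet processed
    last_done = 0.0     # time at which the consumer becomes idle
    dropped = 0
    for i in range(n_messages):
        now = i * period_ms
        # Forget messages the consumer has finished by now.
        while pending and pending[0] <= now:
            pending.popleft()
        if len(pending) >= capacity:
            dropped += 1
            continue
        # The new message starts when the consumer is free.
        last_done = max(now, last_done) + service_ms
        pending.append(last_done)
    return dropped


# Hypothetical numbers: 5 ms per message, queue capacity of 8.
for period in (10, 5, 2, 1):
    print(period, "ms period ->", count_dropped(period, 5, 1000, 8), "dropped")
```

In this model nothing is dropped as long as the strobe period is at least the service time; below that, drops begin as soon as the queue has filled.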

Perhaps unrelated, I have also noticed from the edge colours that the receiver side is consistently allowing frames to progress through to the WIFI Sync Long block from the WIFI Sync Short. Is it worth me adjusting this block's threshold?

[Image: wifi_transceiver_controlport]

I have followed your setup guide by enabling real-time scheduling (this removed most instances of 'O' overruns I previously encountered in 10MHz and 20MHz bandwidth modes) and running volk_profile. This last test is where I found some interesting results. Below I have attached some highlights which show considerable time (anything over 1000ms) being taken for certain operation types (especially volk_32fc_s32f_power_32fc). My machine was classified as avx2_64_mmx_orc. Possibly this ties into the bottleneck? I have also run volk_profile on my i7-6500U @ 2.50GHz x 4 system running Ubuntu 16.10 and found similar or worse results.

RUN_VOLK_TESTS: volk_8u_conv_k7_r2puppet_8u(131071,198)
spiral completed in 232.946ms
generic completed in 2472.62ms
offset 0 in1: 1 in2: 0 tolerance was: 0
offset 2 in1: 1 in2: 0 tolerance was: 0
offset 4 in1: 1 in2: 0 tolerance was: 0
offset 5 in1: 1 in2: 0 tolerance was: 0
offset 6 in1: 1 in2: 0 tolerance was: 0
offset 7 in1: 1 in2: 0 tolerance was: 0
offset 8 in1: 1 in2: 0 tolerance was: 0
offset 9 in1: 1 in2: 0 tolerance was: 0
volk_8u_conv_k7_r2puppet_8u: fail on arch spiral
Best aligned arch: generic
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_16ic_magnitude_16i(131071,1987)
a_sse3 completed in 1662.64ms
a_sse completed in 2827.65ms
generic completed in 2161.98ms
Best aligned arch: a_sse3
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_16ic_s32f_magnitude_32f(131071,1987)
a_sse3 completed in 1377.88ms
a_sse completed in 1227.62ms
generic completed in 873.992ms
Best aligned arch: generic
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_32fc_s32f_power_32fc(131071,1987)
a_sse completed in 28649.9ms
generic completed in 28690.7ms
Best aligned arch: a_sse
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_32fc_s32f_atan2_32f(131071,1987)
a_sse4_1 completed in 5470.39ms
a_sse completed in 6953.64ms
generic completed in 5586.28ms
Best aligned arch: a_sse4_1
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_32fc_s32f_power_spectrum_32f(131071,1987)
a_sse3 completed in 5249.58ms
generic completed in 5222.9ms
Best aligned arch: generic
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_32f_s32f_power_32f(131071,1987)
a_sse4_1 completed in 14412.2ms
a_sse completed in 14403.9ms
generic completed in 14371.9ms
Best aligned arch: generic
Best unaligned arch: generic

RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
generic completed in 39522.7ms
u_avx completed in 4507.59ms
Best aligned arch: u_avx
Best unaligned arch: u_avx
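To compare the architectures more easily, output like the above can be parsed with a short script. This is a minimal sketch that assumes the line format shown in the excerpt ("RUN_VOLK_TESTS: <kernel>(...)" followed by "<arch> completed in <time>ms"); the function names are my own and not part of VOLK.

```python
import re

KERNEL_RE = re.compile(r"^RUN_VOLK_TESTS: (\w+)\(")
TIMING_RE = re.compile(r"^(\w+) completed in ([\d.]+)ms")


def parse_volk_profile(text):
    """Return {kernel: {arch: milliseconds}} from volk_profile output."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = KERNEL_RE.match(line)
        if match:
            current = results.setdefault(match.group(1), {})
            continue
        match = TIMING_RE.match(line)
        if match and current is not None:
            current[match.group(1)] = float(match.group(2))
    return results


def speedup_over_generic(timings):
    """Ratio of the generic time to the fastest non-generic time."""
    generic = timings.get("generic")
    others = [ms for arch, ms in timings.items() if arch != "generic"]
    if generic is None or not others:
        return None
    return generic / min(others)


sample = """\
RUN_VOLK_TESTS: volk_32fc_s32f_power_32fc(131071,1987)
a_sse completed in 28649.9ms
generic completed in 28690.7ms
RUN_VOLK_TESTS: volk_32f_8u_polarbutterflypuppet_32f(131071,1987)
generic completed in 39522.7ms
u_avx completed in 4507.59ms
"""

for kernel, timings in parse_volk_profile(sample).items():
    print(kernel, round(speedup_over_generic(timings), 2))
```

A ratio close to 1 means the SIMD variant brings almost no gain over the generic implementation for that kernel on this machine.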

If I want to make use of the 20MHz bandwidth and potential 54Mbps connection speed of this standard, surely I would need to be able to send messages considerably more often than every 10ms? Or am I missing something obvious? For testing I was using varying payload sizes all the way up to 1499 bytes and periods as low as 1ms before the overflow occurs (theoretically 12Mbps?). I originally tested a Socket PDU UDP server and a Python script instead of the message strobe and found similar results.

Hopefully you can help me understand what is going on. I look forward to your reply and questions. Thanks!

bastibl commented 7 years ago

Hi,

I think the TX side was never really optimized for speed. There should be a paper that looked into it. IIRC, they found out that the memset in the OFDM Carrier Allocator causes much of the overhead.

http://www.ccs-labs.org/bib/arcos2016accelerating/
https://github.com/gnuradio/gnuradio/blob/next/gr-digital/lib/ofdm_carrier_allocator_cvc_impl.cc#L147

In theory, I could copy the block and change it, but I didn't want to do that, since I actually rewrote the transmitter to work with the upstream OFDM blocks. I asked the guy to submit a pull request to GNU Radio, but I think that never happened.

If the receiver pipes frames into the flow graph all the time (even though you don't send any frames), that usually means that you have some DC offset/LO leakage. The receiver uses the autocorrelation of the signal to detect frames (and trigger further processing). You could try connecting an antenna and spacing the devices a bit further apart. I've never seen such problems with a B210, but I also haven't used a cable recently.
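A small sketch of why a DC offset can trigger an autocorrelation-based detector: a constant component correlates perfectly with a delayed copy of itself, so the normalized metric sits near 1 even when no frame is present. This is a simplified illustration and not the actual WiFi Sync Short code. The lag of 16 samples corresponds to the period of the short training sequence at 20 MHz; the window length, noise levels, and DC value are arbitrary assumptions.

```python
import random

LAG = 16      # period of the short training sequence in samples at 20 MHz
WINDOW = 48   # averaging window, an arbitrary choice for this sketch


def normalized_autocorr(samples, lag=LAG, window=WINDOW):
    """|sum x[n] * conj(x[n + lag])| divided by the window power."""
    acc = 0j
    power = 0.0
    for n in range(window):
        acc += samples[n] * samples[n + lag].conjugate()
        power += abs(samples[n]) ** 2
    return abs(acc) / power if power > 0 else 0.0


def noise(count, sigma, rng):
    """Complex Gaussian noise with standard deviation sigma per component."""
    return [complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(count)]


rng = random.Random(0)
length = LAG + WINDOW

pure_noise = noise(length, 1.0, rng)
dc_offset = 1.0 + 0.5j  # hypothetical DC/LO leakage component
noise_with_dc = [dc_offset + w for w in noise(length, 0.1, rng)]

print("noise only:  ", normalized_autocorr(pure_noise))
print("noise plus DC:", normalized_autocorr(noise_with_dc))
```

With noise alone the metric stays small, while the DC case stays close to 1 and would exceed any reasonable detection threshold continuously.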

Regarding the bandwidth, the 54Mbps is the maximum physical layer throughput. You could only reach it if you sent one frame of infinite length. With some overhead per frame and the inter-frame spacing of the MAC, the actual throughput of WiFi is lower.
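As a rough worked example, the per-frame overhead can be estimated from the 802.11a/g OFDM timing (16 us preamble, 4 us SIGNAL field, 4 us per data symbol, 16 SERVICE bits and 6 tail bits). This is only a sketch under simplifying assumptions: a 24-byte MAC header plus 4-byte FCS, a single DIFS of 34 us between frames, and no backoff, ACKs, or retransmissions, so real throughput would be lower still.

```python
T_PREAMBLE_US = 16        # short + long training sequences
T_SIGNAL_US = 4           # SIGNAL field (one OFDM symbol)
T_SYMBOL_US = 4           # one OFDM data symbol
SERVICE_BITS = 16
TAIL_BITS = 6
DIFS_US = 34              # assumed gap between frames (no backoff, no ACK)
MAC_OVERHEAD_BYTES = 28   # assumed: 24-byte MAC header + 4-byte FCS

# Data bits per OFDM symbol for each 802.11a/g rate in Mbps.
DATA_BITS_PER_SYMBOL = {6: 24, 9: 36, 12: 48, 18: 72,
                        24: 96, 36: 144, 48: 192, 54: 216}


def ppdu_airtime_us(psdu_bytes, rate_mbps):
    """Airtime of one frame in microseconds, including preamble and SIGNAL."""
    ndbps = DATA_BITS_PER_SYMBOL[rate_mbps]
    bits = SERVICE_BITS + 8 * psdu_bytes + TAIL_BITS
    n_symbols = (bits + ndbps - 1) // ndbps  # round up to whole symbols
    return T_PREAMBLE_US + T_SIGNAL_US + n_symbols * T_SYMBOL_US


def payload_throughput_mbps(payload_bytes, rate_mbps, gap_us=DIFS_US):
    """Payload bits per microsecond (= Mbps) for back-to-back frames."""
    airtime = ppdu_airtime_us(payload_bytes + MAC_OVERHEAD_BYTES, rate_mbps)
    return 8 * payload_bytes / (airtime + gap_us)


for size in (100, 500, 1499):
    print(size, "bytes:", round(payload_throughput_mbps(size, 54), 1), "Mbps")
```

Short frames are dominated by the fixed preamble and spacing, which is why the payload rate only approaches 54 Mbps as frames get very long.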

justham101 commented 7 years ago

Hi Bastian,

Thanks for the quick reply!

So I had a read of the paper you recommended, which led me to the git repository containing the modified OFDM Carrier Allocator and OFDM Mapper blocks: https://github.com/gonza1207/gr-ieee802-11 Since I installed GNU Radio and gr-ieee802-11 completely via PyBOMBS, what is the best way for me to test out their fork? I tried cloning the repo into a separate folder from your installation:

~/PyBOMBS/src/optimised-80211/gr-ieee802-11
vs
~/PyBOMBS/src/gr-ieee802-11

I then ran the following from its build folder (giving cmake the path to the PyBOMBS installation of GNU Radio), but the blocks appear to be missing when I open GRC. Any tips?

cmake ..
make
sudo make install
sudo ldconfig

Once the code is modified I'll also test everything out with my 850MHz-6GHz log-periodic PCB antennas to see if that solves the receiver over-activity. I might just bring the volk_profile question up on the GNU Radio mailing list.

Thanks for your help so far.

bastibl commented 7 years ago

Hi, I never used PyBOMBS. Maybe you can bring this up on the GNU Radio mailing list. There are lots of people who might know a good workflow for this.
