ARMmbed / mbed-os

Arm Mbed OS is a platform operating system designed for the internet of things
https://mbed.com
Other
4.67k stars 2.98k forks source link

LWIP stack corruption - c++x11 #6262

Closed pauluap closed 6 years ago

pauluap commented 6 years ago

Target: DISCO_F769NI Toolchain: GCC_ARM 6.3.1 20170215

The background is that I'm working on an application that has a Modbus server component. I'm using modpoll (http://www.modbusdriver.com/modpoll.html) with the command modpoll <ip-address> -r 1 -c 72 -t 4:hex -l 0 to perform torture testing. I had been finding out that the connection fails, then the device isn't responsive anymore, sometimes not even to ping requests.

I managed to create a working minimal example at https://github.com/pauluap/mbed-os/tree/lwip_cx11_issue

The twist is that the issue doesn't appear when building with the -std=gnuc++98 (with the command mbed compile -m DISCO_F769NI -t GCC_ARM) but exhibits problems when building with the -std=c++11 option (I did this by running mbed export -i GCC_ARM -m DISCO_F769NI and then changing the two instances of -std=gnu++98 in the generated Makefile)

I ended up getting my hands on a logic analyzer. After attaching a logic analyzer, it seems that there may be multiple issues within this problem.

I'm attaching saleae logic analyzer captures as well as wireshark captures. You're more than welcome to download saleae Logic (www.saleae.com/downloads) and Wireshark to open and view the captures. capture_fail_201803041016.zip

One aspect of the problem seems to be that the transmit half of TCP stops working. completecapture In the complete capture screenshot above, it can be observed that the first attach signal leads to a period of more or less normal communication (the solidish gray bars), and then something happens. The client tries a few retries, closes the socket, and tries reopening the socket (after the solidish gray bars, there's spikes without accompanying spike in the Attach signal, that's the retry, then the Attach spikes represent the client closing and trying to reopen the socket).

singlenormalframe This is a capture of a single normal frame. The client kicker is the ISR signal which transfers handling out of the ISR context onto the event queue context (indicated by the Pending signal raise). The pending event is then handled, receiving a frame. The received frame is then "processed" (the dummy implementation only sets enough to generate a valid response then pushes a new event onto the event queue) and then finally transmitted. The socket op communications at the bottom are debug messages reporting on all calls of socket.recv and socket.send. In this particular screenshot, the communications are (from left to right) 8 = recv(&[0], 8) 4 = recv(&[8], 4) -3001 = recv(&[0], 8) -3001 = recv(&[0], 8) 153 = send(&[0], 153)

And the cycle is supposed to repeat ad infinitum. And it does for the gnu++98 build.

In the C++11 build though, things fall off the cliff. endofnormalcommunications The last socket operation here is a 153 = send(&[0], 153) and then silence.

failcommsrestart If I observe a retry, it seems that the incoming frame was captured, processed, and transmitted. However, it doesn't appear that the client ever received the transmission thus declares a timeout failures and retries. After a while, the client decides that it's a socket failure, closes the socket and tries to reopen the socket and continue communications. The same thing as the screenshot above happens, but with the addition of the Attach signal.

So, the issue seem to tie in with C++11. Why is there even a problem? The LWIP sources are all in C, correct? And that's still built using -std=gnu99

I've been working on developing my codebase all throughout this and other issues, and have a pretty extensive C++11 codebase that I would really love to preserve (aside from having to return to the bad old days of C++98), so I'm willing to help in getting to the bottom of this issue. As you can see, I have a logic analyzer that can assist. The changing data in the Modbus response is worrying though, pointing to some kind of memory address assignment issue within the GCC_ARM toolchain.

pauluap commented 6 years ago

I've done some analysis and experiments. I have a bunch of captures and stuff, but right now that's probably too much noise.

Right now, I'm thinking that the issue isn't C++x11 per se, but rather that there's a latent issue that the different code generation output exposed.

I used the logic analyzer to follow the call stack through to lwip_netconn_do_writemore - https://github.com/ARMmbed/mbed-os/blob/a6e27b1b866e7301adcc81f94ea85e0ee7fc63de/features/FEATURE_LWIP/lwip-interface/lwip/src/api/lwip_api_msg.c#L1491. I didn't check every single call path, but I'm pretty confident that all of them are the same, ultimately ending up at https://github.com/ARMmbed/mbed-os/blob/a6e27b1b866e7301adcc81f94ea85e0ee7fc63de/features/FEATURE_LWIP/lwip-interface/lwip/src/api/lwip_api_msg.c#L1563, calling API_EVENT(conn, NETCONN_EVT_SENDMINUS, 153).

I stopped at that point and then looked at the LWIP debug options. Enabling all the options led to the test application not crashing for a half hour before I stopped it. By comparison, a 'normal' build always ran into problems in less than five minutes. That makes me think that there's a race or resource constraint.

While enabling all debug options led to a stable (presumably) stack, I was able to observe a fault when enabling the debug options one by one. The group of (TCP_DEBUG, TCP_INPUT_DEBUG, TCP_OUTPUT_DEBUG, TCP_RTO_DEBUG, TCP_CWND_DEBUG, TCP_QLEN_DEBUG) was "too stable" for me to capture a failing trace, but I managed to capture a trace from a smaller group of TCP_INPUT_DEBUG and TCP_OUTPUT_DEBUG TCP_QLEN_DEBUG. All of them are attached to this entry as well as the Wireshark capture of the smaller TCP debug group 201803042119_tcp_smaller_group.pcapng.zip inet_debug.txt ip_debug.txt tcp_cwnd_debug.txt tcp_debug.txt tcp_input_debug.txt tcpip_debug.txt tcp_output_debug.txt tcp_qlen_debug.txt tcp_rto_debug.txt udp_debug.txt tcp_group.log.zip tcp_smaller_group.log.zip

I'm hoping for some guidance if I need to do a deep dive. I saw somewhere that there were potential issues around ISR event queues running out of slots. Is there a way for me to extend the various resources involved to check if there's a resource contention issue?

0xc0170 commented 6 years ago

@kjbracey-arm @mikaleppanen

kjbracey commented 6 years ago

We're not yet guaranteeing the codebase as a whole will work with C++11, but it seems unlikely such a subtle problem is directly to that. It may just be some sort of alignment/timing issue.

The first alarm bell is "STM F7" - you do have this fix in your tree, right? https://github.com/ARMmbed/mbed-os/pull/5720 - fix went into 5.7.2. Also possible we've not got to the bottom of that issue.

That problem came and went between builds - alignment of the Ethernet decriptors being what was actually the trigger condition.

If it really was the C++11 flag, any failure is reasonably swift, you could go for a binary chop on building some of the files C++03 and some C++11. For starters - if all the mbed OS files are compiled C++03 first, with only your own app files built as C++11, does it work? (Build everything C++11, then move stuff out of the BUILD directory, flip makefile to C++03 then recompile). Working your way through the object files in a binary search might pin down 1 particular file, which might give a clue.

But that's only supposing C++03 versus C++11 is really the difference. Have you tried flipping the build profile (debug/develop/release?)

For event queue depth etc - you can increase lwip stack size with JSON options defined in features/FEATURE_LWIP/lwip-interface/mbed_lib.json. I think most sorts of event queue/ISR queue exhaustion would trigger a clear assert failure.

You could try adding "dummy" timed calls to the socket sigio handler to see if it unsticks the client - if that works, it means the data's waiting, but it missed the event, and we can narrow it down to event loss rather than data loss.

pauluap commented 6 years ago

Thanks!

Yeah, I agree, if C++11 had any sort of code generation issue, I would think that it'd fall over immediately, or at least its stability wouldn't be affected by various debug messages

Looking at the tcp_smaller_group log, a normal cycle looks like this -

+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: ACK for 278663415, unacked->seqno 278663262:278663415
tcp_receive: removing 278663262:278663415 from pcb->unacked
tcp_receive: queuelen 1 ... 0 (after freeing unacked)
tcp_output: nothing to send (0x0)
tcp_write(pcb=0x2000f4c8, data=0x200022d4, len=153, apiflags=1)
tcp_write: queuelen: 0
tcp_write: queueing 278663415:278663568
tcp_write: 1 (after enqueued)
tcp_output_segment: 278663415:278663568

It's too bad there's no timing information attached to this output. Normally, a cycle takes less than a couple of milliseconds, but this one has the client retrying three times, so about three seconds between the start of the last message and the socket closure

+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: ACK for 278664027, unacked->seqno 278663874:278664027
tcp_receive: removing 278663874:278664027 from pcb->unacked
tcp_receive: queuelen 1 ... 0 (after freeing unacked)
tcp_output: nothing to send (0x0)
tcp_write(pcb=0x2000f4c8, data=0x200022d4, len=153, apiflags=1)
tcp_write: queuelen: 0
tcp_write: queueing 278664027:278664180
tcp_write: 1 (after enqueued)
tcp_output_segment: 278664027:278664180
+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: duplicate seqno 1575716817
tcp_output: sending ACK for 1575716829
tcp_output: nothing to send (0x0)
tcp_output: nothing to send (0x0)
+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: ACK for 278664180, unacked->seqno 278664027:278664180
tcp_receive: removing 278664027:278664180 from pcb->unacked
tcp_receive: queuelen 1 ... 0 (after freeing unacked)
tcp_output: nothing to send (0x0)
tcp_write(pcb=0x2000f4c8, data=0x200022d4, len=153, apiflags=1)
tcp_write: queuelen: 0
tcp_write: queueing 278664180:278664333
tcp_write: 1 (after enqueued)
tcp_output_segment: 278664180:278664333
+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: duplicate seqno 1575716829
tcp_output: sending ACK for 1575716841
tcp_output: nothing to send (0x0)
tcp_output: nothing to send (0x0)
+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: ACK for 278664333, unacked->seqno 278664180:278664333
tcp_receive: removing 278664180:278664333 from pcb->unacked
tcp_receive: queuelen 1 ... 0 (after freeing unacked)
tcp_output: nothing to send (0x0)
+-+-+-+-+-+-+-+-+-+-+-+-+-+- tcp_input: flags -+-+-+-+-+-+-+-+-+-+-+-+-+-+
tcp_receive: received FIN.
tcp_output: sending ACK for 1575716854
tcp_enqueue_flags: queuelen: 0
tcp_enqueue_flags: queueing 278664333:278664334 (0x1)
tcp_enqueue_flags: 1 (after enqueued)
tcp_output_segment: 278664333:278664333
tcp_input: packed for LISTENing connection.
tcp_parseopt: MSS
tcp_parseopt: other
tcp_parseopt: other
tcp_parseopt: NOP
tcp_parseopt: other
tcp_enqueue_flags: queuelen: 0
tcp_enqueue_flags: queueing 2429866508:2429866509 (0x12)
tcp_enqueue_flags: 1 (after enqueued)
tcp_output_segment: 2429866508:2429866508

I'm not completely sure that the client did not send a duplicate or it did send a duplicate because the server didn't send out the packets it was supposed to.

This is part of the capture that I think occurred at the same time (both capture and text are attached to my previous comment)

  51091 200.101586973  192.168.2.195         192.168.2.121         TCP      66          Query: Trans: 25004; Unit:   1, Func:   3: Read Holding Registers
  51092 200.309194392  192.168.2.195         192.168.2.121         TCP      66     [TCP Retransmission] 38852 → 502 [PSH, ACK] Seq=300037 Ack=3825460 Win=1072 Len=12
  51093 200.316429594  192.168.2.121         192.168.2.195         Modbus/TCP 207    Response: Trans: 25004; Unit:   1, Func:   3: Read Holding Registers
  51094 200.316662027  192.168.2.195         192.168.2.121         Modbus/TCP 66        Query: Trans: 25005; Unit:   1, Func:   3: Read Holding Registers
  51095 200.318554240  192.168.2.121         192.168.2.195         TCP      60     [TCP Dup ACK 51093#1] 502 → 38852 [ACK] Seq=3825613 Ack=300049 Win=1796 Len=0
  51096 200.729193013  192.168.2.195         192.168.2.121         TCP      66     [TCP Retransmission] 38852 → 502 [PSH, ACK] Seq=300049 Ack=3825613 Win=1072 Len=12
  51097 200.736426376  192.168.2.121         192.168.2.195         Modbus/TCP 207    Response: Trans: 25005; Unit:   1, Func:   3: Read Holding Registers
  51098 200.736626001  192.168.2.195         192.168.2.121         Modbus/TCP 66        Query: Trans: 25006; Unit:   1, Func:   3: Read Holding Registers
  51099 200.738503273  192.168.2.121         192.168.2.195         TCP      60     [TCP Dup ACK 51097#1] 502 → 38852 [ACK] Seq=3825766 Ack=300061 Win=1784 Len=0
  51100 201.565195954  192.168.2.195         192.168.2.121         TCP      66     [TCP Retransmission] 38852 → 502 [PSH, ACK] Seq=300061 Ack=3825766 Win=1072 Len=12
  51101 201.572431453  192.168.2.121         192.168.2.195         Modbus/TCP 207    Response: Trans: 25006; Unit:   1, Func:   3: Read Holding Registers
  51102 201.572616157  192.168.2.195         192.168.2.121         Modbus/TCP 66        Query: Trans: 25007; Unit:   1, Func:   3: Read Holding Registers
  51103 201.574491878  192.168.2.121         192.168.2.195         TCP      60     [TCP Dup ACK 51101#1] 502 → 38852 [ACK] Seq=3825919 Ack=300073 Win=1772 Len=0
  51104 202.573712933  192.168.2.195         192.168.2.121         TCP      54     38852 → 502 [FIN, ACK] Seq=300085 Ack=3825919 Win=1072 Len=0

So it seems that the client did send out a retransmission after 200ms of silence. Since it appears that the sequence numbers are relative within the client and server, doesn't seem possible to completely match up the STM logs and the wireshark logs without more analysis over longer sections.

pauluap commented 6 years ago

I just saw this - https://github.com/ARMmbed/mbed-os/pull/6241 Any relevance?

pauluap commented 6 years ago

Ok, update.

Just to get it out of the way, I quadrupled all the resource configuration items. As expected, that didn't make any noticeable impact.

What DID make an impact though was building -std=c++11 -O3. With the O3 build, I had it running for a hour before I stopped it. My wireshark capture had 3.6 million packets for an average of 1k packets per second and a 1090k bit/s average. I did see some client timeouts, socket closure and client retries, but the salient fact here is that the network stack didn't get corrupted.

So it isn't a C++11 thing after all (or at least it appears. Have to start hedging my bets here).

My feeling is that there's a race condition that leads to the stack bookkeeping getting corrupted. I had at least one trail where the stack didn't crash completely, it did (eventually) respond to pings and client retries did generate at least one response, but things were off kilter and being delayed constantly. Here's a screen capture of this, the capture itself is also attached. If you look at the timestamps of the RST packets, you'll see that they're separated on a human time scale - they're from me retrying the client on the command line wireshark_echo_then_retry 201803051047_repeated_retries.pcapng.zip

kjbracey commented 6 years ago

6241 is not relevant.

Wireshark has an option to show absolute sequence numbers - right click on the TCP header in a dissection window to get TCP options pop-up.

Delays strongly suggest either the TX or RX descriptor pointers in the Ethernet driver being out of sync. And coming and going between builds makes it smell very like #5720.

If it is exactly the same condition as #5720, the problem may go away if you force 32-byte alignment on the descriptors. You could mess around with the bit using __ALIGN_BEGIN at the top of stm32xx_emac.c. Maybe you could also try forcing misalignment

Looking at the driver again - both stm32xx_emac.c, and stm32f7xx_hal_eth.c - I think we should go through and add more DMBs. Basically every time we read or write the OWN bit, we are performing an acquire or release, so a DMB is required.

The way descriptor manipulations are shared between the HAL and the driver means they both need to worry about acquire/release semantics and barriers.

(acquire="load,dmb" to make sure we don't reorder anything before the check that the descriptor is available; release="dmb,store" to make sure we don't reorder anything after the store that gives the descriptor to DMA).

I only touched the immediate known failures in #5720 in the hope that the STM maintainers would inspect more fully and come up with a thorough fix, but nothing so far. So let's do it here.

Searching through:

Try those changes. I think that very last one is the actual failure point - most of those DMBs won't actually do anything on a chip as simple as the Cortex-M7, but that last one is vital to make sure the OWN bit is set in RAM when the DMA is restarted, else it will stop again immediately.

In #5720 we were very worried about alignment/timing - in the end we convinced ourselves the problem was alignment-related, so we did a crosscheck with adding NOPs instead of DMBs to make sure it wasn't just the case that an image layout change made it go away - it was specifically a NOP->DMB change that fixed it.

pauluap commented 6 years ago

Ok, I pushed two more commits to the issue branch in my repo. The first pass was the minimum of what you suggested. That actually made the -O3 build trigger a failure. I then went overboard with phase 2. There was already a bunch of alignment attributes but I went ahead and covered all that I could see even though the semaphores really has nothing to do with emac memory constraints, as well as more DMBs in the places you felt that they weren't strictly necessary. After the second phase, the -O3 build did run for 15 minutes before I stopped it. The -O0 build however failed just shy of five minutes.

I wasn't sure about forcing a misalignment. If I had a stable build and wanted to see if misalignment could make it unstable, then that could help pinpoint the problem, but I don't think we're there yet.

On the -O3 failure though, I caught a wireshark trace that I think buttress your theory that the TX/RX descriptor pointers got out of sync (I'm betting on TX because that's what's being repeated). In the screenshot below, you can see that the client accepted message transaction 10452, started a query for 10543, silence, push, push, give up. Then the STM still struggles to transmit transaction 10542, ARP dies, but comes back six minutes later. It took about 190 seconds from the first attempt at transmitting transaction 10452 to when it finally gave up. The stack could be losing the RX, but since I see dup acks from the client, could it be possible that the stack transmitted the packet, got the corresponding ack and cleared the ack requirement but somehow forgot to clear the packet itself? so the packet is stuck in the queue, it's not cleared because there's no ack for it to match up, when the client sends what it thinks are dup acks, the stack receives them but throws them out because the ack list has been cleared? I can't make any claims to know how things actually work though so take that with a grain of salt. 201803060321_post_arp_recovery_late_packets.pcapng.zip

Oh yeah, if you can get the STM folks to get rid of the trailing whitespace in their files, I wouldn't take it amiss :)

201803060323_post_failure_forgot_already_xmit

kjbracey commented 6 years ago

I think you missed the most (only?) important DMB.

We need a DMB before checking DMASR, same as HAL_ETH_TransmitFrame. ... that last one is vital to make sure the OWN bit is set in RAM when the DMA is restarted, else it will stop again immediately.

 /* Clear Segment_Count */
 EthHandle.RxFrameInfos.SegCount = 0;

 + __DMB();

 /* When Rx Buffer unavailable flag is set: clear it and resume reception */
 if ((EthHandle.Instance->DMASR & ETH_DMASR_RBUS) != (uint32_t)RESET) {
kjbracey commented 6 years ago

My own fault for being so long-winded - I had already fully written that, thinking only of the acquire/release semantics of the OWN bit, including a disclaimer that I couldn't spot the failure based on those DMBs.

Then I remembered that "kick the DMA process" was also a release store. And that was the one that was the actual problem for the transmit routine.

kjbracey commented 6 years ago

The trace is consistent with the ST device going deaf after hearing and responding to query 10452.

It sends the response, but doesn't hear the ack for it, so keeps retransmitting that response. The "spurious retransmission" note is indicative of "person sending this shouldn't have sent it if it heard all these packets in order".

And it doesn't hear query 10453, so the other end keeps retransmitting query 10453 - just marked "retransmission" because yes, it clearly isn't getting a response.

Eventually 195 decides 121 is on the blink so starts sending ARP queries, and gets no response, cos 121 is now deaf.

At some point while still trying to retransmit the response to 10452 the ARP cache times out on the client so it goes back to querying, and it isn't hearing the responses.

The missing DMB before the DMASR will achieve those symptoms:

kjbracey commented 6 years ago

The symptoms were similar for #5720 for writing, but easier to achieve - the TX DMA stops due to nothing to transmit. The problem occurred on the very first packet, trying to start TX. Eventually the second packet will cause DMA kick again, so it can recover. RX is harder to provoke as you need to fill the RX buffers, but once you've done it I think you're permanently stuck. Only the RX IRQ routine can kick the DMA, and no RX IRQs can occur if the DMA is stopped.

pauluap commented 6 years ago

Nah, the fault is mine as well. You did say it, I did read it, but my pattern matching skills has deteriorated a bit, been awake for awhile.

So, good news and bad news. I think we're getting somewhere. Now the STM isn't totally deaf, but something is still going awry. I also tried reverting both phases and only applying the DMB before the two DMASR reads, but again while not totally deaf, it's still on the blink.

The one with -O3 is the DMASR update with all the changes that I added - phase 3 so to speak. It did start limping and continued that way for a good while - so two screenshots here 201803060551_-03start_limping and finally fails enough that the client gives up completely 201803060551_-03finally_fail

The one with -O0 was built with just the DMB before DMASR. Didn't get totally deaf, but still bonkers 201803060550_-o0limp

Think that you're onto something though, the behavior is certainly different now.

kjbracey commented 6 years ago

I think this may be easy - HAL_ETH_RxCpltCallback only tells you "1 or more frames have been received", but the driver is using a semaphore and acting as if it was "exactly 1 frame received".

God I hate semaphores and their counting semantics. Never what you want.

Make _eth_arch_rx_task keep looping and calling _eth_arch_low_level_input until it returns NULL before going back to wait for the semaphore.

pauluap commented 6 years ago

Cool! going to do that - but just to make sure, I pushed my phase 3 commit so that it's clear what I did change.

kjbracey commented 6 years ago

Hmm, never spotted that TX underflow error kick. Brain currently can't compute what that achieves and what DMBs would be required...

pauluap commented 6 years ago

Heh, your secret mission of getting rid of semaphores did chalk up a win in equeue.

Sure seems like there's a bunch of issues wrapped up within this one, not just one single bug, the DMB - before - ETH_DMASR_RBUS and the eth_arch_rx_task semaphore sure seems to change the behavior.

Hopefully there's just one more issue to stomp out - this one is with -O0, comms cut out pretty abruptly, total deafness. 201803060626_dead_stop

Changes pushed in phase 4

kjbracey commented 6 years ago

So phase 4 improved behaviour with normal -O settings?

That loop until NULL wouldn't be correct if buffers were smaller than frames, so a frame could occupy multiple buffers - HAL_ETH_GetReceivedFrame doesn't distinguish between "there's nothing to read" and "I processed a buffer containing non-final part of a frame". Not sure how you would distinguish those.

Also not sure what the intended distinction between HAL_ETH_GetReceivedFrame and HAL_ETH_GetReceivedFrame_IT is. I see that the _IT one will actually work through multiple buffers until it gets a frame or sees reaches a DMA-owned descriptor. Maybe we should use that instead?

Maybe some sort of receive error? Are all possible RX error cases handled properly? Again, that would be more complex if multiple buffers per frame - could run out of descriptors mid-frame. Not sure what the conditions are without that. I guess errors in descriptors always have "LS" set - we're just not actually checking for them? What about in overall interrupt status - are we perhaps somehow failing to clear a sticky receive error condition so we get stuck?

pauluap commented 6 years ago

Yes, Phase 4 vastly improved the -O0 behavior, it used to die pretty fast, five minutes guaranteed, but I had seen death occur instantaneously. With Phase 4, it had been going for tens of minutes.

Hmm, I could use the logic analyzer to inspect RX errors if there are any and see if they're suspicious.

pauluap commented 6 years ago

Oh man, I was so ready to declare victory. I tried changing to HAL_ETH_GetReceivedFrame_IT and had a 30-minute run. I went to branch to squash and build a clean tree. I eliminated all the debug stuff I had added, took a test run, and it failed. It appeared that it had gone deaf 201803060832_gone_deaf_again

I then tried removing the while(not NULL) loop, but it failed with a different behavior 201803060838_without_loop_while_not_null

pauluap commented 6 years ago

Getting some sleep sure works wonders.

Allrighty, postmortem. TL;DR I got things working, looks good, probably need to trim the changeset to avoid performance regressions.

So, I have a few branches I'll refer to.

The one we've been debugging on https://github.com/pauluap/mbed-os/tree/lwip_cx11_issue Branch showing read contention issue commit https://github.com/pauluap/mbed-os/tree/stm32F7xx_isolate_read_contention Single commit changeset squash https://github.com/pauluap/mbed-os/tree/stm32F7xx_emac_one_changeset

So, I committed several crimes.

(acquire="load,dmb" to make sure we don't reorder anything before the check that the descriptor is available; release="dmb,store" to make sure we don't reorder anything after the store that gives the descriptor to DMA).

I was thinking stores. I ignored loads because I was thinking that by the time we get to anything interesting, the buffer would have been acquired. So that was the lightbulb moment, the original code https://github.com/pauluap/mbed-os/blob/91e6db1ea251ffcc973001ed90477f42fdca5751/features/FEATURE_LWIP/lwip-interface/lwip-eth/arch/TARGET_STM/stm32xx_emac.c#L315 does a buffer release, then reads from the descriptor. So rather than the issue being dmb,store, it's release,read. Not really anything to do with DMB if I understand things correctly. So anyway, firstly, the ordering is incorrect. Even if we swap the lines (316 and 317) that still doesn't guarantee code generation order. So I ended up adding a DMB, but really, I think that's overkill, DMB is the only memory barrier object that I know, there's probably a lighter one that can be used. Also, in the changeset we generated, there's an DMB after that if block, so that probably covers things.

So with all that settled, I went ahead and built -O{0, 1, 2, 3, s} All of them ran for a half hour before I stopped them. Signs were good, so I went ahead and ported the changeset to my project repo. I built my project completely C++11 with -O0. I ran it for a half hour and it worked. Then the REAL litmus test. I rebuilt my project to the original configuration of supporting 5 client threads. 5 turned out to be a bit much, but I have it running for more than a hour now with 4 threads. Averages are 1815 packets per second and 1964kbits/s. Not too shabby.

I did observe performance regressions, back when things were crashing, I could get packet transfer rates of 1015+ packets per second with the test app. With the fixes, that has dropped to just under 1k.

I think that of the changes, those that made observable impacts were: https://github.com/pauluap/mbed-os/commit/4d81d62949c2e5b0552eddc87198abcea123965d#diff-f70ce424a419205c3ab267622d0fd5bcL273 https://github.com/pauluap/mbed-os/commit/4d81d62949c2e5b0552eddc87198abcea123965d#diff-f70ce424a419205c3ab267622d0fd5bcL316 https://github.com/pauluap/mbed-os/commit/4d81d62949c2e5b0552eddc87198abcea123965d#diff-f70ce424a419205c3ab267622d0fd5bcL346

Although the change to HAL_ETH_GetReceivedFrame_IT was made after the eth_arch_low_level_input semaphore drain change. So that could have rendered the semaphore drain unnecessary. Also it's possible that not all the __DMB();s that I added were strictly necessary. I want to glory in the endrophins a while longer, so I'm not messing with minimizing the changeset. I also haven't looked through to see if there's any other similar issues - reads before releases that could get reordered or whatever.

If you have configurations that you'd like me to try out, I'm more than willing to do so.

You know of the story about Charles Steinmetz, the guy who billed Henry Ford an invoice for knowing where to mark the spot? - https://www.smithsonianmag.com/history/charles-proteus-steinmetz-the-wizard-of-schenectady-51912022/

I feel like you're Charles, all I need to do is keep feeding data, and you go all Midas, dropping nuggets of golden information. Thanks!

So I'll keep things running just to reassure myself. Please tell me what you need to do to close out this issue, anything I can do for the process?

kjbracey commented 6 years ago

Glad you've got it working, but I really don't think that re-order is a correct fix. I think we're not there yet.

I agree this is a bit cargo cult - one issue is that DMB isn't really correct anyway - it would be the tool if this were an SMP system with a coherent cache, and you were sharing lock-free data with another CPU, but you're actually talking to a peripheral. All this DMB stuff only really works because the cache is disabled, so it's kind of equivalent to that SMP case. The moment anyone turns the cache on, you'll need deeper system calls for cache coherence (but in the same sort of place). The "acquire" DMBs are currently effectively pointless, but mark a spot where you would have to have cache synchronisation. The "release" DMBs after writing to RAM are currently necessary, as they flush the M7's store buffer.

Now, this reordering:

dmarxdesc = EthHandle.RxFrameInfos.FSRxDesc;   
for (i = 0; i < EthHandle.RxFrameInfos.SegCount; i++) {
    dmarxdesc = (ETH_DMADescTypeDef*)(dmarxdesc->Buffer2NextDescAddr);
    __DMB();
    dmarxdesc->Status |= ETH_DMARXDESC_OWN;
}

That really no longer does the same thing. One thing to note is that we and the chip only actively manipulate the Status field of the descriptor, the buffers pointed to by the descriptor, and control registers. The other fields of the descriptor are intialised by us at start up, and never touched afterwards - both we and the chip read them, so no synchronisation is required after init. (For RX - for TX there's a length field we write to that the DMA reads).

I believe here we have just read the data from SegCount buffers/descriptors. All of those have been "acquired" by checking OWN was clear.

We now need to set the OWN bits again on those descriptors so the DMA can use them again (with a DMB between the buffer read and the OWN bit being set - this is a release store releasing the buffers), and this loop is simply following a fixed linked list to do that - Buffer2NextDescAddr is unchanged since we initialised it, and we don't need acquire semantics for it.

After this change, the list following is broken so we are not setting the OWN bit on the first descriptor we processed, and we are setting it on the descriptor after the last one we processed. Aren't we? That's going to have some weird effects - I guess it's managed to suppress the problem by throttling reception?

After giving you descriptor 3, it won't be able to reuse it until it loops round the buffer, gives you descriptor 2, and that prompts you to clear the OWN bit for descriptor 3. Not sure how you wouldn't mess up reception of descriptor 4 while processing 3 - you'd potentially be setting the OWN bit for it after the hardware had cleared it, which would block further reception. Maybe you process 3 fast enough to avoid that failure?

I think this has told us something, but I'm not sure what. By doing this with the OWN bits I think you have throttled the hardware to 1-packet-at-a-time, which has apparently managed to avoid the faulty path. I believe that after the first cycle through the descriptors, the hardware will always be suspended as you enter the RX IRQ handler - it tries to pre-acquire the next descriptor ready for the next frame immediately after each reception. We've eliminated some sort of race?

I'm also staring at GetReceivedFrame_IT to figure out how it differ from the non-IT. I think the code has just forked - people have been editing only 1 for bug fixes. I think it's only the loop that's supposed to be the difference, but I don't think the differences affect us, as I think we should only ever see both FS and LS set. (Maybe you could stick asserts in to check we don't go down the other paths?)

kjbracey commented 6 years ago

Ah, one little side thought - the acquire DMBs aren't totally pointless - they may not achieve anything useful on the hardware, but they do also function as compiler barriers, which could conceivably be important. There's a slight lack of volatile which the DMBs should overcome.

kjbracey commented 6 years ago

It's also interesting that throttling to 1-at-a-time on RX buffers has so little effect on performance. Suggests the driver really should not be trying to use multiple buffers on embedded systems. The chip apparently has a 2K RX FIFO, giving it quite a bit of buffer space in hardware.

Indeed, that 2K may go further than the 4 descriptors we've configured, as it's capable of putting multiple small frames in there, whereas the driver will use up 6K of buffers for 4 of your 66-byte query frames.

pauluap commented 6 years ago

Ok, not too surprising that I'm wrong. Right thought, wrong context.

Anyway, I just tried going back to this implementation which originally 'worked'

    dmarxdesc = EthHandle.RxFrameInfos.FSRxDesc;
    /* Set Own bit in Rx descriptors: gives the buffers back to DMA */
    for (i = 0; i < EthHandle.RxFrameInfos.SegCount; i++) {
        __DMB();
        dmarxdesc->Status |= ETH_DMARXDESC_OWN;
        __DMB();
        dmarxdesc = (ETH_DMADescTypeDef*)(dmarxdesc->Buffer2NextDescAddr);
    }

Running with a single client, it still worked fine, but with multiple clients this immediately died. I guess it was pure luck that I "fixed" it or I would have given up on that line of purist when the multiclient config failed. The "wrong" implementation had been running for > 6 hours. I agree it's wrong, nice bonus that it leads to interesting datum like you said. At least this won't kill us like xrays did the Curies

Ok, so I put in pin asserts in the three paths of GetReceivedFrame_IT All three paths did get entered. Interestingly enough, when I tried with both variants of buffer release, the 'wrong' variant hit GetReceivedFrame much more frequently than the bookend-DMB variant.

pauluap commented 6 years ago

Oh man, I just realized that I was actually changing from a(); a = a->next; to a=a->next; a(); Yeah, that would be totally wrong. I'll totally take it as the last-resort workaround though.

kjbracey commented 6 years ago

I see HAL_ETH_IRQHandler clears the interrupt bits after calling the driver.

I believe it should be doing that before calling the driver. Try making that change.

Cf the same mistake being made in STM serial code here: https://github.com/ARMmbed/mbed-os/pull/4734

pauluap commented 6 years ago

mm, maybe I'm piling on too many changes? I took a look at HAL_ETH_IRQHandler. I firstly moved the NIS clearing to the top of the function. Didn't work, I then moved the individual clearings to the top of their respective scopes. That didn't help either. I moved the NIS clearing back to its original location, leaving the _IT_T, IT_R, FLAG_AIS clearings before the driver calls. No dice.

It seems that when I first start up a single client, theres' no issues (I didn't wait long enough to be sure though) but once I start the second client, things go wrong quickly thereafter.

pauluap commented 6 years ago

Hmm, I looked at #4734 more closely, wasn't the issue there more that manual clearing shouldn't have occurred in the first place, that it would drop bytes?

kjbracey commented 6 years ago

I should probably alert ST guys to this ongoing investigation, in case they have input - don't think they've been mentioned yet: @ARMmbed/team-st-mcd

Anyway...

All three paths did get entered.

Really? So we're seeing finished (OWN=0) descriptors with [FS] and [0], not just [FS|LS]? Are you sure? Everything I can see suggests that should not happen, because our buffers should be big enough to store entire frames. If that's not true, we need to get to the bottom of it.

firstly moved the NIS clearing to the top of the function.

Ah, I didn't see that - I was only looking at the R and T bits, and thinking of clearing them just before driver calls. The NIS and AIS bits look tricky to the point of unhelpfulness. Fortunately not relying on them...

I guess the ordering doesn't actually matter because we are deferring work to a thread, and as long as that thread processes all data, there's no problem.

It's not quite the same case as #4734, but similar. The problem would be let's say the IRQ handler does all possible work, then an IRQ happens, then the IRQ is cleared, we could be stuck (or at least delay processing). In this case the IRQ handler only actually sets a semaphore - it will do all available work later after we return from the interrupt regardless of the order. Doesn't matter if we clear a 2nd RX IRQ.

kjbracey commented 6 years ago

FWIW, if the race was eliminated by 1-at-a-time, you should be able to achieve the same effect "cleanly" by setting ETH_RXBUFNB to 1 in stm32f7xx_hal_conf.h, if everything is working as intended. [Although wouldn't be at all surprised if that edge case hadn't been exercised]

kjbracey commented 6 years ago

Another thought - the "loop until NULL" in rx_task isn't quite right as low_level_input can return NULL due to being out of LWIP pool memory for received packets, as well as due to there being nothing in the hardware.

That would mean it could grind to a halt after an lwIP memory exhaustion.

You'll need to restructure. My suggestion - make it so HAL_ETH_GetReceivedFrame_IT is called in the rx_task, so it can see the result of that, and loop on it. Then low_level_input is called when we know there is a frame to process in EthHandle.RxFrameInfos, and that only returns NULL on memory failure.

pauluap commented 6 years ago

Ok, I returned to the minimal working example to look at the path selection of GetReceivedFrame_IT - making the changes here - https://github.com/pauluap/mbed-os/commit/5e465ee2906da3c3ba82e89c54b8eb3425f01084#diff-3e29886dbfca11481d3c74806b1add30L894 - ugh, sorry about the white space noise. Here, only the proxy_D2 one got entered image

This, however, is the output for my application which is set up the same as the MWE one... image And it's not like as if one signal got sent to three pins - image

Ok, I'll revert everything and see what's up with ETH_RXBUFNB. Then I'll look into your latest suggestion. Is there any of our changes so far that you feel that we positively must keep?

I suppose all the DMBs can stay, they won't actually hurt anything. So far the only actual code change we've made (I think?) is the switch to GetReceivedFrame_IT and the "loop until NULL"

On second thought, I won't revert everything, just the loop until null part, leave the DMBs as buffers. can potentially eliminate them later when we have something we're both happy with.

pauluap commented 6 years ago

No, wait, I think that the report on my app is junk, all three signals should only be high for a single NOP, the logic analyzer output for the app showing all three branches being entered is junk

kjbracey commented 6 years ago

There's no UI for it, but add &w=1 to the GitHub URL to suppress whitespace changes.

The DMBs can't hurt. The DMB before kicking the "poll demand" register to restart reception would be essential, if it weren't for the fact the hardware is supposed to restart itself on next frame. (It was essential for TX, where it doesn't automatically restart.)

It's vital that you process all data after getting 1 semaphore - the loop and the GetReceivedFrame_IT are both part of doing that, and that last suggestion should tighten that up further. It's essential we keep processing descriptors until we see OWN=1, and don't stop on an intermediate descriptor or memory exhaustion.

Okay, I'll try to figure out what's going on with those other paths being entered. I'm far from convinced the driver/HAL code is correct if they are, and I want to figure out why it's happening. Could you maybe get it to print out the whole Status word and DMASR when it goes into the unwanted paths?

kjbracey commented 6 years ago

Ah, yes, D0 and D1 are your serial port.

kjbracey commented 6 years ago

BTW, just throwing packets away immediately on memory exhaustion isn't ideal - we could conceivably achieve slightly better "short burst" performance by letting them stay in the RX buffers and trying to acquire pbufs again later. The problem is we don't get a cue as to when memory might be available again (afaik). We'd need a timer or something.

pauluap commented 6 years ago

Actually, no :) https://github.com/pauluap/mbed-os/commit/5e465ee2906da3c3ba82e89c54b8eb3425f01084#diff-56f071be94a4ae4d2962bfb796d8c5f5R42. I numbered them the way my logic analyzer saw them, not the pins on the board :) I'll figure that one out too, make sure that I'm clean there.

Ok, so ETH_RXBUFNB didn't work. I tried in both my MWE and my app, from a clean setup to keeping the DMBs No luck. I'll verify the paths and see if I can pull off the Status/DMASR

pauluap commented 6 years ago

oh, I WAS using the "D0" and "D1" elsewhere in the code for old debug stuff.

pauluap commented 6 years ago

Ok, you're validated, only one branch in GetReceivedFrame_IT is entered. I was getting signals from an unrelated debug operation.

I'll see about the rx_task restructure

pauluap commented 6 years ago

My refactor attempt didn't work out - https://github.com/pauluap/mbed-os/commit/4a736796568d8f9bccf4f79917925d0bf011aa78

kjbracey commented 6 years ago

Code looks fine. Sad it didn't work. Running out of suggestions.

Maybe the DMA RX has shut down altogether to some sort of unanticipated/unhandled error condition. Perhaps put in an ErrorHandler callback and check DMASR bits?

I note that neither the AIS nor Transmit interrupts are enabled so both those callbacks are kind of dummy at present.

Try turning enabling AIS and all the receive related subinterrupts in ETH_MACDMAConfig with __HAL_ETH_DMA_ENABLE_IT: AIS, RO, RPS, RWT, FBE. (but not RBU, which we expect).

kjbracey commented 6 years ago

Also, along the lines of the "dummy sigio" - how making the semaphore wait have a timeout so it polls occasionally, to survive interrupt loss? Including some sort of report to confirm that it did indeed find new data after a semaphore timeout.

pauluap commented 6 years ago

I sure was failing about at the end of the last session, glad I didn't drop a hot load in my shotgun and blow myself up.

Had a bit of a repeat rollercoaster start today.

Today, I thought that it was time to step back - when in doubt, collect more data. As I was going about that, I thought that I might as well as try the non-IT version of GetReceivedFrame. To my surprise, it actually worked on my minimal working example. I whittled things down to a minimal changeset pauluap@fb0d0ac19bc22e1f26d57731e48c3adbde8453ac

That performed admirably on the MWE build with -O3. Performance regressions are slight if at all - I saw 1008 packets per second.

I think that the whole series of events yesterday led to one amazing red herring. I think that I can reconstruct what happened

  1. things are broken
  2. I added the second bracketing __DWB() around the buffer release at the same time as I made the change from GetReceivedFrame to GetReceivedFrame_IT.
  3. This led to the magic generation output such that timings line up - sorta like how the gnu++98 codegen did it, I guess. I think that it is the fix
  4. Dropping the second __DWB(); from the release loop screws the magic, I think that's the smoking gun.

And then the comedy show starts. I'm still embarrassed about throttling the DMA channels, I was treating each line as black boxes and didn't actually LOOK to see what they said.

So ultimately, I think that my successful long test trails with different builds were a combination of two factors

  1. The DMA channels were screwed
  2. The loop in GetReceivedFrame_IT only executed once, because the DMA channels were screwed, it never got the opportunity to break things.

So... there we have it. At that point, I thought that we were good, but of course I should have known.

I transferred the minimal working change to my actual project and... umm, no, not the final fix.

Since I thought that this correcter fix didn't bork the DMA channels like my original unfix did, going down to single RX and TX buffers would eliminate any buffer management and work. I found out doing that almost immediately led to complete deafness, similar to one of the wireshark screencaps up above. I made the same changes in the MWE - set the TX and RX buffers to 1. Immediate deafness. Cool, good opportunity to collect data. I attached the following signals pauluap@f8951efbf844084fc2964cd5863c836fab59a1d5

Do we have the smoking gun? 201803071130_logic_analyzer_capture The sole pbuf-not-available assert stands out doesn't it? I believe that the gap in the trace corresponds to packet 24-25 201803071129_highlight_gap_in_logic_capture

Zooming in to a 'normal' sequence - which isn't actually normal because it's already screwed up its TCP state - 201803071130_normal_receive_maybe And the end 201803071130_oops_pubf_not_avail_is_fatal

One thought - I say "pbuf not available" - but that last semaphore signal seems pretty long, so it appears there were more activity in _eth_arch_low_level_input than normal. If it actually is due to pbufs not being available, how do I increase the pbuf resource directly? There's the MBED_CONF_LWIP_*_MAX stuff but they set up other things don't they? Would it be possible to just expand the pbuf pool alone?

As for the dummy sigio, hmm... yeah, I'll try that, but it seems that the TCP state mixup.. hmm, maybe the initial gap thing is a hint, I'll do that.

pauluap commented 6 years ago

Update: No, dummy sigio isn't the final piece. image The TCP state did get screwed too, so seems that we're dropping something somewhere? image

pauluap commented 6 years ago

I changed the TX buffer to 4 and left the RX buffer at 1. There was an initial misstep at the start image But afterwards, it did work till I stopped it. When I stop and retry, I don't always see the misstep, maybe half of the time.

Transferring the change to my application, things seemed to work with a single client, but died after a second client was added.

kjbracey commented 6 years ago

Aha. Clue!

TX buffers==1 is apparently making it transmit everything twice. Maybe this is a hardware glitch. I note that the HAL does not set the TER "transmit end of ring" bit on the last descriptor, but just links back to the first one. I think that might confuse the hardware - it may be preloading the next (ie same) descriptor and not in effect seeing the "OWN" bit it just wrote. Change HAL_ETH_DMATxDescListInit to set ETH_DMATXDESC_TER in Status on the last descriptor (and this should mean you don't need to set Buffer2NextDescAddr).

So, given that observed glitch for transmit without TER - receive descriptors do not end with RER. Maybe that's causing a problem? Definitely would with 1 descriptor.

lwip.pbuf-pool-size would increase the number of buffers, so would lift this limit.

I believe lwip attempts some space reclaiming if it sees that the pbuf pool is exhausted - the allocation failure may be triggering extra work inside lwip.

But we've apparently broken the driver here, right, not lwip? We're not getting any more receive interrupts at all after pbuf exhaustion, right?

pauluap commented 6 years ago

I tried this - pauluap@eda732f320de73f3b33733f263d1b5567152c907

It seems to have killed all TX, ARP doesn't work, don't think DHCP works either. I tried with 1 and 4 TX buffers.

Yes, I agree that the driver has been broken. I'm not sure if it's JUST the RX interrupts though. If that's the case, then the dummy sigio should have worked right? There's no more receive frames available.

lwip - I don't think that what I have seen so far points to anything within lwip, I mean we've observed changes just by messing with the driver, including "successful" comms, so seems like if the stream is good, lwip is good (?)

Hmm, I could also track what paths are being taken in HAL_ETH_GetReceivedFrame after pbuf exhaustion

pauluap commented 6 years ago

image

This is with one TX/RX buffer, without the TER change. So after PBUF exhaustion, no more OWN bits. Does that mean there's no interrupts?

pauluap commented 6 years ago

Does netif->input() reset rx interrupts? if a pbuf is simply not available and we "decide" to drop it, then theoretically, that shouldn't stop RX interrupts right?