apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0
2.79k stars 1.16k forks source link

Assertion fails while freeing IOBs. #8533

Open fjpanag opened 1 year ago

fjpanag commented 1 year ago

I constantly get the following assertion to fail:

mm/iob/iob_free.c:100

      /* Copy and decrement the total packet length, being careful to
       * do nothing too crazy.
       */

      if (iob->io_pktlen > iob->io_len)
        {
          /* Adjust packet length and move it to the next entry */

          next->io_pktlen = iob->io_pktlen - iob->io_len;
          DEBUGASSERT(next->io_pktlen >= next->io_len);
        }

At least for me, the failure seems to be triggered while I am using a custom TCP server (mostly when connections are closed).

I haven't been able to find any specific commit that is suspicious.

Unfortunately, I am also not able to reproduce the issue reliably.
It randomly happens, sometimes almost immediately when connections are closed, sometimes everything seems fine for extended periods of time.

I would appreciate any hint as to what to look for to fix this.


As a reference, this is the stack trace at the moment of the failure:

image

(Moved from the apps repo, where I accidentally opened the issue.)

xiaoxiang781216 commented 1 year ago

@anchao could you take a look?

anchao commented 1 year ago

@fjpanag which defconfig are you using? emm you can also share the nuttx/.config if you have config changes

fjpanag commented 1 year ago

@anchao I am testing on a custom app, and I am not using any of the defconfigs in apps repo.

I just managed to have this issue triggered on my MQTT broker too.
I am running a "stupid" stress-test script on the broker, and it fails.

I can trigger this reliably with CONFIG_NET_ALLOC_CONNS enabled, but not without it.
Maybe it is just a coincidence, but according to the behaviour that I am experiencing I am almost convinced that dynamic allocations are the problem.

If you really need to, I can provide you with a defconfig, similar to what I am using, but I am not sure whether I will manage to reproduce this on any of the standard apps.

fjpanag commented 1 year ago

Attached you may find the stress-test script.

Note! If you run the script, and leave it alone, everything seems OK.

To reproduce the issue I have to run the script, kill it ([Ctrl]-C), and restart it, rapidly and repeatedly.
After 2-3 quick restarts of the script, the issue appears.

If you think it may help, I can try to get some good Wireshark captures of the traffic during the issue.

But for the moment, I haven't been able to understand how all these are related, and where to concentrate my troubleshooting.

stress_test.txt

anchao commented 1 year ago

Thanks for the detailed information, I will try to setup the MQTT environment to reproduce this issue, and please also provide the defconfig your mentioned, I need confirm which options are selected abort CONFIG_NET_TCP* part

fjpanag commented 1 year ago

@anchao The attached defconfig reproduces the issue.

defconfig.txt

fjpanag commented 1 year ago

@anchao Did you have and progress on this?

Or maybe, can you share any thoughts so I can help troubleshooting this more effectively?
You may suggest any more ASSERT()'s to be placed, so I can test and help pin-point the issue?

anchao commented 1 year ago

@anchao Did you have and progress on this?

Or maybe, can you share any thoughts so I can help troubleshooting this more effectively? You may suggest any more ASSERT()'s to be placed, so I can test and help pin-point the issue?

I've added some debug info but can't seem to reproduce the issue successfully, can you enable some debug info to reproduce the issue?

CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_WARN=y
CONFIG_IOB_DEBUG=y

and add printing of ipv4 proto:

diff --git a/net/devif/ipv4_input.c b/net/devif/ipv4_input.c
index 7a04d96dc2..3defcdd4e4 100644
--- a/net/devif/ipv4_input.c
+++ b/net/devif/ipv4_input.c
@@ -398,6 +398,8 @@ static int ipv4_in(FAR struct net_driver_s *dev)

   /* Now process the incoming packet according to the protocol. */

+  syslog(1, "ipv4 input proto: %d\n", ipv4->proto);
+
   switch (ipv4->proto)
     {
 #ifdef NET_TCP_HAVE_STACK
fjpanag commented 1 year ago

@anchao sorry it took me so long... I had some "distractions", but I also spent some time on this issue.

I am testing against a3a1883787a873577c96c6dd1723b139f46f5e52. I can reproduce the issue much easier on this one, rather than the latest revisions. (I believe that specific timings make this bug more or less apparent).

Anyway, I added the print in ipv4_in() as you requested. The output is always:

[27/03/23 18:59:58] [ 1] [ ALERT] ipv4 input proto: 6

No other protocol has been observed near the time that the assertion fails.

I tried to also enable CONFIG_IOB_DEBUG.
However, when this option is enabled (and thus syslog buffer disabled), the issue is not reproducible any more.

I am still trying, but in the meantime you may provide me with other suggestions, to try and narrow down on the root cause.