CHERIoT-Platform / network-stack

5 stars 1 forks source link

Restart the TCP/IP stack on crash. #27

Closed hlef closed 1 month ago

hlef commented 3 months ago

The TCP/IP stack currently has a lot of state and cannot recover gracefully in case of failure. This commit addresses this limitation.

We introduce a custom error handler for the TCP/IP stack. At a high-level, this error handler terminates all user threads currently present in the TCP/IP compartment and resets the IP thread to its entry point ip_thread_entry. It does so by setting all the locks of the compartment for destruction and notifying all futex waiters. Then it frees all heap allocations and resets globals. After all the state has been cleaned up, it restarts the network stack.

At this stage, this is more of a RFC than a proper PR for merge. Some documentation TODOs are still there, which I will process in the coming days.

This PR was tested as following:

diff --git a/source/FreeRTOS_ICMP.c b/source/FreeRTOS_ICMP.c
index 78cc9c6..226866f 100644
--- a/source/FreeRTOS_ICMP.c
+++ b/source/FreeRTOS_ICMP.c
@@ -71,6 +71,8 @@

 #if ( ipconfigREPLY_TO_INCOMING_PINGS == 1 ) || ( ipconfigSUPPORT_OUTGOING_PINGS == 1 )

+static int counter = 0;
+
 /**
  * @brief Process an ICMP packet. Only echo requests and echo replies are recognised and handled.
  *
@@ -98,6 +100,16 @@
             /* coverity[misra_c_2012_rule_11_3_violation] */
             ICMPPacket_t * pxICMPPacket = ( ( ICMPPacket_t * ) pxNetworkBuffer->pucEthernetBuffer );

+       if (counter > 0)
+       {
+               printf("(!) triggering a crash\n");
+               *((volatile int *) 0x0) = 0;
+       }
+       else
+       {
+               counter++;
+       }
+
             switch( pxICMPPacket->xICMPHeader.ucTypeOfMessage )
             {
                 case ipICMP_ECHO_REQUEST:

Note that this ignores the first ping received by the system as one ping may be sent by the network in the first boot phase. You may thus need to send two pings to trigger the crash depending on your network setup. Feel free to adjust this.

diff --git a/lib/tcpip/network_wrapper.cc b/lib/tcpip/network_wrapper.cc
index 00510a6..8201159 100644
--- a/lib/tcpip/network_wrapper.cc
+++ b/lib/tcpip/network_wrapper.cc
@@ -483,6 +483,7 @@ int network_host_resolve(const char     *hostname,
      -1 /* invalid if we are restarting */);
 }

+static int counter = 0;
 SObj network_socket_create_and_bind(Timeout       *timeout,
                                     SObj           mallocCapability,
                                     bool           isIPv6,
@@ -562,6 +563,16 @@ SObj network_socket_create_and_bind(Timeout       *timeout,
              localAddress.sin_family = Family;
              auto bindResult =
                FreeRTOS_bind(socket, &localAddress, sizeof(localAddress));
+             if (counter == 1)
+             {
+                 counter++;
+                 printf("(!) triggering a crash in socket create\n");
+                 *((volatile int *)0x0) = 0;
+             }
+             else
+             {
+                 counter++;
+             }
              if (bindResult != 0)
              {
                  // See above comments.

The TCP/IP stack should detect that the crash happened with the lock held, forcefully unlock it, and continue its business.

Unmerged dependencies in the main RTOS tree. We need the following two PRs, which will be merged soon:

Known issue:

diff --git a/source/FreeRTOS_TCP_WIN.c b/source/FreeRTOS_TCP_WIN.c
index 8296657..65a02cb 100644
--- a/source/FreeRTOS_TCP_WIN.c
+++ b/source/FreeRTOS_TCP_WIN.c
@@ -430,6 +430,7 @@

             if( xTCPSegments == NULL )
             {
+                printf("Allocation of sectors failed.\n");
                 FreeRTOS_debug_printf( ( "prvCreateSectors: malloc %u failed\n",
                                          ( unsigned ) ( ipconfigTCP_WIN_SEG_COUNT * sizeof( xTCPSegments[ 0 ] ) ) ) );
davidchisnall commented 3 months ago

It looks as if our BearSSL submodule is pointing at a commit that's after the release branch. I think we did that because there was a bug fix that we needed after the release, but apparently the server has changed its configuration to not permit clones of individual commits that are not branches / tags?

hlef commented 3 months ago

@davidchisnall I have addressed all comments, and rebased the PR. This is ready for another round of review :slightly_smiling_face:

hlef commented 3 months ago

It looks as if our BearSSL submodule is pointing at a commit that's after the release branch. I think we did that because there was a bug fix that we needed after the release, but apparently the server has changed its configuration to not permit clones of individual commits that are not branches / tags?

For the record, this was addressed in https://github.com/CHERIoT-Platform/network-stack/pull/28

hlef commented 1 month ago

As discussed with @davidchisnall, this looks fine to merge now. It's working well. It has known limitations, which are documented in the README.md. I will open an issue to keep track of these limitations, and ensure that we end up addressing them.