Closed hlef closed 1 month ago
It looks as if our BearSSL submodule is pointing at a commit that's after the release branch. I think we did that because there was a bug fix that we needed after the release, but apparently the server has changed its configuration to not permit clones of individual commits that are not branches / tags?
@davidchisnall I have addressed all comments, and rebased the PR. This is ready for another round of review :slightly_smiling_face:
For the record, this was addressed in https://github.com/CHERIoT-Platform/network-stack/pull/28
As discussed with @davidchisnall, this looks fine to merge now. It's working well. It has known limitations, which are documented in the `README.md`. I will open an issue to keep track of these limitations, and ensure that we end up addressing them.
The TCP/IP stack currently has a lot of state and cannot recover gracefully in case of failure. This commit addresses this limitation.
We introduce a custom error handler for the TCP/IP stack. At a high level, this error handler terminates all user threads currently present in the TCP/IP compartment and resets the IP thread to its entry point `ip_thread_entry`. It does so by setting all the locks of the compartment for destruction and notifying all futex waiters. Then it frees all heap allocations and resets globals. After all the state has been cleaned up, it restarts the network stack.
At this stage, this is more of an RFC than a proper PR for merge. Some documentation TODOs are still there, which I will process in the coming days.
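To make the ordering of the reset steps described above easier to follow, here is a minimal host-side sketch. It is not the code in this PR: `MockLock`, `MockCompartmentState`, and `reset_network_stack` are hypothetical stand-ins that only model the sequence (locks set for destruction, waiters notified, heap freed, globals reset, stack restarted), and none of them are RTOS or FreeRTOS+TCP APIs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins: none of these names come from the network stack
// or the RTOS. They only model the order of the reset steps.
struct MockLock
{
    bool destroyed = false; // Set once the lock is marked for destruction.
    int  wakeups   = 0;     // Counts notifications sent to waiters.
};

struct MockCompartmentState
{
    std::vector<MockLock> locks;
    std::vector<int>      heapAllocations;   // Stand-in for live allocations.
    int                   globalCounter = 0; // Stand-in for mutable globals.
    int                   restarts      = 0; // Number of restarts performed.
};

// Runs the reset steps in the order given in the PR description.
inline void reset_network_stack(MockCompartmentState &state)
{
    // 1. Mark every lock for destruction and wake its waiters, so that
    //    blocked user threads leave the compartment instead of sleeping.
    for (auto &lock : state.locks)
    {
        lock.destroyed = true;
        lock.wakeups++;
    }
    // 2. Free all heap allocations owned by the compartment.
    state.heapAllocations.clear();
    // 3. Reset globals to their initial values.
    state.globalCounter = 0;
    // 4. Restart the stack only once all state has been cleaned up.
    assert(state.heapAllocations.empty());
    state.restarts++;
}
```

The point of the sketch is the ordering: the restart happens last, after nothing can still reference state from before the crash.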
This PR was tested as follows:
Note that this ignores the first ping received by the system as one ping may be sent by the network in the first boot phase. You may thus need to send two pings to trigger the crash depending on your network setup. Feel free to adjust this.
The TCP/IP stack should detect that the crash happened with the lock held, forcefully unlock it, and continue its business.
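One way to picture the "detect and forcefully unlock" behaviour is a lock word that records its owner. The sketch below assumes such an owner-tracking lock. `OwnedLock` and its methods are hypothetical names for illustration, thread IDs are assumed to be non-zero, and the PR may detect the held lock differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical owner-tracking lock, not the lock type used by the stack.
struct OwnedLock
{
    // 0 means unlocked; any other value is the ID of the owning thread.
    std::atomic<uint32_t> owner{0};

    // Acquires the lock for threadId if it is currently free.
    bool try_lock(uint32_t threadId)
    {
        uint32_t expected = 0;
        return owner.compare_exchange_strong(expected, threadId);
    }

    // Intended to be called from an error handler: releases the lock only
    // if the crashed thread was the one holding it. Returns true if the
    // lock was released.
    bool force_unlock_if_owned_by(uint32_t crashedThreadId)
    {
        uint32_t expected = crashedThreadId;
        return owner.compare_exchange_strong(expected, 0u);
    }
};
```

With this shape, a crash in thread 3 while it holds the lock can be cleaned up with `lock.force_unlock_if_owned_by(3)`, while a lock held by another thread is left untouched.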
Unmerged dependencies in the main RTOS tree. We need the following two PRs, which will be merged soon:
Known issue:
The timeout of `malloc` is still not quite right: https://github.com/microsoft/cheriot-rtos/blob/main/sdk/include/stdlib.h#L96
The value of 30 is generally enough; however, in one instance it was still too low and triggered a hang during reset. We should consider increasing this value at a later point, or find a more bulletproof solution. You can easily detect such hangs with the following patch to FreeRTOS+TCP: