genodelabs / genode

Genode OS Framework
https://genode.org/

lxip: fix remaining failing tests since version upgrade #5165

Closed ssumpf closed 1 month ago

ssumpf commented 1 month ago
trimpim commented 1 month ago

@ssumpf One regression we encountered with the new lxip is as follows:

We have a management agent that communicates with the Azure IoT cloud stack. One of the tests we perform is an echo test where the cloud sends a message and expects it to be replied by the client. If the message is larger than ~64KiB, the new stack fails to send the reply. Receiving the message looks OK, but the reply isn't sent, and the cloud stack tries to send its message over and over again after a timeout.

I wasn't yet able to create a simple regression test for this. Debugging this in the complex scenario is tedious because the whole system produces a lot of network traffic and all communication with the cloud needs to be encrypted.

With vfs_lwip and the old vfs_lxip the test succeeds.

cnuke commented 1 month ago

@nfeske please merge c1c5d1b as it addresses the build error on arm_v8a.

EDIT: and should address the runtime issues as well.

nfeske commented 1 month ago

Thanks a lot!

trimpim commented 1 month ago

@ssumpf I'm still analyzing my problem.

I was able to track the sending path of the affected program down to a call to SSL_write().

I'm currently analyzing what the Azure library does, since in Wireshark it looks like the connection is closed by the client.

ssumpf commented 1 month ago

> @ssumpf One regression we encountered with the new lxip is as follows:
>
> We have a management agent that communicates with the Azure IoT cloud stack. One of the tests we perform is an echo test where the cloud sends a message and expects it to be replied by the client. If the message is larger than ~64KiB, the new stack fails to send the reply. Receiving the message looks OK, but the reply isn't sent, and the cloud stack tries to send its message over and over again after a timeout.

@trimpim: Please revert 840da5d in case you have it applied, since it breaks large packet transmissions in general. I am working on a fix. Otherwise, 64KiB is the maximum TCP segment size, so it might be an issue with segmentation. As a shot in the dark, could you comment out the following line: https://github.com/genodelabs/genode/blob/36a52c6886906614e5d3d37f8a84cb66940abbfc/repos/dde_linux/src/lib/lxip/lx_socket.c#L397 and see whether this changes things? Simply ignore all the "socket interface call blocked" warnings.

trimpim commented 1 month ago

@ssumpf late yesterday I had arrived at the conclusion that the error is EAGAIN. The problem in debugging this issue was that as soon as I added debug output in the write path of the VFS plugin or in any of the involved libraries, the problem would "self heal": the connection would be interrupted, but after a few retries it would be re-established. I worked around this by creating some small helpers that append the log messages to a list and print them from locations that aren't involved in the write path.

Reverting 840da5d alone doesn't solve the problem. If I comment out the line you suggested, the test succeeds.

ssumpf commented 1 month ago

@trimpim: Thanks, that is what I thought. Could you re-apply 840da5d, comment in the if statement marked TODO, and restore the MSG_DONTWAIT flag? Does it work in this case?

trimpim commented 1 month ago

@ssumpf with MSG_DONTWAIT enabled and the `throw Would_block();` enabled, the test also succeeds without any connection interruptions.

Thanks for your efforts.

ssumpf commented 1 month ago

@trimpim:

> The problem in debugging this issue was that as soon as I added debug output in the write path of the VFS plugin or in any of the involved libraries, the problem would "self heal".

We call them Heisenbugs :sunglasses:

chelmuth commented 1 month ago

All available commits entered master. Could we close this issue?

trimpim commented 1 month ago

@chelmuth from my point of view yes.

ssumpf commented 1 month ago

@chelmuth: Since the netperf issue turned out to be related to timing, here we go.