esnet / iperf

iperf3: A TCP, UDP, and SCTP network bandwidth measurement tool

The iperf3 server hangs while printing the report #1735

Open RizziMau opened 1 month ago

RizziMau commented 1 month ago

Context

I am using iperf3 in a product that tests the UDP data throughput of mobile networks. Sometimes the iperf3 server hangs while printing the report, and all subsequent tests fail with the message "iperf3: error - the server is busy running a test. Try again later."

The only solution I have found is to kill and restart the iperf3 server.

Bug Report

The issue is not systematic but occurs after several hours of operation. Once the iperf3 server hangs, it does not recover on its own and has to be killed and restarted.

The iperf3 server is started with the following command:

iperf3 --server --interval 0 -p 5202 -1

Note that the "-1" option causes iperf3 to exit after one transfer, but a daemon restarts it after 2 seconds.
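To make the setup concrete, here is a minimal Python sketch of a restart loop for a one-off server. It is not the daemon actually used here: the name supervise, the max_runtime watchdog, and the max_runs limit are assumptions added for illustration. Without a watchdog, a hung server blocks the loop forever, which matches the behaviour reported above.

```python
import subprocess
import time


def supervise(cmd, restart_delay=2.0, max_runtime=None, max_runs=None):
    """Run cmd repeatedly, sleeping restart_delay seconds between runs.

    max_runtime (hypothetical watchdog): if set, a run that lasts longer
    than this many seconds is killed and recorded as "killed".
    max_runs: stop after this many runs (None means loop forever).
    Returns the list of exit codes or "killed" markers.
    """
    outcomes = []
    while max_runs is None or len(outcomes) < max_runs:
        try:
            # subprocess.run kills the child itself when the timeout expires.
            completed = subprocess.run(cmd, timeout=max_runtime)
            outcomes.append(completed.returncode)
        except subprocess.TimeoutExpired:
            outcomes.append("killed")
        time.sleep(restart_delay)
    return outcomes


# Possible use with the server command from this report (not run here):
# supervise(["iperf3", "--server", "--interval", "0", "-p", "5202", "-1"],
#           restart_delay=2.0, max_runtime=120)
```

A wall-clock watchdog like this also kills a server that is simply idle and waiting for a client, which is harmless because the server is restarted. It would, however, cut off any test that runs longer than max_runtime, so the value has to exceed the longest expected test.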

The iperf3 client command (on the Android phone) is:

iperf3 --forceflush -c x.x.x.x -V -p 5202 -u -t 15 -i 5 -fK -4 -b 5000000 -l 1200 -P 4 -O 0

(x.x.x.x is the IP address of the server)

I've attached the logs of iperf3 server executed with the --debug=3 option:

iperf3server_debug3_ok.log shows the correct behaviour: at timestamp 06:54:11 iperf3 prints the report and exits, as expected with the -1 option.

iperf3server_debug3_blocked.log shows the wrong behaviour: at timestamp 06:42:15 iperf3 prints the report but does not exit; at timestamp 06:52:23 iperf3 is killed and logs "iperf3: interrupt - the server has terminated".

davidBar-On commented 1 month ago

@RizziMau, the problem may be that the server is waiting for the "Done" message from the client. However, since the client has already terminated, the control socket between the processes should no longer be open, and the server should have failed or timed out. It is therefore not clear why the server did not exit.

In any case, a few questions:

  1. Just to make sure: the failed log is from a reverse test (-R) and the successful log is from a non-reverse test. Is this because all the reverse tests hit this error?
  2. Again, to make sure: is the client version also 3.16?
  3. Can you recreate a failed log (using --debug=3), but this time run the client a few times while the server is stuck (before terminating it)? When the client displays the "the server is busy running a test" message, the server should display "successfully sent ACCESS_DENIED to an unsolicited connection request during active test". Does the server display these messages? (Running the client a few times is to make sure that some of the server messages are displayed, as they are not flushed.)
  4. When the test is running there should be 5 server threads: the main thread and one for each of the 4 streams. When the server gets stuck:
    • Are all 5 threads still running?
    • How much CPU does each of them consume?
  5. Are you able to build an iperf3 executable for the server (Linux)? I am asking in case it would be useful to create special versions that may help the analysis (usually with more debug messages).
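
For question 4, one possible way to check the threads on a Linux server is to read the per-thread entries under /proc. The sketch below is only an illustration: the function name thread_cpu_seconds is made up, and it assumes the standard Linux /proc/<pid>/task/<tid>/stat layout, where utime and stime are fields 14 and 15.

```python
import os


def thread_cpu_seconds(pid):
    """Return {tid: CPU seconds used so far} for every thread of pid (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    usage = {}
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the thread exited between listdir() and open()
        # The command name (field 2) is in parentheses and may contain spaces,
        # so split after the last ")" and count fields from field 3 (state).
        fields = stat.rsplit(")", 1)[1].split()
        utime_ticks = int(fields[11])  # field 14: user-mode time
        stime_ticks = int(fields[12])  # field 15: kernel-mode time
        usage[int(tid)] = (utime_ticks + stime_ticks) / ticks_per_second
    return usage
```

The number of keys in the result is the number of live threads. Calling the function twice a few seconds apart and subtracting the values shows which threads are still consuming CPU while the server is stuck.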
davidBar-On commented 1 month ago

@RizziMau, I believe I understand what the problem is: the "DONE" state sent by the client is not received by the server, and the server waits for it forever. I have a proposed fix that uses the --rcv-timeout value as the timeout for this wait. Before I submit a PR for the fix, it would be very helpful if you could test at least the server side of it, to confirm that it really fixes the problem.

Can you build and run iperf3 for the server (at least) from branch "issue-1735-timeout-select-when-not-in-running-state" in "https://github.com/davidBar-On/iperf.git" (git clone https://github.com/davidBar-On/iperf.git -b issue-1735-timeout-select-when-not-in-running-state)?
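
The general idea of waiting for a control message with a timeout, instead of blocking forever, can be sketched as follows. This is a Python illustration of the technique only, not the code in the branch: wait_for_state and DONE_STATE are hypothetical names, the value 16 is an illustrative placeholder, and the one-signed-byte message format is an assumption made for the example.

```python
import select
import socket

DONE_STATE = 16  # illustrative placeholder for a "done" state code


def wait_for_state(ctrl_sock, timeout_s):
    """Wait up to timeout_s seconds for a one-byte state message.

    Returns the state as a signed integer, or None if nothing arrived
    in time. Raises ConnectionError if the peer closed the connection.
    """
    readable, _, _ = select.select([ctrl_sock], [], [], timeout_s)
    if not readable:
        return None  # timed out: the caller can give up instead of hanging
    data = ctrl_sock.recv(1)
    if not data:
        raise ConnectionError("control connection closed by peer")
    return int.from_bytes(data, "big", signed=True)


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    print(wait_for_state(server_side, 0.2))   # nothing was sent yet
    client_side.send(bytes([DONE_STATE]))
    print(wait_for_state(server_side, 1.0))   # the state byte arrives
```

With a bounded wait, a lost "DONE" message turns into an error the server can report and recover from, rather than an indefinite hang.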

RizziMau commented 1 month ago

@davidBar-On, I'm testing your iperf3 version and will keep you updated.

Just a remark: for this issue, is it better to use the --rcv-timeout parameter, or to add a new parameter, e.g. --done-timeout or --close-timeout?

Iperf3 has different timeouts for different aspects:

I think it would be preferable to use a specific timeout parameter for the closure of the control connection.

davidBar-On commented 1 month ago

@RizziMau, thanks for testing the change. This will help validate it in general (and of course show whether it solves your issue).

Regarding the use of --rcv-timeout: the iperf3 team does not like to add new options. Therefore, to improve the chances that the change will be merged into mainline, I try to reuse existing options when possible.

Currently the --rcv-timeout value is used only while test data is being sent (TEST_RUNNING state). Its value is usually related to the expected length of "network stuck" periods (except for very low test bandwidth, e.g. sending one packet every 10 seconds). Therefore, I believe this value can also serve as the timeout for receiving control messages when test data is not being sent.

Note that regardless of the option name, the change implements a timeout for most of the state-change control messages. I believe it solves a general issue in iperf3, where the server or client gets stuck if a control message is not received (your issue is one example).