esnet / iperf

iperf3: A TCP, UDP, and SCTP network bandwidth measurement tool

--bidir option randomly fails with "iperf3: error - unable to connect stream: No such file or directory" on Windows #1314

Open madbrain76 opened 2 years ago

madbrain76 commented 2 years ago

Context

I'm using the --bidir feature of iperf3 to measure performance between multiple LAN devices on a variety of operating systems. When using Windows as a client, the option works erratically, often displaying the message "iperf3: error - unable to connect stream: No such file or directory", but not every time.

Please note: iperf3 is supported on Linux, FreeBSD, and macOS. Support may be provided on a best-effort basis to other UNIX-like platforms. We cannot provide support for building and/or running iperf3 on Windows, iOS, or Android.

I am aware support can't be provided for Windows, but I still think it's worth filing and tracking the issue here. I will attempt to fix the bug myself, if no one else does.

I used cygwin for the build. It was a while back and I don't recall which compiler was used. I didn't succeed in building debug binaries, which is why there is no pull request attached.

Bug Report

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 --bidir
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:650f:5df3:62b3:42da port 49943 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
[  7] local 2601:646:8801:9d00:650f:5df3:62b3:42da port 49945 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-1.00   sec   249 MBytes  2.09 Gbits/sec
[  7][RX-C]   0.00-1.00   sec  95.3 MBytes   800 Mbits/sec
[  5][TX-C]   1.00-2.00   sec   231 MBytes  1.93 Gbits/sec
[  7][RX-C]   1.00-2.00   sec   149 MBytes  1.25 Gbits/sec
[  5][TX-C]   2.00-3.00   sec   152 MBytes  1.28 Gbits/sec
[  7][RX-C]   2.00-3.00   sec   168 MBytes  1.41 Gbits/sec
[  5][TX-C]   3.00-4.00   sec   202 MBytes  1.69 Gbits/sec
[  7][RX-C]   3.00-4.00   sec   157 MBytes  1.32 Gbits/sec
[  5][TX-C]   4.00-5.00   sec   227 MBytes  1.90 Gbits/sec
[  7][RX-C]   4.00-5.00   sec   151 MBytes  1.27 Gbits/sec
[  5][TX-C]   5.00-6.00   sec   180 MBytes  1.51 Gbits/sec
[  7][RX-C]   5.00-6.00   sec   162 MBytes  1.36 Gbits/sec
[  5][TX-C]   6.00-7.00   sec   161 MBytes  1.35 Gbits/sec
[  7][RX-C]   6.00-7.00   sec   166 MBytes  1.39 Gbits/sec
[  5][TX-C]   7.00-8.00   sec   219 MBytes  1.84 Gbits/sec
[  7][RX-C]   7.00-8.00   sec   154 MBytes  1.29 Gbits/sec
[  5][TX-C]   8.00-9.00   sec   228 MBytes  1.91 Gbits/sec
[  7][RX-C]   8.00-9.00   sec   150 MBytes  1.26 Gbits/sec
[  5][TX-C]   9.00-10.00  sec   163 MBytes  1.37 Gbits/sec
[  7][RX-C]   9.00-10.00  sec   164 MBytes  1.38 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  1.96 GBytes  1.69 Gbits/sec                  sender
[  5][TX-C]   0.00-10.01  sec  1.96 GBytes  1.68 Gbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec  1.48 GBytes  1.27 Gbits/sec    0             sender
[  7][RX-C]   0.00-10.01  sec  1.48 GBytes  1.27 Gbits/sec                  receiver

iperf Done.

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 --bidir
Connecting to host pi64, port 5201
iperf3: error - unable to connect stream: No such file or directory

C:\Users\Julien Pierre\Desktop\iperf3>

The --bidir option should work reliably without error every time.

The message "iperf3: error - unable to connect stream: No such file or directory" frequently appears when trying to use the --bidir option against a remote iperf3 3.10+ server.

Build iperf3 on Windows. Use the binary in client mode with the --bidir option to hit any remote iperf3 3.10+ server, i.e.

iperf3 -c <remotehost> --bidir

Repeat this command several times until the error shows up. It is very frequent, more than half the time.

I have not seen this error when running my iperf3 binary as a server on Windows and hitting it locally with the --bidir option. In this case, the error never appears.

swlars commented 2 years ago

Thanks for creating this issue. As you already know, we don't officially support Windows and it is difficult for us to test on it. Hopefully another Windows user in the community might be able to shed some light on this.

davidBar-On commented 2 years ago

I am running a server under WSL Linux and a client under Cygwin terminal in the same PC without problems. Therefore it may be that the problem is somehow caused by the network between the different machines.

If you want to evaluate this issue further:

  1. What is the server's version? Is it an old version? (Although I tried with 3.7 server with no problems.)
  2. Is there any error reported on the server side?
  3. Run the client (and preferably also the server) with -V -d options. May give additional information that can help.
  4. Try using UDP (-u)? Do you still have the same problem or it happens only with TCP?

madbrain76 commented 2 years ago

David,

Thanks for your response. More inline.

I am running a server under WSL Linux and a client under Cygwin terminal in the same PC without problems. Therefore it may be that the problem is somehow caused by the network between the different machines.

OK. I'm not using WSL. I'm using various remote physical machines as servers.

In fact, having WSL Linux running might be the reason why you can't reproduce the issue.

While responding to your message, I had 3 VirtualBox VMs running in the background on my Windows desktop. I was only able to reproduce the problem about once out of 20 tries. And I wasn't even hitting the iperf3 server in those VMs. I was still hitting remote systems.

Once I stopped all the background VMs, the frequency of the problem went back to what it was before, failing about 9 out of 10 times. Having VMs in the background would have slowed down the system.

I think there is likely some sort of race condition in play here. I'm using a pretty beefy system for the Windows side, a Ryzen 5950X which has 16 cores / 32 threads, so that might make a difference in the reproducibility of the problem.

Having the VMs running in the background might cause some of the CPU cores to be reserved for them, and thus reduce the possibility of iperf3 threads jumping between CPU cores/threads, making the race condition much less likely to show up.

If it is a race condition, it might show up under another OS too, but of course it might not. That is the nature of race conditions.

At the moment, Windows is the only thing I have installed on this machine. I will try booting Ubuntu from a live USB to see if I can reproduce the problem there on the same hardware.

If you want to evaluate this issue further:

1. What is the server's version?  Is it an old version? (Although I tried with 3.7 server with no problems.)

I am using the server versions between 3.10 and 3.11 on Raspberry Pi OS 64-bit on my Raspberry Pi 3B+ and 4B, as well as Ubuntu 20.04 on my Odroid XU4, Ubuntu 22.04 on Odroid N2+, Ubuntu 20.04 on my x64 NAS. I built iperf3 myself on all these.

2. Is there any error reported on the server side?

Nothing obvious that I can see. Here is a session from the server when the problem happened :

root@pi64:~/scripts# iperf3 -s -V -d
iperf 3.10.1
Linux pi64 5.15.40-v8julien+ #2 SMP PREEMPT Tue May 17 19:35:43 PDT 2022 aarch64
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
get_parameters:
{
    "tcp":  true,
    "omit": 0,
    "time": 10,
    "parallel": 1,
    "bidirectional":    true,
    "len":  131072,
    "pacing_timer": 1000,
    "client_version":   "3.11"
}
SNDBUF is 16384, expecting 0
RCVBUF is 131072, expecting 0
Time: Sat, 21 May 2022 02:10:59 GMT
Accepted connection from 2601:646:8801:9d00:650f:5df3:62b3:42da, port 49416
      Cookie: nfdrwgyredfamour7pgijl4szaahokdogroa
      TCP MSS: 0 (default)
Congestion algorithm is cubic
[  5] local 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201 connected to 2601:646:8801:9d00:650f:5df3:62b3:42da port 49417
iperf 3.10.1
Linux pi64 5.15.40-v8julien+ #2 SMP PREEMPT Tue May 17 19:35:43 PDT 2022 aarch64

Here is the corresponding session from the client side :

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 --bidir -V -d
iperf 3.11
CYGWIN_NT-10.0-19044 HIGGS 3.3.3-341.x86_64 2021-12-03 16:35 UTC x86_64
Control connection MSS 1440
send_parameters:
{
        "tcp":  true,
        "omit": 0,
        "time": 10,
        "parallel":     1,
        "bidirectional":        true,
        "len":  131072,
        "pacing_timer": 1000,
        "client_version":       "3.11"
}
Time: Sat, 21 May 2022 02:10:59 GMT
Connecting to host pi64, port 5201
      Cookie: nfdrwgyredfamour7pgijl4szaahokdogroa
      TCP MSS: 1440 (default)
SNDBUF is 65536, expecting 0
RCVBUF is 65536, expecting 0
[  5] local 2601:646:8801:9d00:650f:5df3:62b3:42da port 49417 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

3. Run the client (and preferably also the server) with `-V -d` options.  May give additional information that can help.

Yes, see 2) .

4. Try using UDP (`-u`)?  Do you still have the same problem or it happens only with TCP?

I just tried with UDP, and I was able to reproduce it too.

Here is the client output (Windows side) :

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 --bidir -u -V -d
iperf 3.11
CYGWIN_NT-10.0-19044 HIGGS 3.3.3-341.x86_64 2021-12-03 16:35 UTC x86_64
Control connection MSS 1440
Setting UDP block size to 1440
send_parameters:
{
        "udp":  true,
        "omit": 0,
        "time": 10,
        "parallel":     1,
        "bidirectional":        true,
        "len":  1440,
        "bandwidth":    1048576,
        "pacing_timer": 1000,
        "client_version":       "3.11"
}
Time: Sat, 21 May 2022 02:09:24 GMT
Connecting to host pi64, port 5201
      Cookie: cbkouddtfusobwdq4jdedm55rdr66pegctue
      Target Bitrate: 1048576
SNDBUF is 65536, expecting 0
RCVBUF is 65536, expecting 0
Setting application pacing to 131072
[  5] local 2601:646:8801:9d00:650f:5df3:62b3:42da port 63601 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

Here is the server output (Pi 4B) :


root@pi64:~/scripts# iperf3 -s -V -d
iperf 3.10.1
Linux pi64 5.15.40-v8julien+ #2 SMP PREEMPT Tue May 17 19:35:43 PDT 2022 aarch64
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
get_parameters:
{
    "udp":  true,
    "omit": 0,
    "time": 10,
    "parallel": 1,
    "bidirectional":    true,
    "len":  1440,
    "bandwidth":    1048576,
    "pacing_timer": 1000,
    "client_version":   "3.11"
}
Time: Sat, 21 May 2022 02:09:24 GMT
Accepted connection from 2601:646:8801:9d00:650f:5df3:62b3:42da, port 49378
      Cookie: cbkouddtfusobwdq4jdedm55rdr66pegctue
      Target Bitrate: 1048576
SNDBUF is 212992, expecting 0
RCVBUF is 212992, expecting 0
Setting application pacing to 131072
[  5] local 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201 connected to 2601:646:8801:9d00:650f:5df3:62b3:42da port 63601
iperf 3.10.1
Linux pi64 5.15.40-v8julien+ #2 SMP PREEMPT Tue May 17 19:35:43 PDT 2022 aarch64

madbrain76 commented 2 years ago

Booted up Ubuntu on the same AMD 5950X hardware. Built iperf3 from master. I ran 1000 iterations of "iperf3 -c pi64 --bidir -t 1", and none ran into any problem. It would seem that this race condition is specific to Windows.
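A repeated run like the 1000 iterations described above could be automated with a small driver. The following is only a sketch: `count_failures` is a hypothetical helper, not part of iperf3, and it assumes a POSIX environment and that the command exits with a non-zero status when it fails.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper (not part of iperf3): run `cmd` through the shell
 * `iterations` times and return how many runs did not exit with status 0. */
int count_failures(const char *cmd, int iterations)
{
    int failures = 0;

    for (int i = 1; i <= iterations; i++) {
        int status = system(cmd);

        /* system() returns -1 if the child could not be created; otherwise
         * the status is decoded with the wait macros. */
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            failures++;
            fprintf(stderr, "iteration %d failed\n", i);
        }
    }
    return failures;
}
```

Under those assumptions, `count_failures("iperf3 -c pi64 --bidir -t 1", 1000)` would return the number of failed runs.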

davidBar-On commented 2 years ago

@madbrain76, thanks for the detailed info.

I see in the iperf3 code that when the client is connecting to a socket, errno EINPROGRESS is regarded as a successful connection, and that by default the client does not wait for the connection to complete. It may be that the connection from Windows takes more time (or that the Cygwin implementation of connect() is different).

If this is the problem, then for UDP it can be mitigated by setting the --connect-timeout option. Can you try running the UDP test with this option set? I am not sure what a reasonable waiting time is, but probably one second (--connect-timeout 1000) or a few seconds should be enough.
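For readers unfamiliar with the pattern, waiting for a non-blocking connect() to finish generally looks like the sketch below. This is not the iperf3 implementation; `connect_wait` is a hypothetical name, and the code assumes a POSIX sockets API. After EINPROGRESS it polls for writability and then reads the final result from SO_ERROR.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not iperf3 code): connect `fd` to `addr`, waiting at
 * most `timeout_ms` milliseconds for completion.
 * Returns 0 on success, -1 on failure with errno set. */
int connect_wait(int fd, const struct sockaddr *addr, socklen_t len,
                 int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    int rc = connect(fd, addr, len);
    if (rc == -1 && errno == EINPROGRESS) {
        /* The connection is still being set up: wait until the socket is
         * writable, then fetch the outcome from SO_ERROR. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int n = poll(&pfd, 1, timeout_ms);

        if (n == 0) {
            errno = ETIMEDOUT;          /* nothing happened in time */
        } else if (n == 1) {
            int soerr = 0;
            socklen_t optlen = sizeof(soerr);

            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &soerr, &optlen) == 0) {
                if (soerr == 0)
                    rc = 0;             /* connection completed */
                else
                    errno = soerr;      /* connection failed */
            }
        }
        /* n == -1: poll() itself failed and errno is already set. */
    }

    /* Restore the original file status flags without clobbering errno. */
    int saved_errno = errno;
    fcntl(fd, F_SETFL, flags);
    errno = saved_errno;
    return rc;
}
```

The point of the sketch is that EINPROGRESS alone says nothing about whether the connection will succeed; only the SO_ERROR value read after poll() does.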

If that does not solve the problem, can you try either or both of the TCP and UDP tests with several streams, e.g. -P 3, instead of --bidir? That may indicate whether the problem is related to setting up multiple sockets.

One more thing that may help the evaluation is adding the -J option to the command line as it may give more information.

madbrain76 commented 2 years ago

David,

I just tried --connect-timeout in conjunction with UDP (-u). I was still able to reproduce the problem. I tried as high as --connect-timeout 50000 . The program returned the failure within just a couple of seconds, though. So, I don't think the option is being honored. I was able to reproduce the problem with multiple TCP streams, also. The problem seems prevalent whenever there is more than one socket indeed, not just with --bidir .

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -t 1
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 59123 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -t 1 -u
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 53265 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -t 1 -u --connect-timeout 1000
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 56263 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -t 1 -u --connect-timeout 5000
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 51429 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -t 1 -u --connect-timeout 50000
Connecting to host pi64, port 5201
[  5] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 61765 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream: No such file or directory

Also tested with the public binaries of iperf3 version 3.1.3, which don't support --bidir, and was able to reproduce the problem with them as well. I used those very same binaries in the past with no issues even with multiple sockets back when I had a slower machine with fewer cores - an Intel i7-5820k. I upgraded to the AMD 5950X last November. I am still using the exact same Aquantia physical NIC, too. The Windows 10 OS has gone through updates, though, as has the NIC driver.

D:\Downloads\iperf-3.1.3-win64>iperf3 -c pi64 -P 3 -t 1
Connecting to host pi64, port 5201
[  4] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 59325 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream:

D:\Downloads\iperf-3.1.3-win64>iperf3 -c pi64 -P 3 -t 1 -u
Connecting to host pi64, port 5201
[  4] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 52801 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
[  6] local 2601:646:8801:9d00:382b:14a2:5a87:3580 port 56086 connected to 2601:646:8801:9d00:53c0:7267:b327:e9c port 5201
iperf3: error - unable to connect stream:

Here is one attempt with my 3.11 binaries, and -J :

C:\Users\Julien Pierre\Desktop\iperf3>iperf3 -c pi64 -P 3 -4  -J
{
        "start":        {
                "connected":    [{
                                "socket":       5,
                                "local_host":   "192.168.1.3",
                                "local_port":   59526,
                                "remote_host":  "192.168.1.129",
                                "remote_port":  5201
                        }],
                "version":      "iperf 3.11",
                "system_info":  "CYGWIN_NT-10.0-19044 HIGGS 3.3.3-341.x86_64 2021-12-03 16:35 UTC x86_64",
                "timestamp":    {
                        "time": "Sat, 21 May 2022 12:38:39 GMT",
                        "timesecs":     1653136719
                },
                "connecting_to":        {
                        "host": "pi64",
                        "port": 5201
                },
                "cookie":       "yzzdplzdflxztgzhyjrwx6jgehue2zf3edgw",
                "tcp_mss_default":      1460,
                "target_bitrate":       0,
                "sock_bufsize": 0,
                "sndbuf_actual":        65536,
                "rcvbuf_actual":        65536
        },
        "intervals":    [],
        "end":  {
        },
        "error":        "unable to connect stream: No such file or directory"
}

davidBar-On commented 2 years ago

Julien,

I used those very same binaries in the past with no issues even with multiple sockets back ...

It may be that the issue is related to changes in Windows and/or Cygwin. Later this week I will try to build a Windows version with additional debug messages. Hopefully that will at least allow identifying the exact place where the problem happens.

madbrain76 commented 2 years ago

The old 3.1.3 binaries I used come with the old cygwin DLL, which rules out a cygwin regression as the root cause.

It's likely something to do with Windows changes, or possibly a race that just didn't show up with 6 cores / 12 threads, but does with 16 cores / 32 threads.

If I could build an executable with debug information, I would step through the code and find the places to add the debug messages myself. Of course, it's very unlikely the race would be reproducible within the debugger itself.

I just didn't manage to get autoconf to output the right debug Makefile.

Debug messages can also cause synchronization issues with stdout/stderr, and hide race conditions.

davidBar-On commented 2 years ago

If I could build an executable with debug information, I would step through the code and find the places to add the debug messages myself. Of course, it's very unlikely the race would be reproducible within the debugger itself

If you can build it yourself, here are the places where I plan to add initial debug messages (maybe just using printf()) to identify the function where the problem happens:

  1. iperf_udp_connect(): before and after the call to netdial().
  2. iperf_tcp_connect(): before and after the calls to create_socket(), connect().
  3. create_socket(): before and after the calls to socket(), bind().
  4. netdial(): before and after calls to create_socket(), timeout_connect().
  5. timeout_connect(): before and after the calls to fcntl() (3 times), connect(), poll(), getsockopt().

Debug messages can also cause synchronization issues with stdout/stderr, and hide race conditions.

If that turns out to be the case, then it may be possible to detect the exact place where adding some delay can help, at least as a workaround.

davidBar-On commented 2 years ago

Julien, attached is a Windows iperf3 debug version with the debug messages (always displayed - no need for id). If you will use this version it may be possible to identify where the problem occurs. This will allow more focused evaluation of the problem. iperf3_test_bidir_issue_1314.zip

madbrain76 commented 2 years ago

David, Thank you for these binaries. Here is the output for a failure case with the -J option. The debug messages didn't prevent the problem from being reproduced.

D:\Downloads\iperf3_test_bidir_issue_1314>iperf3 -c pi64.local --bidir -J -d
warning: Debug output (-d) may interfere with JSON output (-J)
[BIDIR] netdial: calling create_socket
[BIDIR] create_socket: calling socket
[BIDIR] create_socket: after calling socket - s=4
[BIDIR] netdial: after calling create_socket - s=4, errno=0
[BIDIR] netdial: calling timeout_connect - s=4
[BIDIR] timeout_connect: calling connect - s=4
[BIDIR] timeout_connect: after connect - s=4, errno=0
[BIDIR] timeout_connect: calling fcntl2 - s=4
[BIDIR] timeout_connect: after fcntl2 - s=4
[BIDIR] netdial: after calling timeout_connect - s=4, errno=0
send_parameters:
{
        "tcp":  true,
        "omit": 0,
        "time": 10,
        "parallel":     1,
        "bidirectional":        true,
        "len":  131072,
        "pacing_timer": 1000,
        "client_version":       "3.11"
}
[BIDIR] iperf_tcp_connect: calling create_socket
[BIDIR] iperf_tcp_connect: after create_socket - s=-1
{
        "start":        {
                "connected":    [],
                "version":      "iperf 3.11",
                "system_info":  "CYGWIN_NT-10.0-19044 HIGGS 3.3.5-341.x86_64 2022-05-13 12:27 UTC x86_64",
                "timestamp":    {
                        "time": "Sat, 28 May 2022 09:56:08 GMT",
                        "timesecs":     1653731768
                },
                "connecting_to":        {
                        "host": "pi64.local",
                        "port": 5201
                },
                "cookie":       "pzanrdkp7dj6p7pk7tsdte3p4pb3ihipdq5o",
                "tcp_mss_default":      1440,
                "target_bitrate":       0
        },
        "intervals":    [],
        "end":  {
        },
        "error":        "unable to connect stream: No such file or directory"
}

I actually managed to build the debug binaries since your last post. I tried to trace the thread ID, and noticed that calls were executed in a single thread, even when multiple sockets were in use with --bidir . This makes it even more curious.

Would you mind providing a pointer to the source/patch you used that contains all your debug messages ? That will be helpful.

Another test I performed was to try running the client against a local iperf3 server on the Windows machine. I cannot reproduce the problem in this case.

madbrain76 commented 2 years ago

The problem appears to be in the following block of code in create_socket, which includes logging I added :

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = domain;
    hints.ai_socktype = proto;
    snprintf(portstr, sizeof(portstr), "%d", port);
    if ((gerror = getaddrinfo(server, portstr, &hints, &server_res)) != 0) {
        fprintf(stdout, "create_socket : getaddrinfo failure case #2. gerror = %d, server = %s, portstr = %s, hints.ai_family = %d, hints.ai_socktype = %d.\n", gerror, server, portstr, hints.ai_family, hints.ai_socktype);
        if (local)
            freeaddrinfo(local_res);
        return -1;
    }
    else {
        fprintf(stdout, "create_socket : getaddrinfo success . server = %s, portstr = %s, hints.ai_family = %d, hints.ai_socktype = %d.\n", server, portstr, hints.ai_family, hints.ai_socktype);
    }

The relevant output for the failure case is :

create_socket : getaddrinfo failure case #2. gerror = 8, server = pi64.local, portstr = 5201, hints.ai_family = 0, hints.ai_socktype = 1.

Unfortunately, I also see :

create_socket : getaddrinfo success . server = pi64.local, portstr = 5201, hints.ai_family = 0, hints.ai_socktype = 1.

i.e. the input arguments are identical in the failure and success cases of this API call.

I see gerror is declared as extern int, ie. it's a global across source files, which I would say is not best practice, but if there is no multi-threading, it might work OK. I tried to use a local for the return value of getaddrinfo in this one instance, and that didn't change the behavior.

A return code for getaddrinfo of 8 is listed in netdb.h as :

#define EAI_NONAME 8 /* Name or service not known */

Either the implementation of getaddrinfo in cygwin is broken, or there is something else underneath (OS) that is broken here.

davidBar-On commented 2 years ago

(UPDATE - added 3rd suggestion) This is interesting! I was going to send you a new iperf3 version with additional debug messages for the getaddrinfo, as I didn't expect initially that the error may be there.

To further evaluate the issue I suggest adding the following (I can do it myself, but it seems faster for you to make the changes, and you can enhance them based on the results you get):

  1. Use gai_strerror() to get the text of error 8. Just to make sure that it is the same in Cygwin as in standard Linux.
  2. The getaddrinfo man page says that it "combines the functionality provided by the gethostbyname(3) and getservbyname(3) functions". I suggest adding calls to gethostbyname and getservbyname before the call to getaddrinfo, just to see if and how each of them fails. That may give better information about the problem.
  3. To see whether the problem is temporary or maybe a Cygwin/Windows problem, put getaddrinfo in a loop while it fails (e.g. up to 10 times) with a sleep of one second between the calls.
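The first suggestion can be sketched as a small standalone helper that prints both the numeric code and the gai_strerror() text. `resolve_and_report` and its `flags` parameter are hypothetical names used only for illustration; this is not the iperf3 code.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not iperf3 code): resolve host/port and, on failure,
 * print the numeric getaddrinfo() code together with its gai_strerror() text.
 * `flags` is passed through as hints.ai_flags (0 for a normal lookup).
 * Returns the getaddrinfo() result code, 0 on success. */
int resolve_and_report(const char *host, const char *port, int flags)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = flags;

    int rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s) failed: %d (%s)\n",
                host, port, rc, gai_strerror(rc));
        return rc;
    }

    freeaddrinfo(res);
    return 0;
}
```

Note that the numeric values of the EAI_* constants differ between platforms, so the text from gai_strerror() is the more portable thing to compare.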

(Regarding not using threads: this is one of the main design decisions the iperf3 developers took; it is in the FAQ.)

madbrain76 commented 2 years ago

David,

  1. gai_strerror() returned "Name or service not known" for error 8 . I had looked up that error in the cygwin netdb.h previously.
  2. I checked just by calling gethostbyname(server) prior to getaddrinfo() . It also fails . h_errno has a value of 1, which is HOST_NOT_FOUND .
  3. I tried something a little different. I ran gethostbyname() for up to 10 iterations, until it succeeded, and without any delay.

    int lerror = 0;
    int i = 1;
    do {
        if (NULL == gethostbyname(server)) {
            fprintf(stdout, "create_socket : gethostbyname failure case #8. errno = %d , server = %s, iteration = %d .\n", h_errno, server, i);
        } else {
            break;
        }
    } while (i++ < 10);
    if ((lerror = getaddrinfo(server, portstr, &hints, &server_res)) != 0) {
        fprintf(stdout, "create_socket : getaddrinfo failure case #2. lerror = %d, error string = %s, server = %s, portstr = %s, hints.ai_family = %d, hints.ai_socktype = %d.\n", lerror, gai_strerror(lerror), server, portstr, hints.ai_family, hints.ai_socktype);

I only saw it fail on the first iteration, and never subsequent ones. I let the code proceed to the call to getaddrinfo() regardless, and that call never failed.

This is all quite strange. I built the cygwin dll from source, and tried to step through, but it turned out -O2 is applied and some arguments on the stack were still optimized out. I will rebuild with -O0 and investigate further tomorrow. Is there a proper way to do this with autoconf ? I replaced all instances of -O2 with -O0 in the generated Makefile, but it seems like there should be a better way.

madbrain76 commented 2 years ago

I ruled out cygwin as the root cause of the problem. I wrote a program that performs multiple iterations of gethostbyname as follows :

#ifdef _MSC_VER
#include <winsock.h>
#else
#include <netdb.h>
#endif
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    if (argc != 3) {
        printf("Syntax : host <hostname> <iterations>.\n");
        return -2;
    }
    int iterations = atoi(argv[2]);
    if (iterations < 1) {
        printf("Must perform at least 1 iteration.\n");
        return -3;
    }
#ifdef _MSC_VER
    WSADATA ret_data;
    if (0 != WSAStartup(MAKEWORD(2, 2), &ret_data)) {
        printf("WSAStartup failed.\n");
        return -1;
    }
#endif
    for (int i=1; i<1+iterations ; i++) {
        if (NULL == gethostbyname(argv[1])) {
            printf("gethostbyname failure. errno = %d , iteration = %d .\n", h_errno, i);
        } else {
            printf("Iteration %d OK.\n", i);
        }
    }
}

This fails randomly, whether compiled with cygwin/gcc or MSVC :

cygwin:

$ ./a.exe  pi64.local 10
Iteration 1 OK.
Iteration 2 OK.
gethostbyname failure. errno = 1 , iteration = 3 .
Iteration 4 OK.
gethostbyname failure. errno = 1 , iteration = 5 .
Iteration 6 OK.
Iteration 7 OK.
Iteration 8 OK.
gethostbyname failure. errno = 1 , iteration = 9 .
Iteration 10 OK.

MSVC :

D:\Dev\host>host.exe pi64.local 10
Iteration 1 OK.
Iteration 2 OK.
Iteration 3 OK.
Iteration 4 OK.
gethostbyname failure. errno = 11001 , iteration = 5 .
Iteration 6 OK.
Iteration 7 OK.
Iteration 8 OK.
gethostbyname failure. errno = 11001 , iteration = 9 .
Iteration 10 OK.

Unfortunately, I still have no clue as to the root cause, but it's fair to say it's not an iperf3 or a cygwin bug. I think it is an OS bug. I tried with multiple NICs, and multiple switches, and could reproduce the problem with all of them. I even tried a fresh install of the OS (Win10 21H2) and reproduced the problem just the same. I just restored my OS from backup, since no 3rd party software seems to be causing the problem.

If I use localhost as the hostname, I never see the problem, and the command returns almost instantly. If I use Internet DNS hostnames, I never see the problem. If I use hostnames for other Windows hosts on my LAN, I never see the problem.

Julien Pierre@HIGGS /cygdrive/d/Dev/host
$ time ./a.exe  pi64.local 10
Iteration 1 OK.
Iteration 2 OK.
gethostbyname failure. errno = 1 , iteration = 3 .
Iteration 4 OK.
gethostbyname failure. errno = 1 , iteration = 5 .
Iteration 6 OK.
gethostbyname failure. errno = 1 , iteration = 7 .
Iteration 8 OK.
Iteration 9 OK.
gethostbyname failure. errno = 1 , iteration = 10 .

real    0m14.777s
user    0m0.000s
sys     0m0.000s

Julien Pierre@HIGGS /cygdrive/d/Dev/host
$ time ./a.exe  localhost 10
Iteration 1 OK.
Iteration 2 OK.
Iteration 3 OK.
Iteration 4 OK.
Iteration 5 OK.
Iteration 6 OK.
Iteration 7 OK.
Iteration 8 OK.
Iteration 9 OK.
Iteration 10 OK.

real    0m0.019s
user    0m0.000s
sys     0m0.015s

Julien Pierre@HIGGS /cygdrive/d/Dev/host
$ time ./a.exe  www.cygwin.com 10
Iteration 1 OK.
Iteration 2 OK.
Iteration 3 OK.
Iteration 4 OK.
Iteration 5 OK.
Iteration 6 OK.
Iteration 7 OK.
Iteration 8 OK.
Iteration 9 OK.
Iteration 10 OK.

real    0m0.108s
user    0m0.000s
sys     0m0.015s

Julien Pierre@HIGGS /cygdrive/d/Dev/host
$ time ./a.exe  htpc-ryzen.local 10
Iteration 1 OK.
Iteration 2 OK.
Iteration 3 OK.
Iteration 4 OK.
Iteration 5 OK.
Iteration 6 OK.
Iteration 7 OK.
Iteration 8 OK.
Iteration 9 OK.
Iteration 10 OK.

real    0m0.040s
user    0m0.000s
sys     0m0.000s

Julien Pierre@HIGGS /cygdrive/d/Dev/host
$

I noticed that each iteration for pi64.local takes about 1.5 seconds, whether successful or failed.
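Per-lookup timing can also be measured directly rather than derived from the total run time. The sketch below uses getaddrinfo() instead of gethostbyname() and assumes a POSIX system with clock_gettime(); `time_lookup_ms` is a hypothetical helper name.

```c
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: time a single getaddrinfo() lookup of `host`.
 * Stores the result code in *rc_out (if not NULL) and returns the elapsed
 * wall-clock time in milliseconds, measured with the monotonic clock. */
double time_lookup_ms(const char *host, int *rc_out)
{
    struct addrinfo hints = { .ai_family = AF_UNSPEC };
    struct addrinfo *res = NULL;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = getaddrinfo(host, NULL, &hints, &res);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (rc == 0)
        freeaddrinfo(res);
    if (rc_out != NULL)
        *rc_out = rc;

    return (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Logging this value next to the result code for each iteration would show whether failed lookups take as long as successful ones.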

Even though I have not seen iperf3 fail in single socket mode again, I am also able to show the problem even on the first iteration when running the test program multiple times from a batch file.

D:\Dev\host>type loop.bat
@echo off
FOR /L %%A IN (1,1,10) DO host.exe pi64.local 1
D:\Dev\host>loop
Iteration 1 OK.
Iteration 1 OK.
gethostbyname failure. errno = 11001 , iteration = 1 .
Iteration 1 OK.
Iteration 1 OK.
gethostbyname failure. errno = 11001 , iteration = 1 .
Iteration 1 OK.
gethostbyname failure. errno = 11001 , iteration = 1 .
Iteration 1 OK.
Iteration 1 OK.

I have also tried running the test program on several other Windows hosts, and it never fails on any of them, although it is similarly very slow when trying to resolve those non-Windows LAN hostnames.

So, reproducing the issue requires: 1) running on my primary desktop, the machine with the AMD 5950X CPU; 2) running Windows 10 21H2; 3) trying to resolve one of my Unix LAN hosts.

Not sure how many others on the planet are affected by this, but this is a very annoying bug.

If you have any idea where I should report this to MS, please make suggestions, before I close this issue against iperf3 here.

davidBar-On commented 2 years ago

This is indeed strange, but hopefully your evaluation will help in solving the problem, or at least in finding a workaround for it. There are several issues about this problem in Windows.

Is there a proper way to do this with autoconf ? I replaced all instances of -O2 with -O0 in the generated Makefile, but it seems like there should be a better way.

I am currently using the pre-built version of the dll. However, it seems that one of the following can be done:

  1. Using the CFLAGS environment variable:
     A. setenv CFLAGS "-g"
     B. Run autoconf (or autoreconf) followed by ./configure to regenerate the Makefile.
     C. If B doesn't work, just run ./configure.
  2. If using CFLAGS doesn't work, either:
     A. Edit autoconf.ac and remove the "o2" from the settings of "CFLAGS_FOR_TARGET" and "CXXFLAGS_FOR_TARGET". Then run ./configure to regenerate the Makefile.
     B. If A doesn't work, remove "O2" from configure before executing it.

davidBar-On commented 2 years ago

Julien, great evaluation!

Not sure how many others on the planet are affected by this, but this is a very annoying bug.

There were other reports about a similar issue on Windows, so it seems your configuration is not unique.

If you have any idea where I should report this to MS, please make suggestions, before I close this issue against iperf3 here.

I will try to help find out how the issue can be reported to MS. However, please don't close the issue. Since your results suggest that when the first call fails the next call always succeeds, I think a PR should be submitted to iperf3 that retries a failed getaddrinfo call (maybe a loop of 3 calls?). Do you want to submit such a PR? If not, I will submit it (basically adding an iperf3_getaddrinfo() function in net.c that does the retry loop, and changing all the calls to getaddrinfo to use this function).
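
The retry idea can be sketched roughly as follows. This is only an illustration, not the code of any submitted PR: the name getaddrinfo_retry() and its max_attempts parameter are hypothetical (the comment above suggests iperf3_getaddrinfo() in net.c), and it assumes a POSIX-style getaddrinfo() such as the one Cygwin provides.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not part of iperf3: call getaddrinfo() up to
 * max_attempts times and return the status of the last attempt.
 * On success (return value 0) the caller owns *res and must release it
 * with freeaddrinfo(). */
static int
getaddrinfo_retry(const char *node, const char *service,
                  const struct addrinfo *hints, struct addrinfo **res,
                  int max_attempts)
{
    int rc = EAI_FAIL;

    assert(res != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        *res = NULL;
        rc = getaddrinfo(node, service, hints, res);
        if (rc == 0)
            break;      /* success: stop retrying */
    }
    return rc;
}
```

A failed getaddrinfo() does not hand back a result list, so there is nothing to free between attempts. Retrying on every error code is a simplification; a real patch might retry only on the "host not found" style of error seen in this report.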

madbrain76 commented 2 years ago

David,

Thanks for the help about setting the optimization level. I'll give it a shot.

re: this issue, do you have any references to those other reports ? I'd love to see what they have in common, in particular the hardware configuration.

As far as adding a loop to work around this problem, do you really think this is the way to go here ?

In my experience, putting in workarounds for bugs that exist in other layers/products may disincentivize third parties from fixing their own bugs. I could write a pull request, of course. But I would probably want to output something about this problem being detected, to see if it can help get it fixed correctly in the right layer.

I modified the test case to use getaddrinfo, since gethostbyname is deprecated. I posted about the issue in :

https://answers.microsoft.com/en-us/windows/forum/all/getaddrinfo-api-call-randomly-fails-for-unix-lan/26cff155-6649-4c80-8559-c8d798df2bd4

That is probably not the right forum, though.
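
For reference, a getaddrinfo-based reproduction loop could look roughly like the sketch below. It is not the actual test program from this thread: the function name count_lookup_failures() is hypothetical, the output format only loosely follows the "Iteration N OK." lines shown earlier, and it assumes the POSIX headers available under Cygwin.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical reproduction helper: resolve `host` `iterations` times
 * and return how many of the lookups failed. */
static int
count_lookup_failures(const char *host, int iterations,
                      const struct addrinfo *hints)
{
    int failures = 0;

    assert(host != NULL);
    for (int i = 1; i <= iterations; i++) {
        struct addrinfo *res = NULL;
        int rc = getaddrinfo(host, NULL, hints, &res);

        if (rc != 0) {
            /* gai_strerror() turns the getaddrinfo status into text */
            printf("getaddrinfo failure: %s, iteration = %d\n",
                   gai_strerror(rc), i);
            failures++;
            continue;
        }
        printf("Iteration %d OK.\n", i);
        freeaddrinfo(res);  /* each successful lookup must be freed */
    }
    return failures;
}
```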

davidBar-On commented 2 years ago

re: this issue, do you have any references to those other reports ? I'd love to see what they have in common, in particular the hardware configuration.

Actually I was able to find only one: #1201. Not sure whether I thought there were several such issues because I saw both that one and yours, or whether there are others that I didn't find ...

As far as adding a loop to work around this problem, do you really think this is the way to go here ?

The main reasons I think this is the way to go are that it may take MS a long time to fix the issue (if at all) and that iperf3 is not officially maintained for Windows. Therefore, it may not make any difference to MS priorities if an iperf3 workaround is implemented. As many people use the Windows version, it is desirable that it be as reliable as possible, and it is quite rare that Windows users who find issues are willing (or have the capabilities) to evaluate them as you did (see for example the other issue I referenced).

I modified the test case to use getaddrinfo, since gethostbyname is deprecated. I posted about the issue in : ... That is probably not the right forum, though.

While trying to find someone who knows the right forum, I got a recommendation to check the related traffic between the Windows and the Linux machines, using Wireshark (or tcpdump). Is this something you can do? If yes, logs should probably be collected on both machines.

madbrain76 commented 2 years ago

Thanks for that link. Not much info in there except the CPU and Windows 10 build number.

You convinced me that there should be a workaround in iperf3. I see that there are a lot of calls to getaddrinfo, as well as corresponding calls to freeaddrinfo. This may be a long pull request if I have to patch every single place. The logic is quite messy. I know Dijkstra considered goto harmful, but this is one case where it really would be much more appropriate, rather than duplicating the freeaddrinfo call for every failure case.
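
The single-exit cleanup pattern alluded to here can be sketched as below. This is not iperf3's actual code: dial_first_address() is a hypothetical helper that only tries the first address returned, and it assumes POSIX sockets as provided by Cygwin.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resolve host/port, then create and connect a
 * socket to the first address returned. Returns the socket descriptor,
 * or -1 on any failure. Every path leaves through the `out` label, so
 * freeaddrinfo() appears exactly once. */
static int
dial_first_address(const char *host, const char *port,
                   const struct addrinfo *hints)
{
    struct addrinfo *res = NULL;
    int fd = -1;

    assert(host != NULL && port != NULL);

    if (getaddrinfo(host, port, hints, &res) != 0)
        goto out;

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0)
        goto out;

    if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        close(fd);
        fd = -1;
    }

out:
    if (res != NULL)
        freeaddrinfo(res);
    return fd;
}
```

Without the label, each of the failure branches would need its own freeaddrinfo() call, which is the duplication described above.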

Also, since it takes over a second to create the socket in this case, I think the proper fix should cache the response. When using multiple sockets, for example -P 64, even if all the calls succeed, it will take about a minute and a half before the test actually starts. So, I think the right fix is a wrapper for getaddrinfo that would do both retries and caching. Each can be done in separate pull requests.
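
To put a number on the -P 64 estimate: at roughly 1.5 seconds per lookup, 64 sequential lookups come to about 64 × 1.5 s = 96 s, which is the minute and a half mentioned above. A caching wrapper could be sketched as follows. The struct addr_cache type and cached_resolve() function are hypothetical, not iperf3 code; the sketch keeps a single entry, stores only the first address returned, and assumes the same hints are used for every call within one run.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical single-entry cache for one resolved host name. */
struct addr_cache {
    char host[256];
    struct sockaddr_storage addr;
    socklen_t addrlen;
    int valid;
    unsigned lookups;   /* real getaddrinfo() calls made, for illustration */
};

/* Return the first address for `host` in *out, calling getaddrinfo()
 * only when the host is not already cached. Returns 0 on success or
 * the getaddrinfo() error code on failure. */
static int
cached_resolve(struct addr_cache *c, const char *host,
               const struct addrinfo *hints,
               struct sockaddr_storage *out, socklen_t *outlen)
{
    struct addrinfo *res = NULL;
    int rc;

    assert(c != NULL && host != NULL && out != NULL && outlen != NULL);

    if (c->valid && strcmp(c->host, host) == 0) {
        *out = c->addr;         /* cache hit: no lookup needed */
        *outlen = c->addrlen;
        return 0;
    }

    c->lookups++;
    rc = getaddrinfo(host, NULL, hints, &res);
    if (rc != 0)
        return rc;

    memset(out, 0, sizeof(*out));
    memcpy(out, res->ai_addr, res->ai_addrlen);
    *outlen = res->ai_addrlen;
    freeaddrinfo(res);

    if (strlen(host) < sizeof(c->host)) {   /* only cache names that fit */
        strcpy(c->host, host);
        c->addr = *out;
        c->addrlen = *outlen;
        c->valid = 1;
    }
    return 0;
}
```

With a cache like this, 64 streams to the same host would trigger one lookup instead of 64. Storing a copy of the socket address rather than the addrinfo list avoids having to decide who calls freeaddrinfo() later.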

I thought of taking network dumps as well. From what I understand, this is expected to be mDNS traffic.

madbrain76 commented 2 years ago

I created a pull request at https://github.com/esnet/iperf/pull/1346 . I don't see a way to request a review. It's been a while since I last used github to submit code.

davidBar-On commented 2 years ago

I think the proper fix should cache the response

I also thought about caching the responses. However, this may be a problem, at least for a server, as the network/addresses may change. I found an old issue about that for gethostbyname().

I created a pull request at https://github.com/esnet/iperf/pull/1346 . I don't see a way to request a review. It's been a while since I last used github to submit code.

This is great! I hope it will be merged into the mainline. As far as I am aware there is no request for review, or at least it is not used. The iperf3 maintainers go over the submitted PRs.

I thought of taking network dumps as well. From what I understand, this is expected to be mDNS traffic.

Are you going to do that? I believe it may really help. Even if it does not help solve the core problem, it may at least help to understand the source of the problem.

madbrain76 commented 2 years ago

David,

Can you please point to what the issue was with gethostbyname() ?

Hostnames can indeed have their address change over time. Some DNS servers have round-robin schemes, or others. However, is this really something that one would expect to be relying on in the context of a single iperf run with multiple sockets ? For example, with the --bidir option, resolving the name twice to two different IPs could conceivably end up initiating connections to two different servers. I can't imagine this is what one would ever actually want to happen. Normally, the OS will cache the result of the DNS lookup, and this situation will just not occur. I don't know why Windows is not caching the result of the lookup in this case. It's probably related to the reason why the resolution intermittently fails.

Re: the PRs, I'll just wait for the review, then.

Re: the network dumps, yes, I will take some. But I'll have to filter things out, as there is going to be a lot of other traffic. It may take me some time.

davidBar-On commented 2 years ago

Julien, regarding the issue with gethostbyname(), I am not able to find it again. I will retry later. I did find this issue that may be somewhat related. This issue is also not directly related but may be interesting.

madbrain76 commented 2 years ago

Thanks. This is interesting. I see the issue relates to IPv6. I have seen the problem with getaddrinfo even when using the -4 switch in iperf3, though. So, it may not be related.

One interesting thing I see is that I don't see the problem when trying to resolve "homeassistant" which is the hostname of my HAOS installation on a Raspberry Pi 3B+. And the hostname only resolves to an IPv4 address, for some reason.

I just tried disabling IPv6 both on my Windows system and on pi64.local. I still see the problem with the intermittent getaddrinfo failure, though. So, I don't know if IPv6 is part of the problem or not.

davidBar-On commented 2 years ago

Thanks for the info.

I did find the discussion I was looking for. It is not about gethostbyname() as I thought, but is related to Unix sockets and Cygwin, although I am not sure it is related. The mail thread starts with this message, "Unix Domain Socket Limitation?" (https://sourceware.org/pipermail/cygwin/2020-November/246869.html), on a Cygwin mailing list. Note that the thread spans several months and the "Next message" link is applicable only within the specific month. To find the continuation of the thread you should choose the next month from the monthly mail lists.

madbrain76 commented 2 years ago

David, the issue appears to be about connect() indeed, not gethostbyname(). I don't believe this should affect the decision of whether or not to cache the output of a gethostbyname() / getaddrinfo().