colesbury / nogil

Multithreaded Python without the GIL

Different behaviour to CPython with socket code #107

Closed lesteve closed 1 year ago

lesteve commented 1 year ago

Maybe the following code is a bit edge-casy and is not supposed to work. The behaviour is different from CPython so I thought I would report it. I noticed this while trying to run the joblib tests with nogil, see https://github.com/joblib/joblib/pull/1387 for more details.

# test.py
import socket
import sys

def test():
    port = int(sys.argv[1])
    address_and_port = ("localhost", port)
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(address_and_port)
    listener.listen(1)

    client = socket.create_connection(address_and_port)
    server, client_addr = listener.accept()

if __name__ == "__main__":
    test()

Running the script the first time works fine:

python test.py 12345

The second time, I get an OSError: [Errno 98] Address already in use:

python test.py 12345

Output:

Traceback (most recent call last):
  File "/tmp/test.py", line 17, in <module>
    test()
  File "/tmp/test.py", line 9, in test
    listener.bind(address_and_port)
OSError: [Errno 98] Address already in use

A possible workaround is to close the client explicitly, for example:

with socket.create_connection(address_and_port) as client:
    server, client_addr = listener.accept()
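For completeness, here is a minimal sketch of the whole script with every socket closed explicitly, so the result no longer depends on the order in which the interpreter destroys local variables. The function name `run_once`, the `host` parameter, and the use of port 0 (let the OS pick a free port) are assumptions added for illustration; the original script takes the port from `sys.argv` and uses "localhost".

```python
import socket


def run_once(port=0, host="127.0.0.1"):
    """Bind, connect, accept, then close every socket in a fixed order."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind((host, port))
        listener.listen(1)
        # Port 0 asks the OS for a free port; read back what was assigned.
        bound_port = listener.getsockname()[1]

        with socket.create_connection((host, bound_port)) as client:
            server, client_addr = listener.accept()
        # The client is closed at this point, before the accepted socket,
        # so any TIME_WAIT entry sits on the client's ephemeral port.
        server.close()
    # The listener is closed when the outer "with" block exits.
    return bound_port


if __name__ == "__main__":
    first = run_once()
    # Re-binding the same port right away is expected to succeed.
    print(run_once(first) == first)
```

Closing the client first mirrors the order that upstream CPython happens to produce for this script, but makes it explicit instead of relying on garbage collection.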

Not a networking expert, but here is the netstat output after running the script the first time:

❯ sudo netstat -tanl | grep 12345
tcp        0      0 127.0.0.1:12345         127.0.0.1:47622         TIME_WAIT

colesbury commented 1 year ago

Hi @lesteve, thanks for the bug report. This looks like it's due to the order of destruction of local variables: upstream CPython clears the variables in the order they were created (listener, client, server), while "nogil" Python clears them in reverse order (server, client, listener). The order of destruction determines which end closes the connection first, and the end that closes first is the one left in the TIME_WAIT state. If that is the end with the randomly assigned port (the client socket from create_connection), it's not an issue. If it's the end on port 12345 (the "accepted" socket, which shares the listener's port), you get an "address already in use" error when re-running the example.

The difference in behavior here was not intentional, but at this point I'm hesitant to change it and risk introducing other bugs. I don't expect this behavioral change to be part of PEP 703.
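Independently of destruction order, a listener can be made tolerant of leftover TIME_WAIT entries on its own port by setting the standard SO_REUSEADDR socket option before bind(). This is not something proposed in this thread; the sketch below is only an illustration, and the helper names `make_listener` and `run_server_closes_first` are hypothetical. It deliberately closes the accepted socket first, the order that triggers the error in the original script, and then binds the same port again.

```python
import socket

HOST = "127.0.0.1"


def make_listener(port):
    """Create a listening socket that may rebind a port in TIME_WAIT."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR has to be set before bind() to take effect.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((HOST, port))
    listener.listen(1)
    return listener


def run_server_closes_first(port=0):
    """Close the accepted socket before the client, as "nogil" does here."""
    listener = make_listener(port)
    # Port 0 asks the OS for a free port; read back what was assigned.
    bound_port = listener.getsockname()[1]
    client = socket.create_connection((HOST, bound_port))
    server, _client_addr = listener.accept()
    server.close()    # this end closes first and ends up in TIME_WAIT
    client.close()
    listener.close()
    return bound_port


if __name__ == "__main__":
    first = run_server_closes_first()
    # With SO_REUSEADDR the second bind on the same port is expected to work.
    print(run_server_closes_first(first) == first)
```

The semantics described here are those of Linux and other Unix-like systems; on Windows the option behaves differently, so treat this as a sketch for the platform shown in the report.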

ctismer commented 1 year ago

That might be similar to the PySide bug.

colesbury commented 1 year ago

@ctismer They are similar in that they both have to do with order of destruction, but they involve separate code paths.

lesteve commented 1 year ago

Thanks a lot @colesbury for your answer; our code was a bit edge-casy. The behaviour change is slightly surprising, but the workaround is easy, so I am fine with closing this issue.