dims / etcd3-gateway

This repository is now read-only. Please see https://opendev.org/openstack/etcd3gw for the new location for this code.

Watch seems to leak connections #15

Closed: nelljerram closed this issue 6 years ago

nelljerram commented 6 years ago

Run a local etcd server on port 2379, e.g. with

docker run --net=host quay.io/coreos/etcd

Create an Etcd3Client and make an initial connection to the server:

(py27) neil@smaug:~/calico/networking-calico$ python
Python 2.7.14+ (default, Mar 13 2018, 15:23:44) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from etcd3gw.client import Etcd3Client
>>> c = Etcd3Client()
>>> c.status()
{u'raftTerm': u'2', u'header': {u'raft_term': u'2', u'revision': u'1', u'cluster_id': u'14841639068965178418', u'member_id': u'10276657743932975437'}, u'version': u'3.2.9', u'raftIndex': u'4', u'dbSize': u'24576', u'leader': u'10276657743932975437'}

Find out the PID of that Python shell (6770 in this example), and then in another window monitor its etcd connections with

watch 'lsof -p 6770 | grep 2379'

You'll initially see that there is 1 connection:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)

Start and then cancel a watch:

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

The lsof watch still shows just 1 connection.

However, for each further iteration of

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

we see 1 more ESTABLISHED connection. For example, after 8 more iterations, there are now 9 ESTABLISHED connections:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)
python  6770 neil    4u  IPv4 852249      0t0      TCP localhost:40374->localhost:2379 (ESTABLISHED)
python  6770 neil    5u  IPv4 851242      0t0      TCP localhost:40386->localhost:2379 (ESTABLISHED)
python  6770 neil    6u  IPv4 880526      0t0      TCP localhost:40580->localhost:2379 (ESTABLISHED)
python  6770 neil    7u  IPv4 884101      0t0      TCP localhost:40584->localhost:2379 (ESTABLISHED)
python  6770 neil    8u  IPv4 884108      0t0      TCP localhost:40588->localhost:2379 (ESTABLISHED)
python  6770 neil   10u  IPv4 884113      0t0      TCP localhost:40592->localhost:2379 (ESTABLISHED)
python  6770 neil   11u  IPv4 885316      0t0      TCP localhost:40596->localhost:2379 (ESTABLISHED)
python  6770 neil   12u  IPv4 880574      0t0      TCP localhost:40600->localhost:2379 (ESTABLISHED)

I believe this pattern continues without bound: in an overnight OpenStack churn test with networking-calico, the DHCP agent (which uses etcd3gw, and creates and cancels watches as above) was found to have more than 900 open connections to etcd.
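For reference, here is the reproduction condensed into one script. This is only a sketch: it assumes Linux, where /proc/self/fd lists the process's open file descriptors, and an etcd reachable on localhost:2379.

import os

from etcd3gw.client import Etcd3Client

def open_fds():
    # each entry under /proc/self/fd is an open file descriptor
    return len(os.listdir('/proc/self/fd'))

c = Etcd3Client()          # defaults to localhost:2379
c.status()                 # make the initial connection
before = open_fds()
for _ in range(10):
    stream, cancel = c.watch_prefix('/')
    cancel()
print('open fds grew by %d' % (open_fds() - before))
# with the leak, this grows by roughly one fd per iteration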

nelljerram commented 6 years ago

I guess it's this, from http://docs.python-requests.org/en/latest/user/advanced/#keep-alive:

Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

But since we never tell the server that we're no longer interested in the watch, I'm not sure "once all body data has been read" is even well defined here, and in practice accessing <watcher>._response.content hangs for me.
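The distinction is that a watch is an endless chunked response. For an ordinary, finite response the documented pattern works; roughly (an illustration, here against etcd's plain /version endpoint rather than the watch API):

import requests

# finite body: reading .content consumes it all, and the
# connection is released back to the pool for reuse
resp = requests.get('http://localhost:2379/version', stream=True)
print(resp.content)

# a watch response never finishes its body, so reading
# .content blocks forever; hence the hang mentioned above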

I guess this is a known problem, and that this -

            if six.PY2:
                self._response.raw._fp.close()

is trying to work around it by reaching in and closing the socket by hand. But apparently it isn't working (with Python 2, which is what I'm using).

dims commented 6 years ago

Hey @neiljerram, do you have time to throw a debugger at it? Else I'll try to get to it next week.

nelljerram commented 6 years ago

@dims Thanks for reading and commenting! I'm happy to try debugging if you can give me a bit more guidance on what to do or look for... at the moment I'm afraid I don't know what you have in mind.

dims commented 6 years ago

@neiljerram oh, sorry. Stick the debugger on the self._response.raw._fp.close() line and step in, to see whether it actually closes the file pointer, or whether it throws an exception somewhere so that the file pointer is never closed.
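Concretely, something like this (a sketch to drop into etcd3gw's watch code around that call; self._response is the attribute the snippet above already uses):

import pdb
import traceback

try:
    pdb.set_trace()                   # then 's' steps into close()
    self._response.raw._fp.close()
except Exception:
    traceback.print_exc()             # surfaces any swallowed exception
    raise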

fasaxc commented 6 years ago

Hi @dims, I work with Neil; I'm picking this up and trying to find a workaround. I believe close() is being called on the socket, but the connection is still stuck open. I find that the code in the Python 3 branch of that function does seem to work for Python 2.7.12; the call to shutdown() seems to do the trick.

I'm guessing you put the conditional there for a reason though?

From the Python docs:

Note close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

shutdown() doesn't seem to be new; it looks like it has always been there.
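So the pattern the docs recommend looks roughly like this (a sketch with an illustrative helper name, not the library's exact code):

import socket

def shutdown_then_close(sock):
    # shutdown() actively tears down both directions of the TCP
    # connection; close() alone only drops this reference, which is
    # why the connection can linger in ESTABLISHED
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except socket.error:
        pass  # peer already disconnected; nothing to shut down
    sock.close()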

dims commented 6 years ago

@fasaxc if that works, then let's switch over to it. Please make sure the unit tests pass too; if I remember right, I had trouble with the unit tests without that messy hack.

fasaxc commented 6 years ago

I put up a PR; tox -e py27 passes for me locally. I noticed that the contributor guidelines say to follow the OpenStack process (via Gerrit); is this an OpenStack project?

dims commented 6 years ago

@fasaxc at some point I'll propose this to Oslo. Until then, this is my personal repo.