elastic / apm-agent-python

https://www.elastic.co/guide/en/apm/agent/python/current/index.html
BSD 3-Clause "New" or "Revised" License

v4.1.0 Introduces `http.client.RemoteDisconnected` error #409

Closed: jensenbox closed this issue 5 years ago

jensenbox commented 5 years ago

Describe the bug: Events reported to the APM Server are randomly getting rejected. It appears that every so often an event submission is met with a `RemoteDisconnected` exception.

To Reproduce

  1. Set up a Django project as per the directions
  2. Hit the homepage of the project
  3. Wait for several events to hit the APM server
  4. One will eventually cause the exception

Expected behavior: No exception.

Environment (please complete the following information)

Additional context

Downgrading to 4.0.3 remedies the issue.

Directly related: https://discuss.elastic.co/t/unable-to-reach-apm-server-connection-aborted-remotedisconnected-remote-end-closed-connection-without-response/166500

web_1      | Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)"
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1      |     chunked=chunked)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1      |     six.raise_from(e, None)
web_1      |   File "<string>", line 2, in raise_from
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1      |     httplib_response = conn.getresponse()
web_1      |   File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1      |     rv = real_getresponse(self, *args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1      |     response.begin()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1      |     version, status, reason = self._read_status()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1      |     raise RemoteDisconnected("Remote end closed connection without"
web_1      | http.client.RemoteDisconnected: Remote end closed connection without response
web_1      | 
web_1      | During handling of the above exception, another exception occurred:
web_1      | 
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 40, in send
web_1      |     "POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 323, in urlopen
web_1      |     response = conn.urlopen(method, u.request_uri, **kw)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1      |     return wrapped(*args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1      |     return wrapped(*args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
web_1      |     _stacktrace=sys.exc_info()[2])
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
web_1      |     raise six.reraise(type(error), error, _stacktrace)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
web_1      |     raise value.with_traceback(tb)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1      |     chunked=chunked)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1      |     six.raise_from(e, None)
web_1      |   File "<string>", line 2, in raise_from
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1      |     httplib_response = conn.getresponse()
web_1      |   File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1      |     rv = real_getresponse(self, *args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1      |     response.begin()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1      |     version, status, reason = self._read_status()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1      |     raise RemoteDisconnected("Remote end closed connection without"
web_1      | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
web_1      | 
web_1      | During handling of the above exception, another exception occurred:
web_1      | 
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/base.py", line 184, in send_sync
web_1      |     self.sync_transport.send(self, data)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 53, in send
web_1      |     raise TransportException(message, data, print_trace=print_trace)
web_1      | elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)
beniwohli commented 5 years ago

Hi @jensenbox. Thanks for opening the issue! I'm currently investigating this problem, but haven't found a good reason for why this happens more frequently in 4.1.0.

Ultimately, I think the solution will be to introduce a retry mechanism for these cases, but I need to investigate the root issue a bit more before taking that step.

jensenbox commented 5 years ago

A retry feels hacky to me. My gut tells me it has something to do with keep-alive or the connection pool inside urllib3.

I recall something about setting stream=true somewhere to make this work.

I am more than happy to give a branch a whirl if you wish. It is easy to repro for me.

beniwohli commented 5 years ago

@jensenbox that would be great! I haven't managed to reproduce it yet. Keep-alive indeed sounds like it could be involved, but theoretically, urllib3 should recycle dead "keep alive" connections automatically. Also, it should notice before sending the request, not while waiting for the response...

jensenbox commented 5 years ago

should :)

pjankows commented 5 years ago

I have a reliable but unfortunately very slow reproduction scenario. Assuming you have an APM server running locally (I have 6.6.0 in Docker), put this in main.py:

from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'app'
}
apm = ElasticAPM(app)

@app.route('/')
def hello_world():
    return 'Hello, World!'

Run this with: gunicorn -b 127.0.0.1:8080 main:app

Then issue a single request, e.g.: curl http://127.0.0.1:8080

This should establish a TCP connection that will go from the ESTABLISHED to the TIME_WAIT state after 5 minutes, and you'll get this in the gunicorn log:

Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) (url: http://localhost:8200/intake/v2/events)"
Traceback (most recent call last):
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/http.py", line 40, in send
    "POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
    return wrapped(*args, **kwargs)
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/base.py", line 184, in send_sync
    self.sync_transport.send(self, data)
  File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/http.py", line 53, in send
    raise TransportException(message, data, print_trace=print_trace)
elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) (url: http://localhost:8200/intake/v2/events)
pjankows commented 5 years ago

I managed to make a short reproduction script:

#!/bin/bash

logfile="gunicorn.log"
bind_addr="127.0.0.1:8080"

rm -f "$logfile"
sleep 1
gunicorn -b "$bind_addr" main:app > $logfile 2>&1 &
gunicorn_pid=$!
echo "started gunicorn on pid $gunicorn_pid"
sleep 1

curl "http://$bind_addr"
for i in {1..30}
do
        echo -n "$i "
        sleep 10
done
echo
kill "$gunicorn_pid"

! grep "http.client.RemoteDisconnected" "$logfile"

I ran this through git bisect between v4.0.3 and v4.1.0, and this is the result:

faea3fb13f8d7f86a1914dfe0e738fa3877092f8 is the first bad commit
commit faea3fb13f8d7f86a1914dfe0e738fa3877092f8
Author: Benjamin Wohlwend <beni@elastic.co>
Date:   Wed Jan 2 09:55:59 2019 +0100

    Added metrics type, and created system and process metric set (#361)

    Both a Linux-specific metric set (using the /proc virtual file system) as well as a generic
    metric set using psutil are provided, the former being preferred on Linux.

    closes #361
    fixes #245

:100644 100644 a98e86ba1a61e44a4217aa5f9be33436eefe234f 87ecea9c36d6dc6dcea962e3f887a1d359a848fa M  .appveyor.yml
:100644 100644 248c6ced7e23bff2dd03355168d6548038accb09 0ff86c03efc5a73bde77c92cc0ff21f4ab796a86 M  CHANGELOG.md
:100644 100644 fbe25e6f3ec33230998922235446aceb4d65688c a8184ceade7cd966e9f05761cc3ea971aba25f02 M  conftest.py
:040000 040000 01122dfc7acdb2b93e9eaf82765efb605772a079 f02b277adb3f96490df56b45e0fb20cc9d746777 M  docs
:040000 040000 8941b830a7f3814f9c920a849b29f3df09a0ee45 8ea4ec4842a367838b35258e5d107c11a95d0111 M  elasticapm
:040000 040000 0a891358b11f7019aada787cfd91fdd63796200f f6a6c5ab2a8bb42ddbe6d67241d1fda6d32f9582 M  tests
bisect run success

Hope this helps.

beniwohli commented 5 years ago

Fantastic work @reveil!

I tried the script, but it runs through without issue; the socket remains in the ESTABLISHED state throughout the whole run.

Under which environment do you run this? e.g. Linux, OSX, docker on OSX/Linux, ...

jensenbox commented 5 years ago

For me it happens with docker "python:3" on Mojave.

pjankows commented 5 years ago

I reproduced it on Linux on several distributions (CentOS/Ubuntu/Fedora), both with and without Docker, on Python 3.6 and 3.7. The server was running both locally and on a remote host, but always in Docker.

I'm currently reproducing this on Fedora 29 with the latest updates, using elastic-apm git master. My environment is:

(envapm) [pjankows@kadlubek ~]$ python --version
Python 3.7.2
(envapm) [pjankows@kadlubek ~]$ pip list
Package      Version    Location                       
------------ ---------- -------------------------------
blinker      1.4        
certifi      2018.11.29 
Click        7.0        
elastic-apm  4.1.0      /home/pjankows/apm-agent-python
Flask        1.0.2      
gunicorn     19.9.0     
itsdangerous 1.1.0      
Jinja2       2.10       
MarkupSafe   1.1.0      
pip          19.0.2     
setuptools   40.8.0     
urllib3      1.24.1     
Werkzeug     0.14.1     
wheel        0.32.3     
(envapm) [pjankows@kadlubek ~]$ ./repro.sh 
started gunicorn on pid 18542
Hello, World!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
(envapm) [pjankows@kadlubek ~]$ echo $?
1

I can also reproduce it with docker with the following Dockerfile:

FROM python:3.7
RUN pip install flask gunicorn blinker elastic-apm
COPY main.py /
COPY repro.sh /
CMD /repro.sh

Versions while building: Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 blinker-1.4 certifi-2018.11.29 click-7.0 elastic-apm-4.1.0 flask-1.0.2 gunicorn-19.9.0 itsdangerous-1.1.0 urllib3-1.24.1

When I run it (I use the host network so that Docker networking doesn't complicate things):

[pjankows@kadlubek docker]$ docker run -it --network=host apmtest
started gunicorn on pid 9
Hello, World!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response

The apm server is run with the following docker-compose.yml:

version: '2.2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
    environment:
      - "discovery.type=single-node"
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:9200']
      interval: 10s
      timeout: 5s
      retries: 3
  kibana:
    image: docker.elastic.co/kibana/kibana:6.6.0
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - 5601:5601
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:5601']
      interval: 10s
      timeout: 5s
      retries: 3
  apm:
    image: docker.elastic.co/apm/apm-server:6.6.0
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - 8200:8200
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8200']
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  esdata1:
    driver: local
pjankows commented 5 years ago

I also managed to capture some interesting packets with Wireshark (screenshot attached). It looks like the problem happens when the TCP connection is closed before a 202 HTTP response can be sent for the metrics POST request.

I also saw cases (not shown here) where the TCP connection was terminated after the 202 response; the client then issued a SYN packet and a new connection was established.

jensenbox commented 5 years ago

One more thing to add to the confusion... I just got the error on 4.0.3

[04:02:14][DEBUG] elasticapm.transport base.py:_start_flush_timer:156 | Starting flush timer
[04:02:14][ERROR] elasticapm.transport base.py:handle_transport_fail:148 | Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)"
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 40, in send
    "POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
  File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/base.py", line 181, in send_sync
    self.sync_transport.send(self, data)
  File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 53, in send
    raise TransportException(message, data, print_trace=print_trace)
elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)
[04:02:14][DEBUG] urllib3.connectionpool connectionpool.py:_new_conn:813 | Starting new HTTPS connection (1): sentry.io:443
[04:02:14][DEBUG] urllib3.connectionpool connectionpool.py:_make_request:393 | https://sentry.io:443 "POST /api/1286746/store/ HTTP/1.1" 200 41
[04:02:24][DEBUG] elasticapm.transport base.py:_stop_flush_timer:161 | Cancelling flush timer

Perhaps it has been there all along and something in the latest release just made it worse.

pjankows commented 5 years ago

Updated the script so you don't have to wait till the end if the RemoteDisconnected exception occurs:

#!/bin/bash

logfile="gunicorn.log"
bind_addr="127.0.0.1:8080"

rm -f "$logfile"
sleep 1
gunicorn -t 2 -b "$bind_addr" main:app > $logfile 2>&1 &
gunicorn_pid=$!
echo "started gunicorn on pid $gunicorn_pid"
sleep 1

curl "http://$bind_addr"
for i in {1..36}
do
        echo -n "$i "
        sleep 10
        grep "http.client.RemoteDisconnected" "$logfile" && echo && kill "$gunicorn_pid" && exit 1
        #curl "http://$bind_addr" # when this is enabled the problem does not appear
done
echo
kill "$gunicorn_pid"

! grep "http.client.RemoteDisconnected" "$logfile"

Also, some additional thoughts/information. Unfortunately my reproduction script is not 100% reliable, but it is very close. Increasing the for-loop count makes it more reliable at the cost of making a good pass longer. This looks like a race condition of some sort. I'm starting to wonder whether this is the fault of the APM agent or rather the APM server. I'm also unable to reproduce this error on 4.0.3.

Furthermore, the problem does not appear to manifest itself when there are other requests to the Flask server (e.g. every 10s), or rather, when the APM agent makes other requests to the APM server as a result of that traffic.

Looking more closely at the Wireshark logs, it looks like the APM server sent the first FIN packet, so from my understanding it is the server that closed the TCP connection. The HTTP layer now has a problem, because the client just sent a POST request that did not receive a response at the HTTP level. If this were a GET request it would simply be re-sent, but according to the HTTP spec the client can't do that for a POST: https://tools.ietf.org/html/rfc2616#section-8.1.4

Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2). Non-idempotent methods or sequences MUST NOT be automatically retried

It makes sense from the protocol point of view. A POST request (which is non-idempotent) just got sent and did not get a response. The request may or may not have reached the server, and the client has no way of knowing which. It can't just retry, as the POST may have already worked and a double POST may have unintended consequences. So the behavior of urllib3 is correct: throw an exception rather than retry.
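
To make this concrete, here is a minimal sketch (assuming urllib3 1.24.x, the version shown in the tracebacks above) of why the dropped connection surfaces as an exception rather than a silent retry: POST is deliberately absent from urllib3's default set of retryable methods.

from urllib3.util.retry import Retry

# In urllib3 1.24.x the default retry whitelist only contains idempotent
# verbs, so a POST interrupted by a dropped connection is never resent.
print(Retry.DEFAULT_METHOD_WHITELIST)
# frozenset({'HEAD', 'GET', 'PUT', 'DELETE', 'OPTIONS', 'TRACE'})  <- no POST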

The question that remains, if I'm understanding this correctly, is why the APM server sent the FIN packet and closed the TCP connection. My theory (I haven't looked at the sources yet) is that the APM server may use transactions (so metrics do not count?) rather than HTTP requests as the condition for closing or keeping the TCP connection open. That would mean the race condition closes the TCP connection between the POST request and the response.

Any thoughts on how to proceed?

beniwohli commented 5 years ago

@reveil great analysis, thanks!

I just managed to replicate the issue on my local machine, but only if a) I put the APM Server under a lot of load with a load generator tool and b) put the gunicorn server under a lot of load (using ab). With these two load generators running, I see the error after a couple of minutes. Seeing as your access patterns to trigger the error are completely different, it might actually be a different problem with the same symptom...

I share your suspicion that this might actually be a problem with APM Server. I asked @graphaelli, one of our APM Server devs, to investigate if there are any scenarios where the APM Server would close the connection without returning a response.

pjankows commented 5 years ago

I started to look at the APM server a little more closely, tried version 7.0.0-alpha2, and tried to find something relevant in the server debug logs, but got nothing valuable, so I went back to 6.6.0.

Then I found this setting: https://www.elastic.co/guide/en/apm/server/current/configuration-process.html#read_timeout When it is changed from the default 30s to read_timeout: 45s, I'm no longer able to reproduce the issue with my script. There are also no more FIN packets from the server in Wireshark, not even the harmless ones between HTTP requests.

I also noticed that the 30s may very well be racing with 30000ms from here: https://github.com/elastic/apm-agent-python/blob/v4.1.0/elasticapm/conf/__init__.py#L223

Changing it to 20000ms also makes the issue disappear, even if read_timeout on the server is still 30s. I'm not sure if changing it is desirable, though.

So maybe, to work around this issue, either of these settings needs to be changed so that the server read_timeout is longer than the agent metrics_interval. I'm not sure if this is a proper fix, but it's much better than the current situation.
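
As a sketch of the agent-side variant of that workaround, assuming the Flask setup from earlier (the exact config key and duration format should be double-checked against the agent docs, so treat this as illustrative only):

from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'app',
    # assumption: send metrics more often than the server's 30s read_timeout,
    # so the agent's connection never sits idle long enough to be closed
    'METRICS_INTERVAL': '20s',
}
apm = ElasticAPM(app)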

As for reproducing it on your side, maybe using the docker-compose.yml for the APM server that I posted above is required? It's also somehow a little harder to reproduce if APM server debug logging is enabled.

Also, if there is a ticket on the apm-server side, could you please link it here?

fr0der1c commented 5 years ago

I wonder when this problem will be fixed. My log is literally full of these exceptions.

ciarancourtney commented 5 years ago

Having the same problem in v4.1.0, although only for cron jobs that run a Django command.

beniwohli commented 5 years ago

Sorry for the lack of progress with this folks! Unfortunately, there are a few things with higher priority on my plate at the moment.

@fr0der1c under how much load is your APM Server when this happens? So far, I was only able to reproduce the issue when the server was under very high load.

@reveil I have a hard time understanding how the metrics collection interval has any effect on the HTTP transport. They aren't connected directly. When metrics are collected, they are put on a queue which is then processed by a thread. It only triggers a request if the metrics event causes the queue size to reach an upper limit. Also, there is currently a bug which forces a request every 10 seconds, even if nothing happened in between.

However, we are currently working on a major refactoring of the queueing mechanics (#411). With a bit of luck, it might help with this issue.

pjankows commented 5 years ago

@beniwohli The real reason this bug appears is the way the HTTP protocol works with TCP. Basically, when any HTTP/1.1 (or later) request is made, the server keeps the TCP socket open.

Currently, the default timeout for this TCP connection on the APM Server is 30s. The client may decide to send a new HTTP POST request (metrics or any other) at the same time the server decides to terminate the TCP connection. And it just so happens that metrics are sent every 30s, which makes this much more likely to happen. This is a race condition between three things:

  1. The HTTP POST client request
  2. The TCP FIN packet (which terminates the connection) - this is a result of server read_timeout
  3. The server's HTTP response to request 1.

The error occurs when the TCP FIN (from the server) comes after the HTTP request and before the HTTP response. If these happen in any other order, the bug is not visible. Note that 3 can't happen after 2, as the socket is then closed and the client can't get a response.

You can fix this in several ways:

  1. Increase the server timeout so it is larger than the metrics interval (I proposed a patch to set it to 45s on the server) <-- this is IMHO the best solution
  2. Send the metrics more frequently (e.g. every 15-20s) - not sure if this is possible or desired, as it may alter results on the server - I have not checked this, but it seems problematic
  3. Before sending any HTTP POST, send an HTTP GET (or even HEAD, if the APM server supports it) so that a TCP socket is established - this is a big overhead if done without 4
  4. Keep a timer on the client so that, if no requests were sent for e.g. 20s, an additional GET request is made to keep the TCP socket from closing

Note that solutions 1 and 2 work only when metrics are ENABLED (regular requests more frequent than the timeout keep the TCP socket open). So metrics may actually fix this problem rather than making it more visible (it existed before the metrics, but was much harder to hit).

beniwohli commented 5 years ago

@reveil sure, but the thing is, we don't send metrics every 30 seconds, we collect them every 30 seconds. Then they are queued up for sending, which, depending on the current queue state, can happen immediately (if the queue was almost full) or up to 10 seconds later (if we just sent a request). That's what I tried to say above when I wrote that metrics collection is decoupled from sending.

If there is a race condition, I think it is more probable to come from the 10s timer to send requests, which lines up with the server timeout every 3rd time. What we could do here is to add random(-1, 1) to the timer, which would make it less common for them to sync up exactly.
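
Something along these lines, purely as an illustration (the function name and the 10s base interval are my assumptions here, not the agent's actual internals):

import random

def next_flush_delay(base_interval=10.0):
    # add +/- 1 second of jitter so the agent's sends stop lining up
    # exactly with the server's 30s read_timeout
    return base_interval + random.uniform(-1.0, 1.0)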

pjankows commented 5 years ago

@beniwohli OK, so they are decoupled. But what if the queue is empty and there are no other requests? Then after 30s they are collected and immediately sent, right? So the net result is that they are sent every 30s in this scenario.

pjankows commented 5 years ago

@beniwohli Also note the root cause of the problem is not the metrics sending. You can just as well have transactions at regular intervals (common in APIs) and trigger this bug. Someone managed to do it on 4.0.3 so it is possible without metrics.

The problem is that there exists a scenario in which the TCP connection can be closed at all. The best solution is to never allow that to happen, because once the TCP FIN is sent you can't guarantee that an HTTP POST isn't being sent at the same time, which is itself a race condition. A workaround is to send an extra GET/HEAD just before the POST to make sure the TCP socket is open at the time of the POST.
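
Roughly, that workaround would look like this with a plain urllib3 PoolManager (a sketch only, not the agent's actual transport code; the HEAD-to-/ warm-up request is an assumption):

import urllib3

http = urllib3.PoolManager()
server = "http://localhost:8200"

# cheap idempotent request first: if the keep-alive socket has gone stale,
# it is detected (and recycled) here instead of on the non-retryable POST
http.request("HEAD", server + "/", timeout=5)
http.request(
    "POST",
    server + "/intake/v2/events",
    body=b"...",  # placeholder for the ndjson payload
    headers={"Content-Type": "application/x-ndjson"},
)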

beniwohli commented 5 years ago

@reveil not at the moment, due to a bug that causes a request every 10s even if the queue is empty (among a lot of other things, that bug is fixed in #411)

Sending a HEAD request before the POST sounds interesting. I'll look into that.

pjankows commented 5 years ago

@beniwohli Is this bug of sending every 10s only on master, or also in 4.1.0? Probably not in 4.1.0.

Actually, this behaviour you describe would fix #409. The reason is that a request every 10s would keep the TCP socket open. If you try to reproduce this bug with an extra request every 10s, it is impossible to trigger. So this bug may actually be a fix. Maybe consider keeping it as a feature?

The best way to send the HEAD is to have a timer on the client/agent so that you send it only when e.g. 20s has passed since the last request. But that can of course come as a future improvement, as fixing this is probably the priority.

beniwohli commented 5 years ago

That bug has been in there since 4.0, so it doesn't help with #409, I'm afraid.

pjankows commented 5 years ago

@beniwohli Very strange, I definitely got a metrics request exactly every 30s when there were no other transactions. You can see proof in the Wireshark packet capture above: 30s and then 60s in the time column.

pjankows commented 5 years ago

@beniwohli I probably misunderstood the 10s bug you described. Which one is it:

  1. It makes a request every 10s even if there are neither transactions nor metrics happening? OR
  2. Adds a 10s delay after collecting to sending the metrics?

I initially assumed 1, but if 2 is really the case, then after the first couple of metrics are sent, sending will still happen at 30s intervals, just shifted 10s from the collection. This has no effect on this bug, as only the network and the HTTP and TCP protocols are involved here, and it does not matter when the metrics were collected.

hylarucoder commented 5 years ago

@reveil can you curl your http://yourappserver:port/ and paste the response content?

beniwohli commented 5 years ago

@reveil @ciarancourtney @fr0der1c sorry for the silence here, I was on PTO last week. In the meantime, @simitt was investigating this issue from the server side, and a possible fix could be to use an idle timeout instead of a read timeout. Silvia also proposed setting SO_KEEPALIVE on the underlying TCP socket, which would send a ping over the TCP connection every so often to, well, keep it alive.

I've implemented this in #422. Would it be possible for you folks to give this branch a try and see if it fixes the issue for you?
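
For anyone curious what the SO_KEEPALIVE idea looks like in urllib3 terms, it is roughly the following (a sketch under my own assumptions, not the actual diff in #422):

import socket
from urllib3 import HTTPConnectionPool
from urllib3.connection import HTTPConnection

# extend urllib3's default socket options so every new connection to the
# APM Server has TCP keepalive enabled
keepalive_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
]
pool = HTTPConnectionPool("localhost", 8200, socket_options=keepalive_options)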

MrBones757 commented 5 years ago

I'm also experiencing this issue, with the APM client embedded in a Python Flask API server. The APM Server is running as a 4-node DaemonSet on Kubernetes, exposed as a NodePort Service. The APM server is under NO load, as this was deployed specifically for the API server.

If I can contribute any additional information, let me know! Looking forward to a fix for this so I can play around with APM for this deployment.

beniwohli commented 5 years ago

@MrBones757 if you could try the branch from #422 in your environment to see if it fixes the issue, that would be super helpful. You can install the branch e.g. via pip:

pip install "https://github.com/beniwohli/apm-agent-python/archive/tcp-keepalive.zip#egg=elastic-apm"
ciarancourtney commented 5 years ago

@beniwohli I applied the patch manually and haven't noticed any difference since yesterday for my cron process that calls into Django. Would I need to do a full install like the one above to bust the pycache, or, given that it's a cron process, would that be redundant?

beniwohli commented 5 years ago

@ciarancourtney I don't think the bytecode cache would be an issue in this case, so that would unfortunately be an indication that #422 doesn't help fix the problem :(

ciarancourtney commented 5 years ago

Just did a full install as above to confirm. Note you may need to pip install -U to overwrite the existing install

ciarancourtney commented 5 years ago

Nope, just got an error 10min ago :(

MrBones757 commented 5 years ago

I was able to resolve this issue for now on the current stable version. I updated my APM Server config to be the following:

I have not observed any issues to date, though I'm working on deploying it to my environment to test under a little more load. Given that my app is still in development, though, I'm not sure I'll be able to replicate appropriate load.

apm-server:
  read_timeout: 45s
  write_timeout: 45s
  max_connections: 0
  host: 0.0.0.0:8200
  rum:
    enabled: true
output:
  elasticsearch:
    enabled: true
    hosts:

ciarancourtney commented 5 years ago

Some of @MrBones757's settings worked for me too, specifically:

apm-server:
  read_timeout: 45s
  write_timeout: 45s
  max_connections: 0
beniwohli commented 5 years ago

@ciarancourtney @MrBones757 sorry to bug you folks again. I've implemented another approach that might work: https://github.com/elastic/apm-agent-python/pull/426. Specifically, in this branch, we add some fuzziness to the time between requests to the APM Server (between 90-110% of the configured value). The original reason I implemented this was to avoid stampedes when multiple workers are booted up at the same time, but it might help with this race condition as well.

pip install "https://github.com/beniwohli/apm-agent-python/archive/fuzzy-api-request-time.zip#egg=elastic-apm"
donAFRO commented 5 years ago

I was getting around 15 messages with this error per hour, but your solution seems to work for me; in the last hour I did not receive any of these errors. Thanks!

ciarancourtney commented 5 years ago

Yep, seems fine for the last 7 hrs (reverted my APM server conf back to defaults).

pjankows commented 5 years ago

Sorry for the late response.

I was not able to reproduce the issue with my script running for 1h on the fuzzy-api-request-time branch. Usually the error appeared after about 5 minutes. I'm not sure this is a proper solution, though, compared to setting read_timeout: 45s on the APM server. As a workaround it is good, especially when running with the server's default settings.

To validate it properly, I will try to debug it with Wireshark and look for FIN packets from the server. If there are any, there is a chance of this bug triggering (I can't estimate how big; probably the same as on 4.0.3; remember, someone got this on 4.0.3). Setting a larger read timeout on the server prevented the FIN TCP packets from appearing, which guarantees this bug will not happen. I will let you know later whether this branch also prevents FIN packets.

beniwohli commented 5 years ago

Awesome, thanks for the update @reveil, and a huge thanks for all the time you have already invested in investigating this issue!

pjankows commented 5 years ago

@beniwohli Unfortunately, FIN packets from the server are still present (screenshot attached). You can see a TCP FIN originating from port 8200 followed by a SYN from the client. While this is not a problem in this case (as the FIN arrived after the HTTP response), there is still a risk (although much smaller) that the FIN packet may end up between the HTTP request and response and cause this bug.

I would advise fixing defaults on the server side as well. I have been running in production with read_timeout: 45s for a month now and have not seen any issues since.

beniwohli commented 5 years ago

@reveil awesome, thanks! As far as I understand, a FIN per se isn't a problem, if it is only sent after the HTTP response was received. In this case, urllib3 will recognize that the connection was closed and recycle it.

I will go ahead and merge #426, even if it isn't a 100% fix for the problem. It seems to at least substantially reduce the number of connection issues. It'll take some time until a server side fix can be released (currently slated for 7.1, which is some time out).

MrBones757 commented 5 years ago

Thanks for the change, and apologies for the slow reply. I can confirm the changes also worked for me in my environment. Much appreciated!

beniwohli commented 5 years ago

Thanks again all for your help debugging this! I just released version 4.2.0 of the agent which contains #426 and a bunch of other goodies.

seocam commented 5 years ago

I'm running version 4.2.2 of elastic-apm with apm-server 6.7.1 and I'm still experiencing this issue.

beniwohli commented 5 years ago

@seocam ugh, sorry to hear :( how often do you see the error?

seocam commented 5 years ago

We installed the APM agent on our servers yesterday and have gotten the error 6 times on Sentry since then. Now that I'm checking, I realize that so far it has only affected Celery servers. Since yesterday we've had around 600 tasks processed, so it's about 1%.

pjankows commented 5 years ago

@seocam Could you try running the APM server with a 45s read_timeout? This is a perfect workaround for me, at least until an APM server version containing the fix (https://github.com/elastic/apm-server/pull/2122) is released.

seocam commented 5 years ago

Just set it. I'll let you know the outcome! Thanks.