Closed jensenbox closed 5 years ago
Hi @jensenbox. Thanks for opening the issue! I'm currently investigating this problem, but haven't found a good reason for why this happens more frequently in 4.1.0.
Ultimately, I think the solution will be to introduce a retry mechanism for these cases, but I need to investigate the root issue a bit more before taking that step.
Retry feels hacky to me. My gut tells me it is something to do with keep alive or the connection pool inside urllib3.
I recall something about setting stream=true somewhere to make this work.
I am more than happy to give a branch a whirl if you wish. It is easy to repro for me.
@jensenbox that would be great! I haven't managed to reproduce it yet. Keep-alive indeed sounds like it could be involved, but theoretically, urllib3 should recycle dead "keep alive" connections automatically. Also, it should notice before sending the request, not while waiting for the response...
should :)
I have a reliable but unfortunately very slow reproduction scenario. Assuming you have an APM server running locally (I have 6.6.0 in docker), put this in main.py:
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM
app = Flask(__name__)
app.config['ELASTIC_APM'] = {
'SERVICE_NAME': 'app'
}
apm = ElasticAPM(app)
@app.route('/')
def hello_world():
return 'Hello, World!'
Run this with:
gunicorn -b 127.0.0.1:8080 main:app
Then issue a single request, e.g.:
curl http://127.0.0.1:8080
This should establish a TCP connection that will go from ESTABLISHED to TIME_WAIT state after 5 minutes, and you'll get this in the gunicorn log:
Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) (url: http://localhost:8200/intake/v2/events)"
Traceback (most recent call last):
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/http.py", line 40, in send
"POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/poolmanager.py", line 323, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
return wrapped(*args, **kwargs)
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/home/pjankows/swbot/apm/apmenv/lib/python3.6/site-packages/urllib3-1.24.1-py3.6.egg/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/base.py", line 184, in send_sync
self.sync_transport.send(self, data)
File "/home/pjankows/swbot/apm/apm-agent-python/elasticapm/transport/http.py", line 53, in send
raise TransportException(message, data, print_trace=print_trace)
elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',)) (url: http://localhost:8200/intake/v2/events)
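If you want to watch the socket state transition mentioned above while waiting, something like this small psutil helper could work (just a sketch; psutil is assumed to be installed and the APM Server is assumed to listen on port 8200):
# Sketch: poll TCP connections to the APM Server and print their states.
# psutil.net_connections() may need elevated privileges on some platforms.
import time
import psutil

APM_PORT = 8200  # adjust if your APM Server listens elsewhere

while True:
    states = [c.status for c in psutil.net_connections(kind="tcp")
              if c.raddr and c.raddr.port == APM_PORT]
    print(time.strftime("%H:%M:%S"), states or "no connection")
    time.sleep(5)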
I managed to make a short reproduction script:
#!/bin/bash
logfile="gunicorn.log"
bind_addr="127.0.0.1:8080"
rm -f "$logfile"
sleep 1
gunicorn -b "$bind_addr" main:app > $logfile 2>&1 &
gunicorn_pid=$!
echo "started gunicorn on pid $gunicorn_pid"
sleep 1
curl "http://$bind_addr"
for i in {1..30}
do
echo -n "$i "
sleep 10
done
echo
kill "$gunicorn_pid"
! grep "http.client.RemoteDisconnected" "$logfile"
I ran this through git bisect between v4.0.3 and v4.1.0, and this is the result:
faea3fb13f8d7f86a1914dfe0e738fa3877092f8 is the first bad commit
commit faea3fb13f8d7f86a1914dfe0e738fa3877092f8
Author: Benjamin Wohlwend <beni@elastic.co>
Date: Wed Jan 2 09:55:59 2019 +0100
Added metrics type, and created system and process metric set (#361)
Both a Linux-specific metric set (using the /proc virtual file system) as well as a generic
metric set using psutil are provided, the former being preferred on Linux.
closes #361
fixes #245
:100644 100644 a98e86ba1a61e44a4217aa5f9be33436eefe234f 87ecea9c36d6dc6dcea962e3f887a1d359a848fa M .appveyor.yml
:100644 100644 248c6ced7e23bff2dd03355168d6548038accb09 0ff86c03efc5a73bde77c92cc0ff21f4ab796a86 M CHANGELOG.md
:100644 100644 fbe25e6f3ec33230998922235446aceb4d65688c a8184ceade7cd966e9f05761cc3ea971aba25f02 M conftest.py
:040000 040000 01122dfc7acdb2b93e9eaf82765efb605772a079 f02b277adb3f96490df56b45e0fb20cc9d746777 M docs
:040000 040000 8941b830a7f3814f9c920a849b29f3df09a0ee45 8ea4ec4842a367838b35258e5d107c11a95d0111 M elasticapm
:040000 040000 0a891358b11f7019aada787cfd91fdd63796200f f6a6c5ab2a8bb42ddbe6d67241d1fda6d32f9582 M tests
bisect run success
Hope this helps.
Fantastic work @reveil!
I tried the script, but it runs through without issue; the socket remains in ESTABLISHED state throughout the whole run.
Under which environment do you run this? e.g. Linux, OSX, docker on OSX/Linux, ...
For me it happens with docker "python:3" on Mojave.
I reproduced it on Linux on several distributions (CentOS/Ubuntu/Fedora), both with and without docker, on Python 3.6 and 3.7. The server was running both locally and on a remote host, but always in docker.
I'm currently reproducing this on Fedora 29 with the latest updates, using elastic-apm git master. My environment is:
(envapm) [pjankows@kadlubek ~]$ python --version
Python 3.7.2
(envapm) [pjankows@kadlubek ~]$ pip list
Package Version Location
------------ ---------- -------------------------------
blinker 1.4
certifi 2018.11.29
Click 7.0
elastic-apm 4.1.0 /home/pjankows/apm-agent-python
Flask 1.0.2
gunicorn 19.9.0
itsdangerous 1.1.0
Jinja2 2.10
MarkupSafe 1.1.0
pip 19.0.2
setuptools 40.8.0
urllib3 1.24.1
Werkzeug 0.14.1
wheel 0.32.3
(envapm) [pjankows@kadlubek ~]$ ./repro.sh
started gunicorn on pid 18542
Hello, World!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
(envapm) [pjankows@kadlubek ~]$ echo $?
1
I can also reproduce it with docker with the following Dockerfile:
FROM python:3.7
RUN pip install flask gunicorn blinker elastic-apm
COPY main.py /
COPY repro.sh /
CMD /repro.sh
Versions while building:
Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 blinker-1.4 certifi-2018.11.29 click-7.0 elastic-apm-4.1.0 flask-1.0.2 gunicorn-19.9.0 itsdangerous-1.1.0 urllib3-1.24.1
When I run it (I use the host network so that docker networking doesn't complicate things):
[pjankows@kadlubek docker]$ docker run -it --network=host apmtest
started gunicorn on pid 9
Hello, World!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
http.client.RemoteDisconnected: Remote end closed connection without response
The apm server is run with the following docker-compose.yml:
version: '2.2'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
environment:
- "discovery.type=single-node"
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- esdata1:/usr/share/elasticsearch/data
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:9200']
interval: 10s
timeout: 5s
retries: 3
kibana:
image: docker.elastic.co/kibana/kibana:6.6.0
depends_on:
elasticsearch:
condition: service_healthy
ports:
- 5601:5601
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:5601']
interval: 10s
timeout: 5s
retries: 3
apm:
image: docker.elastic.co/apm/apm-server:6.6.0
depends_on:
elasticsearch:
condition: service_healthy
ports:
- 8200:8200
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8200']
interval: 10s
timeout: 5s
retries: 3
volumes:
esdata1:
driver: local
I also managed to capture some interesting packets with Wireshark: it looks like the problem happens when the TCP connection is closed before the 202 HTTP response can be sent for the metrics POST request.
I also saw cases (not shown here) where the TCP connection was terminated after the 202 response, and then the client issued a SYN packet and a new connection was established.
One more thing to add to the confusion... I just got the error on 4.0.3
[04:02:14][DEBUG] elasticapm.transport base.py:_start_flush_timer:156 | Starting flush timer
[04:02:14][ERROR] elasticapm.transport base.py:handle_transport_fail:148 | Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)"
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 40, in send
"POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 323, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/base.py", line 181, in send_sync
self.sync_transport.send(self, data)
File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 53, in send
raise TransportException(message, data, print_trace=print_trace)
elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)
[04:02:14][DEBUG] urllib3.connectionpool connectionpool.py:_new_conn:813 | Starting new HTTPS connection (1): sentry.io:443
[04:02:14][DEBUG] urllib3.connectionpool connectionpool.py:_make_request:393 | https://sentry.io:443 "POST /api/1286746/store/ HTTP/1.1" 200 41
[04:02:24][DEBUG] elasticapm.transport base.py:_stop_flush_timer:161 | Cancelling flush timer
Perhaps it has been there all along and something in the latest release just made it worse.
Updated the script so you don't have to wait till the end if the RemoteDisconnected exception occurs:
#!/bin/bash
logfile="gunicorn.log"
bind_addr="127.0.0.1:8080"
rm -f "$logfile"
sleep 1
gunicorn -t 2 -b "$bind_addr" main:app > $logfile 2>&1 &
gunicorn_pid=$!
echo "started gunicorn on pid $gunicorn_pid"
sleep 1
curl "http://$bind_addr"
for i in {1..36}
do
echo -n "$i "
sleep 10
grep "http.client.RemoteDisconnected" "$logfile" && echo && kill "$gunicorn_pid" && exit 1
#curl "http://$bind_addr" # when this is enabled the problem does not appear
done
echo
kill "$gunicorn_pid"
! grep "http.client.RemoteDisconnected" "$logfile"
I also have some additional thoughts/information. Unfortunately my reproduction script is not 100% reliable, but very close. Increasing the for loop count makes it more reliable, at the cost of making a good pass take longer. This looks like a race condition of some sort. I'm starting to wonder whether this is the fault of the APM agent or rather the APM server. I'm also unable to reproduce this error on 4.0.3.
Furthermore, the problem does not appear to manifest itself when there are other requests to the Flask server (e.g. every 10s), or rather when the APM agent makes other transaction-related requests to the APM server as a result of that traffic.
Looking more closely at the Wireshark logs, it looks like the APM server sent the first FIN packet, so as I understand it, it is the server that closed the TCP connection. HTTP now has a problem, because the client just sent a POST request that did not receive a response at the HTTP level. If this were a GET request it would simply be re-sent, but according to the HTTP spec the client can't do that for a POST: https://tools.ietf.org/html/rfc2616#section-8.1.4
Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2). Non-idempotent methods or sequences MUST NOT be automatically retried
It makes sense from the protocol point of view. A POST request (which is non-idempotent) was just sent and did not get a response. The request may or may not have reached the server, and the client has no way of knowing which. It can't just retry, because the POST may already have succeeded and a double POST may have unintended consequences. So the behavior of urllib3 is correct: throw an exception rather than retry.
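For reference, urllib3 does let a caller opt into retrying non-idempotent methods explicitly. A rough sketch with the urllib3 1.24 API is below; whether the agent should do this is exactly the trade-off described above, since a retried POST may be processed twice:
# Sketch: explicitly allow urllib3 to retry POSTs on connection-level errors.
# POST is not in urllib3's default method whitelist, so a dropped keep-alive
# connection surfaces as ProtocolError instead of being retried silently.
import urllib3
from urllib3.util.retry import Retry

retry = Retry(
    total=3,
    backoff_factor=0.1,
    method_whitelist=frozenset({"POST"}),  # named allowed_methods in newer urllib3
)
pool = urllib3.PoolManager(retries=retry)
# example intake request; a real agent payload would be ND-JSON
pool.urlopen("POST", "http://localhost:8200/intake/v2/events", body=b"{}")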
The question that remains, if I'm understanding this correctly, is why the APM server sent the FIN packet and closed the TCP connection. My theory (I haven't looked at the sources yet) is that maybe the APM server looks at transactions (so metrics don't count?) rather than HTTP requests as the condition to close or keep the TCP connection open. This would mean the race condition is the TCP connection being closed between the POST request and its response.
Any thoughts on how to proceed?
@reveil great analysis, thanks!
I just managed to replicate the issue on my local machine, but only if a) I put the APM Server under a lot of load with a load generator tool and b) put the gunicorn server under a lot of load (using ab). With these two load generators running, I see the error after a couple of minutes. Seeing as your access patterns to trigger the error are completely different, it might actually be a different problem with the same symptom...
I share your suspicion that this might actually be a problem with APM Server. I asked @graphaelli, one of our APM Server devs, to investigate if there are any scenarios where the APM Server would close the connection without returning a response.
I started to look at the APM server a little more closely: I tried version 7.0.0-alpha2 and tried to find something relevant in the server debug logs, but got nothing valuable, so I went back to 6.6.0.
Then I found this setting:
https://www.elastic.co/guide/en/apm/server/current/configuration-process.html#read_timeout
When changed from default 30s to:
read_timeout: 45s
I'm now unable to reproduce the issue with my script. Also, there are no more FIN packets from the server in Wireshark, not even the harmless ones between HTTP requests.
I also noticed that the 30s may very well be racing with 30000ms from here: https://github.com/elastic/apm-agent-python/blob/v4.1.0/elasticapm/conf/__init__.py#L223
Also changing it to 20000ms makes the issue disappear even if read_timeout on the server is still 30s. Not sure if changing it is desired though.
So maybe, to work around this issue, either of these settings needs to be changed so that the server's read_timeout is longer than the agent's metrics interval. I'm not sure if this is a proper fix, but it's much better than the current situation.
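If someone wants to try the agent-side variant (lowering the metrics interval below the server's read_timeout), in the Flask example above it would look roughly like this; the option name (METRICS_INTERVAL) and the millisecond unit are assumptions based on the config file linked above, so double-check them against your agent version:
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'app',
    # assumed option name and unit; verify against elasticapm/conf/__init__.py
    'METRICS_INTERVAL': 20000,
}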
As for reproduction on your side, maybe using the docker-compose.yml for the APM server that I posted above is required? It's also somewhat harder to reproduce if APM server debug logging is enabled.
Also, if there is a ticket on the apm-server side, could you please link it here?
I wonder when this problem will be fixed. My log is literally full of these exceptions.
Having the same problem in v4.1.0, although only for crons that run a Django command
Sorry for the lack of progress with this folks! Unfortunately, there are a few things with higher priority on my plate at the moment.
@fr0der1c under how much load is your APM Server when this happens? So far, I was only able to reproduce the issue when the server was under very high load.
@reveil I have a hard time understanding how the metrics collection interval has any effect on the HTTP transport. They aren't connected directly. When metrics are collected, they are put on a queue which is then processed by a thread. It only triggers a request if the metrics event causes the queue size to reach an upper limit. Also, there is currently a bug which forces a request every 10 seconds, even if nothing happened in between.
However, we are currently working on a major refactoring of the queueing mechanics (#411). With a bit of luck, it might help with this issue.
@beniwohli The real reason this bug appears is the way the HTTP protocol interacts with TCP. Basically, when any HTTP/1.1 (or later) request is made, the server keeps the TCP socket open.
Currently the default timeout for this TCP connection on the APM Server is 30s. The client may decide to send a new HTTP POST request (metrics or any other request) at the same time the server decides to terminate the TCP connection. And it just so happens that metrics are sent every 30s, which makes this much more likely to happen. This is a race condition of 3 things:
The error occurs when the TCP FIN (from the server) arrives after the HTTP request and before the HTTP response. If this happens in any other order, the bug is not visible. Note the problem is that 3 can't happen after 2, as the socket is now closed and the client can't get a response.
You can fix this in several ways:
Note that solutions 1 and 2 work only when metrics are ENABLED (regular requests more frequent than the timeout keep the TCP socket open). So metrics may actually fix this problem instead of just making it more visible (it was here before metrics, but much harder to hit).
@reveil sure, but the thing is, we don't send metrics every 30 seconds, we collect them every 30 seconds. Then they are queued up for sending, which, depending on the current queue state, can happen immediately (if the queue was almost full) or up to 10 seconds (if we just sent a request). That's what I tried to say above when I wrote that metrics collection is decoupled from sending.
If there is a race condition, I think it is more probable to come from the 10s timer to send requests, which lines up with the server timeout every 3rd time. What we could do here is to add random(-1, 1) to the timer, which would make it less common for them to sync up exactly.
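A minimal sketch of that kind of jitter (plain Python, not the agent's actual timer code):
import random

FLUSH_INTERVAL = 10.0  # seconds between requests to the APM Server

def next_flush_delay():
    # add up to +/- 1s of jitter so the client timer and the server's
    # read_timeout are unlikely to line up exactly, request after request
    return FLUSH_INTERVAL + random.uniform(-1.0, 1.0)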
@beniwohli Ok, so they are decoupled. But what if the queue is empty and there are no other requests? Then after 30s they are collected and immediately sent, right? So the net result is that they are sent every 30s in this scenario.
@beniwohli Also note the root cause of the problem is not the metrics sending. You can just as well have transactions at regular intervals (common in APIs) and trigger this bug. Someone managed to do it on 4.0.3, so it is possible without metrics.
The problem is that there exists a scenario where the TCP connection can be closed at all. The best solution is to never allow that to happen, because once the TCP FIN is sent you can't guarantee that an HTTP POST isn't being sent at the same time, which is itself a race condition. A workaround is sending an extra GET/HEAD just before the POST, to make sure the TCP socket is open at the time of the POST.
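For illustration, that workaround could look roughly like this at the urllib3 level (just a sketch, not the agent's transport code; the "/" health endpoint is an assumption about the APM Server):
import urllib3

pool = urllib3.PoolManager()
server = "http://localhost:8200"

def send_events(body, headers):
    # HEAD is idempotent, so urllib3 retries it transparently if the kept-alive
    # connection turns out to be dead; this narrows (but does not eliminate)
    # the window in which the POST can hit a half-closed socket.
    pool.urlopen("HEAD", server + "/")
    return pool.urlopen("POST", server + "/intake/v2/events",
                        body=body, headers=headers)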
@reveil not at the moment, due to a bug that causes a request every 10s even if the queue is empty (among a lot of other things, that bug is fixed in #411)
Sending a HEAD request before the POST sounds interesting. I'll look into that.
@beniwohli Is this bug of sending every 10s just on master or also in 4.1.0? Probably not in 4.1.0.
Actually, this behaviour you describe would fix #409. The reason is that a request every 10s would keep the TCP socket open. If you try to reproduce this bug with an extra request every 10s, it is impossible to trigger. So this bug may actually be a fix; maybe consider keeping it as a feature?
The best way to send the HEAD would be to have a timer on the client/agent so that you send it only when e.g. 20s have passed since the last request. But that can of course come as a future improvement, as fixing this is probably the priority.
That bug has been in there since 4.0, so it doesn't help with #409 I'm afraid.
@beniwohli Very strange, I definitely get a metrics request exactly every 30s when there are no other transactions. You can see proof in the Wireshark packet capture above: you can see 30s and then 60s in the time column.
@beniwohli I probably misunderstood the 10s bug you described. Which one is it:
I initially assumed 1, but if 2 is really the case, then after the first couple of metrics are sent, sending will still happen at 30s intervals, just shifted 10s from the collection. That has no effect on this bug, as only the network and the HTTP and TCP protocols are involved here, and it does not matter when the metrics were collected.
@reveil can you curl your http://yourappserver:port/ and paste the response content?
@reveil @ciarancourtney @fr0der1c sorry for the silence here, I was on PTO last week. In the meantime, @simitt was investigating this issue from the server side, and a possible fix could be to use an idle timeout instead of a read timeout. Silvia also proposed to set SO_KEEPALIVE on the underlying TCP socket, which would send a ping over the TCP connection every so often to, well, keep it alive.
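For illustration, TCP keep-alive can be enabled on urllib3-managed sockets roughly like this (a sketch of the general idea, not necessarily how #422 implements it):
import socket
import urllib3
from urllib3.connection import HTTPConnection

# keep urllib3's defaults (e.g. TCP_NODELAY) and add SO_KEEPALIVE so the OS
# sends periodic keep-alive probes on otherwise idle connections
socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
]
pool = urllib3.PoolManager(socket_options=socket_options)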
I've implemented this in #422. Would it be possible for you folks to give this branch a try and see if it fixes the issue for you?
I'm also experiencing this issue with the APM client embedded in a Python Flask API server. The APM Server is running as a 4-node daemonset on Kubernetes, exposed as a NodePort Service. The APM server is under NO load, as it was deployed specifically for this API server.
If I can contribute any additional information, let me know! Looking forward to a fix for this so I can play around with APM for this deployment.
@MrBones757 if you could try the branch from #422 in your environment to see if it fixes the issue, that would be super helpful. You can install the branch e.g. via pip:
pip install "https://github.com/beniwohli/apm-agent-python/archive/tcp-keepalive.zip#egg=elastic-apm"
@beniwohli I applied the patch manually and didn't notice any difference since yesterday for my cron process that calls into Django. Would I need to do a full install like above to bust the pycache, or would that be redundant given that it's a cron process?
@ciarancourtney I don't think the bytecode cache would be an issue in this case, so that would unfortunately be an indication that #422 doesn't help fixing the problem :(
Just did a full install as above to confirm. Note you may need pip install -U to overwrite the existing install.
Nope, just got an error 10min ago :(
I was able to resolve this issue for now on the current stable version. I updated my APM Server config to be the following:
apm-server:
  read_timeout: 45s
  write_timeout: 45s
  max_connections: 0
  host: 0.0.0.0:8200
  rum:
    enabled: true
output:
  elasticsearch:
    enabled: true
    hosts:
I have not observed any issues to date, though I'm working on deploying it to my environment to test under a little more load. Given my app is still in development, I'm not sure I'll be able to replicate appropriate load.
Some of @MrBones757's settings worked for me too, specifically:
apm-server:
  read_timeout: 45s
  write_timeout: 45s
  max_connections: 0
@ciarancourtney @MrBones757 sorry to bug you folks again. I've implemented another approach that might work: https://github.com/elastic/apm-agent-python/pull/426. Specifically, in this branch, we add some fuzziness to the time between requests to the APM Server (between 90-110% of the configured value). The original reason I implemented this was to avoid stampedes when multiple workers are booted up at the same time, but it might help with this race condition as well.
pip install "https://github.com/beniwohli/apm-agent-python/archive/fuzzy-api-request-time.zip#egg=elastic-apm"
I was getting around 15 messages with this error per hour, but your solution seems to work for me; in the last hour I did not receive any of them. Thanks!
Yep, seems fine for last 7hrs (reverted my apm server conf back to defaults)
Sorry for the late response.
I was not able to reproduce the issue with my script running for 1h on the fuzzy-api-request-time branch. Usually the error appeared after about 5 minutes. I'm not sure this is a proper solution compared to setting read_timeout: 45s on the APM server, but as a workaround it is good, especially when running with the server's default settings.
To validate it properly I will try to debug it with Wireshark and look for FIN packets from the server. If there are any, there is a chance for this bug to trigger (I can't estimate how big, probably the same as on 4.0.3; remember someone got this on 4.0.3). Setting a larger read timeout on the server prevented the FIN TCP packets from appearing, which guarantees this bug will not happen. I will let you know later whether this branch also prevents FIN packets or not.
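For anyone who wants to repeat that check without a full Wireshark session, a quick sketch using scapy (my assumption, not part of the original setup; it needs root and assumes the server is on port 8200):
from scapy.all import TCP, sniff

def server_fin(pkt):
    # FIN flag set and the packet originates from the APM Server's port
    return pkt.haslayer(TCP) and pkt[TCP].sport == 8200 and pkt[TCP].flags & 0x01

# print every FIN the server sends; run this alongside the reproduction script
sniff(filter="tcp port 8200", lfilter=server_fin,
      prn=lambda p: p.summary(), store=False)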
Awesome, thanks for the update @reveil, and a huge thanks for all the time you've already invested in investigating this issue!
@beniwohli Unfortunately, FIN packets from the server are still present. You can see a TCP FIN originating from port 8200 followed by a SYN from the client. While this is not a problem in this case (as the FIN arrived after the HTTP response), there is still a risk (although much smaller) that the FIN packet may end up between an HTTP request and its response and cause this bug.
I would advise fixing the defaults on the server side as well. I have been running in production with read_timeout: 45s for a month now and have not seen any issues since.
@reveil awesome, thanks! As far as I understand, a FIN per se isn't a problem if it is only sent after the HTTP response was received. In this case, urllib3 will recognize that the connection was closed and recycle it.
I will go ahead and merge #426, even if it isn't a 100% fix for the problem. It seems to at least substantially reduce the number of connection issues. It'll take some time until a server side fix can be released (currently slated for 7.1, which is some time out).
Thanks for the change & apologies for the slow reply. I can confirm the changes also worked for me in my environment. Much appreciated!
Thanks again all for your help debugging this! I just released version 4.2.0 of the agent which contains #426 and a bunch of other goodies.
I'm running version 4.2.2 of elastic-apm with apm-server 6.7.1 and I'm experiencing this issue.
@seocam ugh, sorry to hear :( how often do you see the error?
We installed APM on our servers yesterday and have gotten the error 6 times in Sentry since then. Now that I'm checking, I realize that so far it has only affected Celery servers. Since yesterday we processed around 600 tasks, so it's about 1%.
@seocam Could you try running the APM server with a 45s read_timeout? This is a perfect workaround for me, at least until an APM server version containing the fix (https://github.com/elastic/apm-server/pull/2122) is released.
Just set it. I'll let you know the outcome! Thanks
Describe the bug: ... Events getting reported to APM-Server are randomly getting rejected. It appears that every so often the event getting reported is being met with a RemoteDisconnected exception.
To Reproduce
Expected behavior: ... No Exception
Environment (please complete the following information)
python:3
docker container
Additional context
downgrading to 4.0.3 remedies the issue.
Directly related: https://discuss.elastic.co/t/unable-to-reach-apm-server-connection-aborted-remotedisconnected-remote-end-closed-connection-without-response/166500
Agent config options
```
ELASTIC_APM = {
    'SERVICE_NAME': 'web-backend',
    'SECRET_TOKEN': '8149e743-2524-2524-2524-94aadd3fcde7',
    'SERVER_URL': 'http://192.168.16.100:8200',
    'DEBUG': True,
}
```
requirements.txt:
```
django-allauth
django-guardian
django-qr-code
django-icons
django-phonenumber-field
django-crispy-forms
django-autocomplete-light
django-money
django-anymail
django-debug-toolbar
django-extensions
django-countries
djangorestframework
psycopg2-binary
stripe
geoip2
redis
celery
sentry-sdk
googlemaps
django
pillow
phonenumbers
gunicorn
whitenoise
elastic-apm
```