Open sliwinski-milosz opened 2 years ago
Routing to @getsentry/owners-ingest for triage. ⏲️
I believe this is a longer-standing issue with the Python SDK, but the web platform folks may know whether we have similar issues in other SDKs (in that case the ingest team should investigate it).
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!
"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
I think this issue is a major one and has to be fixed in the near future.
Are there any updates on this specific issue? The idea shared here was to expose socket_options.
@sl0thentr0py could you please tell me if that is still the preferred way to approach this issue, or maybe you have some other ideas? :)
closing in favor of https://github.com/getsentry/sentry-python/issues/2391
we encountered similar issues while trying out the new metrics stuff in prod so i'm reopening
exception: "Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/sentry_sdk/transport.py", line 532, in send_envelope_wrapper
self._send_envelope(envelope)
File "/usr/local/lib/python3.8/site-packages/sentry_sdk/transport.py", line 431, in _send_envelope
self._send_request(
File "/usr/src/sentry/src/sentry/utils/sdk.py", line 229, in patched_send_request
return _send_request(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/sentry_sdk/transport.py", line 242, in _send_request
response = self._pool.request(
File "/usr/local/lib/python3.8/site-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/usr/local/lib/python3.8/site-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.8/http/client.py", line 277, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))"
ok so first thing, our setting `num_pools` to 2 in the `PoolManager` is entirely useless because the pool is per host, and for a single DSN we will always use the same connection pool.
```
(venv) ➜ vanilla git:(main) ✗ python app.py
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
using conn HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
```
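(For reference, a minimal sketch of the same check using plain urllib3 rather than the SDK transport; it just shows that pools are keyed by host, so `num_pools` is irrelevant for a single DSN:)

```python
# minimal sketch (plain urllib3, not the SDK transport): pools are keyed by
# (scheme, host, port), so a single DSN host always maps to the same pool,
# no matter what num_pools is set to
import urllib3

pm = urllib3.PoolManager(num_pools=2)
url = "https://o447951.ingest.sentry.io/api/5461230/envelope/"

pool_a = pm.connection_from_url(url)
pool_b = pm.connection_from_url(url)

print(pool_a)            # HTTPSConnectionPool(host='o447951.ingest.sentry.io', port=443)
print(pool_a is pool_b)  # True: the same pool is reused for every request to this host
```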
`maxsize` will not help either, since the background worker only picks up one request at a time, so we're basically always using the same connection from the same connection pool:

```
Starting new HTTPS connection (1): o447951.ingest.sentry.io:443
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 41
put back conn: <urllib3.connection.HTTPSConnection object at 0x105f77220>
```
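(Roughly, the transport behaves like a single consumer thread draining a queue, along the lines of this sketch; this is an illustration only, not the SDK's actual worker code:)

```python
# illustration only (not the SDK's real background worker): a single thread
# drains a queue one envelope at a time, so at most one connection is ever
# checked out of the pool and maxsize > 1 never comes into play
import queue
import threading
import urllib3

pm = urllib3.PoolManager(num_pools=2, maxsize=10)
tasks = queue.Queue()

def worker():
    while True:
        envelope = tasks.get()
        if envelope is None:
            break
        # one request at a time -> the same pooled connection gets reused
        pm.request("POST", "https://o447951.ingest.sentry.io/api/5461230/envelope/", body=envelope)
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
```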
Next step: set `SO_KEEPALIVE` for the socket as per this and try it out in prod.

I tried the scenario "do some stuff, wait 15 min (more than the LB timeout), do some stuff again", and in this case the connection RESETS GRACEFULLY, as seen in the logs:
```
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 0
[sentry] DEBUG: Sending envelope [envelope with 2 items (error, attachment)] project:5461230 host:o447951.ingest.sentry.io
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 0
[sentry] DEBUG: Flushing HTTP transport
[sentry] DEBUG: background worker got flush request
[sentry] DEBUG: Sending envelope [envelope with 2 items (error, attachment)] project:5461230 host:o447951.ingest.sentry.io
Resetting dropped connection: o447951.ingest.sentry.io. <--------------------
[sentry] DEBUG: 11 event(s) pending on flush
https://o447951.ingest.sentry.io:443 "POST /api/5461230/envelope/ HTTP/1.1" 200 0
```
Hello!
It is great to see that you are looking into this issue!
This is the workaround that I am currently using:
```python
import socket

from sentry_sdk import transport
from sentry_sdk.transport import HttpTransport
from urllib3.connection import HTTPConnection


class KeepAliveHttpTransport(HttpTransport):
    def _get_pool_options(self, ca_certs):
        opts = super()._get_pool_options(ca_certs)
        # enable TCP keepalive: start probing after 45s of idleness, probe
        # every 10s, and give up after 6 failed probes
        opts['socket_options'] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
            (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
            (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
            (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
        ]
        return opts


def fix_connection_resets_on_sentry():
    transport.HttpTransport = KeepAliveHttpTransport
```
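A hypothetical usage sketch of the patch above (the monkeypatch has to run before `sentry_sdk.init()` so the SDK picks up the patched transport class; since this leans on SDK internals, it may break between versions):

```python
import sentry_sdk

# hypothetical usage: swap in the keep-alive transport before the SDK builds it
fix_connection_resets_on_sentry()
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
)
```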
So basically I am injecting the `socket_options` into the Sentry internals. Of course I discourage using that approach, as it relies on Sentry internals and may stop working at any time.
Still, I am sending it to let you know that with the above workaround I reached an acceptable reliability level, meaning that so far I have not found any missing event. However, note that I have already disabled the Sentry debug logs and I am not looking very closely at this anymore, which means there could be some misses I overlooked. I guess I will know once I start using those new metrics you mentioned.
Long story short: there is a chance that updating the `socket_options` will improve reliability significantly 👍
But there is also bad news. I have been investigating `RemoteDisconnected` errors for calls between Python services.
I have not monitored the same for calls between the services and the Sentry server, but I guess those are affected as well.
When we use a requests `Session` object for handling calls, the above `RemoteDisconnected` error will happen from time to time and cannot be prevented entirely.
I know this project uses `urllib3` for its calls, but if it uses persistent connections it is affected as well (correct me if I am wrong).
From my observations this problem happens once every 200k-300k calls, which does not make it a major issue. Still, I am letting you know so you are aware of the problem. It is currently impossible to achieve a high level of reliability while using persistent connections.
A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the “idle” connection. From the server’s point of view, the connection is being closed while it was idle, but from the client’s point of view, a request is in progress. This means that clients, servers, and proxies MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2). Non-idempotent methods or sequences MUST NOT be automatically retried, although user agents MAY offer a human operator the choice of retrying the request(s). Confirmation by user-agent software with semantic understanding of the application MAY substitute for user confirmation. The automatic retry SHOULD NOT be repeated if the second sequence of requests fails.
There are 2 ways to mitigate the problem: retry the failed requests (which is only safe when they are idempotent), or keep the connection alive via TCP keepalive socket options.
More details: explanation of why requests cannot be automatically retried unless they are idempotent
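For the retry side, a generic urllib3 sketch (not Sentry-specific, and assuming urllib3 1.26+); by default `Retry` only re-sends idempotent methods, which is exactly why it doesn't help with the SDK's envelope POSTs:

```python
# generic sketch of the "retry" mitigation with urllib3 1.26+: the default
# allowed_methods covers only idempotent verbs (GET, HEAD, PUT, DELETE, ...),
# so non-idempotent POSTs such as envelope submissions are not retried
import urllib3
from urllib3.util.retry import Retry

retries = Retry(total=3, backoff_factor=0.5)  # defaults: retry idempotent methods only
http = urllib3.PoolManager(retries=retries)
resp = http.request("GET", "https://example.com/")  # retried up to 3 times on connection errors
print(resp.status)
```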
ty @sliwinski-milosz for the detailed explanation! Most of what you said is consistent with my research so far. Retries are not an option for us either. We will increase throughput through the SDK with the new metrics offering, so we need to make sure there is a certain level of reliability.
Hey folks, we've just released SDK version 1.41.0, which includes support for setting socket options directly via an `init` option without the need for a custom transport, so the workaround above can now be written as:
```python
import socket

from urllib3.connection import HTTPConnection

import sentry_sdk

sentry_sdk.init(
    # ...your usual settings...
    socket_options=HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        # note: skip the following line if you're on macOS since TCP_KEEPIDLE doesn't exist there
        (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
        (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
        (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
    ],
)
```
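If you need this to work across platforms, a variant that only adds the constants available locally might look like this (my own sketch, not from the SDK docs):

```python
import socket

from urllib3.connection import HTTPConnection

import sentry_sdk

# only add the keepalive knobs that exist on this platform
# (e.g. TCP_KEEPIDLE is missing on macOS)
keepalive_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
for name, value in (("TCP_KEEPIDLE", 45), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 6)):
    if hasattr(socket, name):
        keepalive_options.append((socket.SOL_TCP, getattr(socket, name), value))

sentry_sdk.init(
    # ...your usual settings...
    socket_options=HTTPConnection.default_socket_options + keepalive_options,
)
```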
Environment
SaaS (https://sentry.io/)
Version
No response
Steps to Reproduce
I tried to debug this issue; however, I am not 100% sure what the root cause is. I think the problem is related to https://github.com/urllib3/urllib3/issues/944#issuecomment-237471302. I have applied the following workaround and it fixed the problem: https://github.com/psf/requests/issues/4937#issuecomment-788899804
It seems that for the Python client, the keep-alive option is disabled by default. Maybe this is causing the connection problems. Even though this looks like a Python client issue, and the workaround on the client side worked fine, I wonder if there is a way to fix this on the server side. This is why I am opening the issue here.
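To illustrate the default on the client side: a plain Python TCP socket has `SO_KEEPALIVE` turned off unless it is set explicitly, e.g.:

```python
# quick check: a freshly created Python TCP socket has SO_KEEPALIVE disabled
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # 0 -> keepalive off by default
sock.close()
```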
Maybe changing the idle connection timeout on the server side would help. I would appreciate your expertise on that part.
It is not easy to reproduce this error. I tried to reproduce it from my local machine by sending events every x seconds (1, 10, 60, 300, 1800). I have reproduced it only twice, using an 1800-second delay between two events; often, even with 1800 seconds it worked fine. After enabling DEBUG logs on the client side, I realised that this issue happens a few times a day for my service. I am happy to gather some more logs if needed.
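A rough sketch of the reproduction attempt described above (the DSN is a placeholder, and the delay shown is the 1800-second case that triggered it for me):

```python
# rough reproduction sketch: send an event, stay idle longer than the load
# balancer's idle timeout, then send another event and watch the debug logs
import time

import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    debug=True,
)

sentry_sdk.capture_message("event before idle period")
time.sleep(1800)  # longer than the load balancer idle timeout
sentry_sdk.capture_message("event after idle period")
sentry_sdk.flush()
```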
Below I am sending full Python traceback logs for the issue:
Dependency versions on the client side:
Expected Result
Client is able to successfully send all events to the server.
Actual Result
Due to connectivity problems, some events fail on the client side.