ConnectTimeoutError being raised in autopollingcachepolicy.py

joshzana commented 3 years ago

We (Falkon AI) are a new user and run a very low-traffic service with around 6 pods in AWS EKS clusters. We use FastAPI on Python 3.8.6. Each pod has Configcat 5.0.0 set up using the default caching policies. Our logging and monitoring systems are showing occasional blips of time when Configcat seems unreachable, but https://status.configcat.com does not show any outage.

Some specific timestamps where we've seen events are:

12/29/2020 06:59:52 UTC
12/28/2020 19:49:29 UTC
12/28/2020 02:56:23 UTC
12/27/2020 15:46:02 UTC
12/25/2020 18:41:13 UTC

When this happens, we get a spew of 4 exceptions like the following:


```
[2020-12-29 06:59:52,920] {autopollingcachepolicy.py:97} ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 309, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 164, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (, 'Connection to cdn-global.configcat.com timed out. (connect timeout=10)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-global.configcat.com', port=443): Max retries exceeded with url: /configuration-files/REDACTED/config_v5.json (Caused by ConnectTimeoutError(, 'Connection to cdn-global.configcat.com timed out. (connect timeout=10)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/configcatclient/autopollingcachepolicy.py", line 73, in force_refresh
    configuration_response = self._config_fetcher.get_configuration_json(force_fetch)
  File "/usr/local/lib/python3.8/site-packages/configcatclient/configfetcher.py", line 81, in get_configuration_json
    response = requests.get(uri, headers=headers, timeout=(10, 30),
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='cdn-global.configcat.com', port=443): Max retries exceeded with url: /configuration-files/REDACTED/config_v5.json (Caused by ConnectTimeoutError(, 'Connection to cdn-global.configcat.com timed out. (connect timeout=10)'))
```

Questions on this:

Do those time periods line up with any partial Configcat outages or are they just random connectivity issues?
Is there anything unique to our setup that could be causing this kind of exception or does this happen to everyone?
Is there anything you can do to let users tone down the huge amount of log output on a connect timeout? I don't want to disable all of ConfigCat's logging, but the log.exception call seems to be the culprit: https://github.com/configcat/python-sdk/blob/master/configcatclient/autopollingcachepolicy.py#L97

kp-cat commented 3 years ago

Hi,

How do you create/use the ConfigCat Client?

We strongly recommend using the ConfigCat client as a singleton object in your application. You may experience a similar issue on your local system if you evaluate config values frequently and initialize ConfigCat on every evaluation, because ConfigCat then can't build and use its local cache.

joshzana commented 3 years ago

Hi, thanks for getting back to me!

We use a singleton approach. We have a wrapper with this code, which is called once at process startup:

CONFIGCAT_CLIENT = None

def initialize(config: Config = Config()):
    if config.CONFIGCAT_API_KEY:
        global CONFIGCAT_CLIENT
        assert (
            CONFIGCAT_CLIENT is None
        ), "Illegal attempt to reinitialize feature flagging"
        CONFIGCAT_CLIENT = configcatclient.create_client(config.CONFIGCAT_API_KEY)

And we then use it like this:

def get_flag_value_for_user(flag: FeatureFlag, user: User) -> bool:
    if CONFIGCAT_CLIENT:
        configcat_user = ConfigcatUser(identifier=user.id, email=user.email)
        return bool(CONFIGCAT_CLIENT.get_value(flag.name, False, configcat_user))
    else:
        return False

kp-cat commented 3 years ago

Thanks for the code. This singleton approach looks good. We don't see any degradation in our services during the periods you mentioned.

Could you please share some more details? In which location do you experience these exceptions? Do you get a timeout if you simply call the CDN URL https://cdn-global.configcat.com/configuration-files/{your sdk api key}/config_v5.json directly?
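One way to run that check from inside an affected pod is a small probe like the one below. This is only a sketch using the Python standard library: the `probe` function and its `opener` parameter are hypothetical names introduced here (the latter just makes the function testable without network access), and the 10-second timeout mirrors the connect timeout shown in the traceback.

```python
import time
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch `url` once and report (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, time.monotonic() - start, repr(exc)
    return True, time.monotonic() - start, "HTTP %d" % status
```

Calling `probe("https://cdn-global.configcat.com/configuration-files/{your sdk api key}/config_v5.json")` in a loop while the timeouts are occurring would show whether the failure is specific to the SDK or affects any connection from that pod.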

joshzana commented 3 years ago

Our location is AWS, US-West-2.

Note that we've made about 2 million successful requests to download this config json in the last few weeks, and only failed on <100 of them, so maybe this is below your threshold for monitoring?

In terms of hitting the CDN URL directly, I don't know, since I don't have a way to consistently reproduce this. My assumption based on the stack trace is that if I tried it during the periods when we hit timeouts, it would also time out.

kp-cat commented 3 years ago

The last time we saw a similar timeout issue, the root cause was that the client had many open HTTP resources and hit the physical limit of the machine. In a case like that, we cannot see the issue on our side.

In your case, with the default auto polling mode, the config is refreshed every 60 seconds and get_value should read the data from the in-memory cache, so that shouldn't be the problem.

Are you calling force_refresh in your code?

Would it help if we handled the timeout exception here, so that it generates less log output and shows a single-line error message instead of a long exception trace?
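Until something like that exists in the SDK, a similar effect can probably be had on the application side with a standard `logging.Filter`. The sketch below is an assumption about how one might do it, not SDK code: `OneLineExceptionFilter` and `install` are hypothetical names, and the filter simply replaces the traceback with a single `ExceptionName: message` line for the exception types you pass in.

```python
import logging


class OneLineExceptionFilter(logging.Filter):
    """Collapse tracebacks for selected exception types into one line."""

    def __init__(self, exc_types):
        super().__init__()
        self._exc_types = tuple(exc_types)

    def filter(self, record):
        exc_info = record.exc_info
        if exc_info and exc_info[0] is not None and issubclass(exc_info[0], self._exc_types):
            # Replace the message and drop the traceback for matching exceptions.
            record.msg = "%s: %s"
            record.args = (exc_info[0].__name__, exc_info[1])
            record.exc_info = None
            record.exc_text = None
        return True  # never drop the record, only shorten it


def install(logger_name, exc_types):
    """Attach the filter to the named logger and return it."""
    flt = OneLineExceptionFilter(exc_types)
    logging.getLogger(logger_name).addFilter(flt)
    return flt
```

For this issue the call would presumably be `install('configcatclient.autopollingcachepolicy', [requests.exceptions.ConnectTimeout])`, since that is the logger and the outermost exception type shown in the traceback. Other exceptions keep their full trace.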

We have a troubleshooting page: https://test.configcat.com/docs/advanced/troubleshooting Maybe going through the general SDK checklist would help.

josh-boehm commented 3 years ago

force_refresh is not being called in our Falkon code.

As for what would be helpful for us: fundamentally, we don't really care if a single refresh of the config fails at the relatively low frequency we are seeing, but we will care if several in a row fail. These failures show up as a blip of a few seconds, generally lasting less than a minute. Given that we only refresh every 60 seconds, we really just need a retry option.

Two proposals that could work for us:

1) Let us supply a number_of_errors_to_raise or a max_time_without_refresh argument when setting up the auto-refresh client, so that the client only throws an exception if it fails to refresh N times in a row (or for X seconds, depending on what you prefer). If the system self-heals on the second attempt, it's all good and we don't need to get an error at all.

2) Going a level deeper - it looks like the code here: https://github.com/configcat/python-sdk/blob/2baad5ed9594140584a5cf7da6eaa2d5d3a0915a/configcatclient/configfetcher.py#L70

doesn't have any retry logic for connection timeouts or other nominally retryable errors. If it auto-retried certain errors, or if we could tell it to do exponential back-off with some maximum number of retries before raising, that would probably achieve the same goal.
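To make the two proposals concrete, here is a rough sketch of what we have in mind. None of this is existing SDK API: `ConsecutiveFailureTracker`, `call_with_retries` and all of their parameters are hypothetical names, and `TimeoutError` stands in for whatever retryable exception the fetcher actually raises.

```python
import time


class ConsecutiveFailureTracker:
    """Proposal 1: report a problem only after `threshold` failures in a row."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    def record_success(self):
        self.failures = 0  # any success ends the failure streak

    def record_failure(self):
        """Return True once the failure streak reaches the threshold."""
        self.failures += 1
        return self.failures >= self.threshold


def call_with_retries(fetch, max_attempts=3, base_delay=0.5,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Proposal 2: call fetch(), retrying retryable errors with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

Either one alone would cover our case: the tracker hides isolated failures from the log, while the retry wrapper hides them by succeeding on a later attempt within the same poll.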

kp-cat commented 3 years ago

Hey @josh-boehm,

We would like to avoid ignoring exceptions/errors in our SDKs. One solution is to register your own logger class before importing configcatclient, so that you can filter the exceptions in the logger. Something like this may help:

import logging
import sys

from requests.exceptions import ConnectTimeout

# Setting the log level to Info to show detailed feature flag evaluation.
logging.basicConfig(level=logging.INFO)

class InternalLogger(logging.Logger):
    # Number of consecutive connect timeouts to swallow before logging one.
    MAX_EXCEPTION_COUNT = 3

    def __init__(self, name, level=logging.NOTSET):
        super(InternalLogger, self).__init__(name, level)
        self._exception_count = 0

    def exception(self, msg, *args, exc_info=True, **kwargs):
        exc_type = sys.exc_info()[0]
        # The traceback in this issue ends in requests.exceptions.ConnectTimeout,
        # which wraps urllib3's ConnectTimeoutError, so match on the requests type.
        if (self.name == 'configcatclient.autopollingcachepolicy'
                and exc_type is not None and issubclass(exc_type, ConnectTimeout)):
            if self._exception_count < InternalLogger.MAX_EXCEPTION_COUNT:
                self._exception_count += 1
                return  # ignore exception
            self._exception_count = 0

        return super(InternalLogger, self).exception(msg, *args, exc_info=exc_info, **kwargs)

logging.setLoggerClass(InternalLogger)

import configcatclient

if __name__ == '__main__':
    # Initialize the ConfigCatClient with an SDK Key.
    client = configcatclient.create_client('<sdk_key>')

configcat / python-sdk

ConnectTimeoutError being raised in autopollingcachepolicy.py #20