Closed: joshzana closed this issue 3 years ago.
Hi,
How do you create/use the ConfigCat Client?
We strongly recommend using the ConfigCat client as a singleton object in your application. You may experience a similar issue on your local system if you evaluate config values frequently and initialize ConfigCat every time you evaluate a value, because then ConfigCat can't build and use its local cache.
Hi thanks for getting back to me!
We use a singleton approach. We have a wrapper with this code, which is called once at process startup:
```python
CONFIGCAT_CLIENT = None


def initialize(config: Config = Config()):
    if config.CONFIGCAT_API_KEY:
        global CONFIGCAT_CLIENT
        assert (
            CONFIGCAT_CLIENT is None
        ), "Illegal attempt to reinitialize feature flagging"
        CONFIGCAT_CLIENT = configcatclient.create_client(config.CONFIGCAT_API_KEY)
```
And we then use it like this:
```python
def get_flag_value_for_user(flag: FeatureFlag, user: User) -> bool:
    if CONFIGCAT_CLIENT:
        configcat_user = ConfigcatUser(identifier=user.id, email=user.email)
        return bool(CONFIGCAT_CLIENT.get_value(flag.name, False, configcat_user))
    else:
        return False
```
Thanks for the code. This singleton approach looks good. During the periods you mentioned, we don't see any degradation in our services.
Could you please share some more details?
In which location do you experience these exceptions?
Do you get a timeout if you simply try to call the CDN URL https://cdn-global.configcat.com/configuration-files/{your sdk api key}/config_v5.json directly?
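For example, here is a minimal stdlib-only way to check this from an affected host. The `config_url` and `probe` helpers are just a sketch for this thread, not part of the SDK, and `<your sdk api key>` is a placeholder:

```python
import time
import urllib.request

CDN_TEMPLATE = 'https://cdn-global.configcat.com/configuration-files/{}/config_v5.json'


def config_url(sdk_key):
    # Build the CDN URL for a given SDK key.
    return CDN_TEMPLATE.format(sdk_key)


def probe(sdk_key, timeout_seconds=10.0):
    # Fetch the config JSON once; returns elapsed seconds, or raises on timeout.
    start = time.monotonic()
    with urllib.request.urlopen(config_url(sdk_key), timeout=timeout_seconds) as resp:
        resp.read()
    return time.monotonic() - start

# Usage (replace the placeholder with a real SDK key):
#   print('fetched in %.2fs' % probe('<your sdk api key>'))
```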
Our location is AWS, US-West-2.
Note that we've made about 2 million successful requests to download this config json in the last few weeks, and only failed on <100 of them, so maybe this is below your threshold for monitoring?
In terms of hitting the CDN URL directly, I don't know, since I don't have a way to consistently reproduce this. My assumption based on the stack trace is that if I tried that during the time periods when we hit timeouts, it would also time out.
The last time we saw a similar timeout issue, the root cause was that the client opened many HTTP connections and hit the physical limits of the machine. In that case, we cannot see the issue on our side.
In your case, in the default auto polling mode the config is updated every 60 seconds, and get_value should fetch the data from the in-memory cache, so that shouldn't be the problem.
Are you calling force_refresh in your code?
Would it help if we handled the timeout exception here to generate fewer log lines and show a single-line error message instead of a long exception trace?
We have a troubleshooting page: https://test.configcat.com/docs/advanced/troubleshooting Maybe going through the general SDK checklist would help.
force_refresh is not being called in our Falkon code.
As far as things that might be helpful for us: fundamentally, I don't think we care if a single refresh of the config fails at the relatively low frequency these failures seem to occur at, but we will care if several in a row fail. In general we see these as a few-second blip, usually lasting less than a minute, so given that we only refresh every 60 seconds, we really just need a retry option.
Two proposals that could work for us: 1) Let us supply a number_of_errors_to_raise or a max_time_without_refresh argument when setting up the auto-refresh client, the idea being that the client would only throw an exception if it fails to refresh N times in a row (or after X seconds, depending on what you prefer). If the system self-heals on the second attempt, it's all good and we don't need an error at all.
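To illustrate proposal 1, here is a rough sketch of the behavior we're asking for. `RefreshGate` is a hypothetical wrapper, not ConfigCat API: it swallows transient refresh failures and only raises after N consecutive ones.

```python
class RefreshGate:
    """Sketch of proposal 1: swallow transient refresh errors and raise
    only after `number_of_errors_to_raise` consecutive failures."""

    def __init__(self, refresh, number_of_errors_to_raise=3):
        self._refresh = refresh             # callable that performs one refresh
        self._threshold = number_of_errors_to_raise
        self._consecutive_failures = 0

    def refresh(self):
        try:
            self._refresh()
            self._consecutive_failures = 0  # any success resets the counter
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                self._consecutive_failures = 0
                raise                       # only surface the Nth failure in a row
```

With this in place, a one-off blip between two successful refreshes never surfaces as an error.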
2) Going a level deeper: it looks like the code here: https://github.com/configcat/python-sdk/blob/2baad5ed9594140584a5cf7da6eaa2d5d3a0915a/configcatclient/configfetcher.py#L70
doesn't have any retry logic for issues like connection timeouts or other nominally retryable errors. If it auto-retried certain errors, or even just let us configure exponential back-off with some maximum number of retries before raising, that would probably achieve the same goal.
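And a sketch of proposal 2, again hypothetical rather than the SDK's actual fetcher: retry a fetch callable with exponential back-off and only re-raise once the retries are exhausted.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=0.5):
    """Sketch of proposal 2: retry a fetch callable with exponential
    back-off, re-raising only after `max_retries` failed attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries - 1:
                raise                                # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```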
Hey @josh-boehm,
We would like to avoid ignoring exceptions/errors in our SDKs. One possible solution: if you register your own logger class before importing configcatclient, you can filter the exceptions in the logger. Maybe something similar can help:
```python
import logging
import sys

from urllib3.exceptions import ConnectTimeoutError

# Setting the log level to INFO to show detailed feature flag evaluation.
logging.basicConfig(level=logging.INFO)


class InternalLogger(logging.Logger):
    MAX_EXCEPTION_COUNT = 3

    def __init__(self, name, level=logging.NOTSET):
        self._exception_count = 0
        super(InternalLogger, self).__init__(name, level)

    def exception(self, msg, *args, exc_info=True, **kwargs):
        if self.name == 'configcatclient.autopollingcachepolicy' \
                and sys.exc_info()[0] is ConnectTimeoutError:
            if self._exception_count < InternalLogger.MAX_EXCEPTION_COUNT:
                self._exception_count += 1
                return  # ignore exception
            self._exception_count = 0
        return super(InternalLogger, self).exception(msg, *args, exc_info=exc_info, **kwargs)


logging.setLoggerClass(InternalLogger)

import configcatclient  # must be imported after setLoggerClass

if __name__ == '__main__':
    # Initialize the ConfigCatClient with an SDK Key.
    client = configcatclient.create_client('<sdk_key>')
```
We (Falkon AI) are a new user and run a very low-traffic service with around 6 pods in AWS EKS clusters. We use FastAPI on Python 3.8.6. Each pod has ConfigCat 5.0.0 set up using the default caching policy. Our logging and monitoring systems show occasional blips of time when ConfigCat seems unreachable, but https://status.configcat.com does not show any outage.
Some specific timestamps where we've seen events are:
When this happens, we get a spew of 4 exceptions like the following:
```
[2020-12-29 06:59:52,920] {autopollingcachepolicy.py:97} ERROR -
```
Questions on this: