exceptorr closed this 4 years ago
1. There is some delay (~1 min) between the ConfigMap change (read: after Juju applies the new pod spec, so the ConfigMap has actually changed) and the new config file actually being rendered on the filesystem. Therefore, calling `/-/reload` immediately in the `config-changed` hook makes no sense, since the new config has not arrived yet. We still need to think about how to wait for the new config file, or how to make the process faster (e.g. why does the config file arrive with such a noticeable delay? Do we have any other option to deliver the config file to the service container?)
I wonder if it has something to do with this: https://kubernetes.io/docs/concepts/configuration/configmap/#mounted-configmaps-are-updated-automatically
If the delay is caused by some setting in the underlying k8s cluster, then we'll have to take that into consideration, and maybe keep polling until we detect that the file has finally been updated before telling Prometheus to reload.
2. Failing unit test:
```
______________________ OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready ______________________

self = <charm_test.OnConfigChangedHandlerTest testMethod=test__it_blocks_until_pod_is_ready>
mock_stored_state_cls = <MagicMock name='StoredState' spec_set='StoredState' id='4358979408'>
mock_pod_spec = <function set_juju_pod_spec at 0x103d3b560>
mock_juju_pod_spec = <function build_juju_pod_spec at 0x103d5ae60>
mock_time = <NonCallableMagicMock name='time' spec_set='module' id='4358998224'>
mock_k8s_mod = <NonCallableMagicMock name='k8s' spec_set='module' id='4359109776'>
mock_build_juju_unit_status_func = <function build_juju_unit_status at 0x103d5add0>

    @patch('charm.build_juju_unit_status', spec_set=True, autospec=True)
    @patch('charm.k8s', spec_set=True, autospec=True)
    @patch('charm.time', spec_set=True, autospec=True)
    @patch('charm.build_juju_pod_spec', spec_set=True, autospec=True)
    @patch('charm.set_juju_pod_spec', spec_set=True, autospec=True)
    @patch('charm.StoredState', spec_set=True, autospec=True)
    def test__it_blocks_until_pod_is_ready(
            self,
            mock_stored_state_cls,
            mock_pod_spec,
            mock_juju_pod_spec,
            mock_time,
            mock_k8s_mod,
            mock_build_juju_unit_status_func):
        # Setup
        mock_fw_adapter_cls = \
            create_autospec(framework.FrameworkAdapter, spec_set=True)
        mock_fw_adapter = mock_fw_adapter_cls.return_value

        mock_juju_unit_states = [
            MaintenanceStatus(str(uuid4())),
            MaintenanceStatus(str(uuid4())),
            ActiveStatus(str(uuid4())),
        ]
        mock_build_juju_unit_status_func.side_effect = mock_juju_unit_states

        mock_event_cls = create_autospec(EventBase, spec_set=True)
        mock_event = mock_event_cls.return_value

        mock_state = mock_stored_state_cls.return_value
>       mock_state.set_default(is_started=False)

test/charm_test.py:113:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <NonCallableMagicMock name='StoredState()' spec_set='StoredState' id='4355493072'>, name = 'set_default'

    def __getattr__(self, name):
        if name in {'_mock_methods', '_mock_unsafe'}:
            raise AttributeError(name)
        elif self._mock_methods is not None:
            if name not in self._mock_methods or name in _all_magics:
>               raise AttributeError("Mock object has no attribute %r" % name)
E               AttributeError: Mock object has no attribute 'set_default'

/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/mock.py:598: AttributeError
```
This is due to the fact that `framework.StoredState` doesn't statically define a `set_default` method. Rather, it uses some metaprogramming to set it, which `@patch` doesn't see, so it's telling you the method doesn't exist on the mocked object. You could set `spec_set=False`, but that runs the risk of your mock's method signatures going out of sync with the methods of the real object.
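The failure is easy to reproduce outside the charm; here's a minimal, self-contained illustration (the `Dynamic` class is a stand-in for `StoredState`, not the real framework code):

```python
from unittest.mock import create_autospec

class Dynamic:
    """Stand-in for StoredState: attributes resolved only at runtime."""
    def __getattr__(self, name):
        if name == "set_default":
            return lambda **kwargs: None
        raise AttributeError(name)

# The real object happily handles set_default...
Dynamic().set_default(is_started=False)

# ...but autospec only sees statically defined attributes, so the
# spec'd mock raises AttributeError for the very same call.
mock = create_autospec(Dynamic, spec_set=True, instance=True)
try:
    mock.set_default(is_started=False)
except AttributeError as exc:
    print(exc)  # Mock object has no attribute 'set_default'
```

This is exactly the trade-off mentioned above: autospec protects signatures but cannot follow `__getattr__`-style metaprogramming.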
Still in progress and missing some unit tests, but at least now it waits until the ConfigMap is applied to the pod.
@relaxdiego mind reviewing, while I'm working on the tests?
I modified your code slightly to use `event.defer()` as follows:

```python
if not state.is_started:
    state.is_started = True
else:
    logging.debug("Call HTTP config reload")
    try:
        fw_adapter.set_unit_status(MaintenanceStatus(
            "Waiting for config parameters propagation"
        ))
        expected_config = build_prometheus_config(fw_adapter.get_config())
        logging.debug(
            "Awaiting Prom config to become: {0}".format(expected_config)
        )
        # Wait until the ConfigMap changes are propagated
        prometheus_config_reload(juju_model, juju_app)
        new_config_applied = prometheus_ensure_config_propagated(
            juju_model, juju_app, expected_config
        )
        if not new_config_applied:
            logger.debug("**** Config not propagated. DEFERRING")
            logger.debug("**** Config not propagated. DEFERRING")
            logger.debug("**** Config not propagated. DEFERRING")
            event.defer()
        else:
            logging.debug("Config reloaded")
            fw_adapter.set_unit_status(ActiveStatus())
    except CharmError as e:
        error_msg = "Config reload error: {0}".format(e)
        logging.error(error_msg)
        fw_adapter.set_unit_status(BlockedStatus(error_msg))
```
It works, BUT there is (AT LEAST) a 3-minute delay:
```
application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
```
```
application-prometheus: 22:50:34 DEBUG unit.prometheus/0.juju-log Building Juju pod spec
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Build prom config: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Rendered CLI args: --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.console.templates=/usr/share/prometheus/consoles --web.console.libraries=/usr/share/prometheus/console_libraries --log.level=info --web.page-title="Charmed LMA - Prometheus TSDB, Collection & Processing" --web.max-connections=512 --storage.tsdb.retention.time=15d --alertmanager.notification-queue-capacity=10000 --alertmanager.timeout=10s
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Configuring pod: set PodSpec to: {'containers': [{'name': 'prometheus', 'imageDetails': {'imagePath': 'prom/prometheus:v2.18.1', 'username': '', 'password': ''}, 'args': ['--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.path=/prometheus', '--web.enable-lifecycle', '--web.console.templates=/usr/share/prometheus/consoles', '--web.console.libraries=/usr/share/prometheus/console_libraries', '--log.level=info', '--web.page-title="Charmed LMA - Prometheus TSDB, Collection & Processing"', '--web.max-connections=512', '--storage.tsdb.retention.time=15d', '--alertmanager.notification-queue-capacity=10000', '--alertmanager.timeout=10s'], 'ports': [{'containerPort': 9090, 'protocol': 'TCP'}], 'readinessProbe': {'httpGet': {'path': '/-/ready', 'port': 9090}, 'initialDelaySeconds': 10, 'timeoutSeconds': 30}, 'livenessProbe': {'httpGet': {'path': '/-/healthy', 'port': 9090}, 'initialDelaySeconds': 30, 'timeoutSeconds': 30}, 'files': [{'name': 'config', 'mountPath': '/etc/prometheus', 'files': {'prometheus.yml': "alerting: {}\nglobal:\n evaluation_interval: 1m\n external_labels: {}\n scrape_interval: 100s\n scrape_timeout: 10s\nscrape_configs:\n- honor_timestamps: true\n job_name: prometheus\n metrics_path: /metrics\n scheme: http\n scrape_interval: 5s\n scrape_timeout: 5s\n static_configs:\n - targets: ['localhost:9090']\n"}}]}]}
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Checking k8s pod readiness
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log GET kubernetes.default.svc//api/v1/namespaces/lma/pods?labelSelector=juju-app=prometheus
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Received k8s pod status: <adapters.k8s.PodStatus object at 0x7fed2628a4a8>
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Built unit status: ActiveStatus('')
application-prometheus: 22:50:36 DEBUG unit.prometheus/0.juju-log Call HTTP config reload
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Build prom config: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Awaiting Prom config to become: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Calling Prom API: POST /-/reload
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Calling Prom API: GET /api/v1/status/config
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Expected: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log No external_labels received, adding empty key
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log No alerting received, adding empty key
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Received from API: {'global': {'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m', 'external_labels': {}}, 'scrape_configs': [{'job_name': 'prometheus', 'honor_timestamps': True, 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Config reloaded
```
A 3-minute delay is not ideal, so I will try the sidecar container approach for the rest of today.
Referencing the blocker for this PR: https://bugs.launchpad.net/juju/+bug/1882976
Status so far: I tried using JujuPodSpec V2 as described here, because it has a feature that lets you independently define a ConfigMap resource. I was hoping to use that in both the main container and the sidecar container. HOWEVER, there seems to be no way to mount that independent ConfigMap into any of the containers. At least, that's what I can tell from the V2 spec's code. Trying a different approach.
The sidecar container approach is not worth the effort and is too much of a hack. Abandoning that. I would suggest just adding an action that tells prometheus to reload the config.
@exceptorr just so you know, I was able to get a k8s Job to call `/-/reload`. Below is the manifest. For now I've just been applying it to k8s using `microk8s.kubectl apply -f <path-to-below-manifest>`, but a tiny modification to the k8s adapter can replicate that manual process.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: expected-config-yaml
data:
  expected-config.yml: |
    ---
    global:
      scrape_interval: '1m'
      scrape_timeout: '10s'
      evaluation_interval: '1m'
    scrape_configs:
    - job_name: 'prometheus'
      honor_timestamps: True
      scrape_interval: '5s'
      scrape_timeout: '5s'
      metrics_path: '/metrics'
      scheme: 'http'
      static_configs:
      - targets:
        - 'localhost:9090'
    - job_name: 'some-other-service'
      honor_timestamps: True
      scrape_interval: '5s'
      scrape_timeout: '5s'
      metrics_path: '/metrics'
      scheme: 'http'
      static_configs:
      - targets:
        - 'alertmanager:9093'
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: reload-config-script
data:
  reload-config.py: |
    import http.client
    import json
    import logging
    import sys
    import yaml
    import time

    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    logger = logging.getLogger()

    def get_current_config():
        logger.debug("Getting current Prometheus config")
        conn = http.client.HTTPConnection('prometheus:9090')
        conn.request('GET', '/api/v1/status/config')
        response = conn.getresponse()
        logger.debug("Get response.status={}".format(response.status))
        if response.status < 200 or response.status >= 300:
            logger.error("Failed to get config from Prometheus")
            sys.exit(1)
        payload = json.loads(response.read())
        return yaml.safe_load(payload['data']['yaml'])

    current_config = get_current_config()
    logger.debug(current_config)

    logger.debug("Loading expected config")
    with open('/opt/etc/expected-config.yml', 'r') as file:
        expected_config = yaml.safe_load(file)
    logger.debug(expected_config)

    current_config_is_synced = current_config == expected_config
    while not current_config_is_synced:
        logger.info("Current config does not match expected config")
        logger.debug("Reloading prometheus config")
        conn = http.client.HTTPConnection('prometheus:9090')
        conn.request('POST', '/-/reload')
        response = conn.getresponse()
        if response.status < 200 or response.status >= 300:
            logger.error("Got unexpected response from Prometheus API")
            logger.debug(response.getheaders())
        time.sleep(5)
        current_config_is_synced = get_current_config() == expected_config

    logger.info("Current config matches expected config. Exiting")
    sys.exit(0)
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    job-name: reload-prometheus-config
  name: reload-prometheus-config
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  ttlSecondsAfterFinished: 3600
  selector:
    matchLabels:
      job-name: reload-prometheus-config
  template:
    metadata:
      labels:
        job-name: reload-prometheus-config
    spec:
      containers:
      - image: jujusolutions/jujud-operator:2.7.6
        volumeMounts:
        - name: reload-config-script
          mountPath: /opt/bin
        - name: expected-config-yaml
          mountPath: /opt/etc
        command:
        - /usr/bin/python3
        args:
        - '/opt/bin/reload-config.py'
        imagePullPolicy: IfNotPresent
        name: reload-prometheus-config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      volumes:
      - name: reload-config-script
        configMap:
          name: reload-config-script
      - name: expected-config-yaml
        configMap:
          name: expected-config-yaml
      restartPolicy: Never
      terminationGracePeriodSeconds: 300
```
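As a rough sketch of what that "tiny modification to the k8s adapter" might look like, the adapter could build the Job-creation request itself. The helper name and call shape below are assumptions, not the charm's real API; the actual adapter would also attach the ServiceAccount token and CA certificate when issuing the HTTP call:

```python
import json

def build_job_request(namespace, manifest):
    """Return (method, path, body) for creating a Job via the k8s API.

    Hypothetical helper, mirroring the raw-HTTP style the charm's
    adapters.k8s module already uses for pod status lookups.
    """
    path = "/apis/batch/v1/namespaces/{}/jobs".format(namespace)
    return "POST", path, json.dumps(manifest)

method, path, body = build_job_request("lma", {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "reload-prometheus-config"},
})
print(method, path)  # POST /apis/batch/v1/namespaces/lma/jobs
```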
Found a quicker way to reload the config:

```shell
ubuntu@node10:~$ kubectl rollout restart statefulset/prometheus -n lma16
statefulset.apps/prometheus restarted
```

That basically restarts every application pod, so it reloads the ConfigMap. The only thing left to verify is whether that can work in the scope of `config-changed` (e.g. whether the rollout restart can be invoked in `config-changed`, exiting the latter immediately after the invocation).
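For reference, `kubectl rollout restart` is essentially sugar for patching the pod template with a timestamp annotation, which the StatefulSet controller treats as a template change and rolls the pods. A sketch of the equivalent patch body the charm could send itself (the timestamp value is illustrative):

```python
import json
from datetime import datetime, timezone

def rollout_restart_patch(now=None):
    """Build the strategic-merge patch that `kubectl rollout restart` sends."""
    now = now or datetime.now(timezone.utc).isoformat()
    return json.dumps({
        "spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": now,
        }}}},
    })

print(rollout_restart_patch("2020-06-15T00:00:00+00:00"))
```

Sending this PATCH to the StatefulSet from the k8s adapter would avoid shelling out to `kubectl`.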
@exceptorr I just realized: if you run this within the context of `config-changed`, doesn't that mean that the Juju pod spec, and hence the ConfigMap too, has not been updated yet?
@relaxdiego You are right, but here's a plan:

1. The charm issues a k8s API call to initiate the rollout restart (in the scope of `config-changed`).
2. The charm sets a `StoredState.reload-verify` flag and does a `.defer()` (in the scope of `config-changed`).
3. `config-changed` exits -> the ConfigMap updates.
4. The restarted pods fetch the new ConfigMap.
5. If something has been missed, the deferred event reloads the config from the updated ConfigMap, since we are no longer in the `config-changed` context and the ConfigMap has been updated.

Moreover, the `rollout restart` is detected by Juju itself (e.g. app units will not be in the `active/idle` state), so the operator will be aware of the re-configuration process in progress.

What do you think?
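A self-contained sketch of that control flow. Everything here (the `reload_verify` name, the stub `State` class, and the injected callbacks) is an assumption for illustration, not the charm's actual code:

```python
class State:
    """Minimal stand-in for the operator framework's StoredState."""
    def set_default(self, **kwargs):
        for key, value in kwargs.items():
            if not hasattr(self, key):
                setattr(self, key, value)

actions = []

def on_config_changed(state, defer, trigger_rollout_restart, reload_config):
    state.set_default(reload_verify=False)
    if state.reload_verify:
        # Deferred re-run: the ConfigMap should be updated by now,
        # so ask Prometheus to reload one more time (step 5).
        reload_config()
        state.reload_verify = False
    else:
        trigger_rollout_restart()   # step 1: kick the rollout restart
        state.reload_verify = True  # step 2: flag the pending verification
        defer()                     # step 2: re-queue the event

# Simulate the first run and the deferred re-run
state = State()
for _ in range(2):
    on_config_changed(
        state,
        defer=lambda: actions.append("defer"),
        trigger_rollout_restart=lambda: actions.append("restart"),
        reload_config=lambda: actions.append("reload"))
print(actions)  # ['restart', 'defer', 'reload']
```

The first invocation kicks the restart and defers; the re-triggered invocation sees the flag and performs the verification reload.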
@exceptorr there's a potential race condition between the time the pods restart and the time `config-changed` exits. That means there's still a probability that the Prometheus configuration is not updated until 3 to 5 minutes later, when the deferred event is re-triggered.
That's why the `reload-verify` flag is needed: after the pod is restarted (basically, re-created from the Juju perspective), it will re-invoke the `config-changed` handler, which will tell Prometheus to reload its config from the ConfigMap one more time if this flag is set.
I'm not familiar with `reload-verify` at this point. Let's see it in code then! :-)
This is meant to be a `StoredState` property, like `is_started` in src/charm.py:125:

```python
if not state.is_started:
    state.is_started = True
```

Just another flag to improve the control and execution flow. I will work on it on Monday so you can review it soon. :-)
OK, so leaving `rollout restart` aside, let's stick with the original plan of using `.defer()`, so @relaxdiego one more review?
Known issue: one of the tests, `OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready`, fails:

```
        # Exercise
        charm.on_config_changed_handler(mock_event,
                                        mock_fw_adapter,
>                                       mock_state)

test/charm_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/charm.py:137: in on_config_changed_handler
    if not state.is_started:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <NonCallableMagicMock name='StoredState()' spec_set='StoredState' id='4521847824'>, name = 'is_started'

    def __getattr__(self, name):
        if name in {'_mock_methods', '_mock_unsafe'}:
            raise AttributeError(name)
        elif self._mock_methods is not None:
            if name not in self._mock_methods or name in _all_magics:
>               raise AttributeError("Mock object has no attribute %r" % name)
E               AttributeError: Mock object has no attribute 'is_started'
```
To be addressed.
> Known issue: one of the tests, `OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready`, fails:

I've addressed it via https://github.com/charmed-lma/charm-k8s-prometheus/pull/20/commits/2e64824df245fafe749db594843d35fe0ea0d365. Now all that's needed are additional unit tests covering the new logic you added to `charm.py` and `domain.py`.
Refactored, added more unit tests, deployed, verified, ready for the final review. @relaxdiego your turn! ;-)
Two comments to address (@relaxdiego, any ideas will be very welcome and appreciated!):

> 1) There is some time (~1 min) between the ConfigMap change (read: after Juju applies the new pod spec, so the ConfigMap has actually changed) and the new config file actually being rendered on the filesystem. Therefore, calling `/-/reload` in the `config-changed` hook immediately does not make any sense, since the new config has not arrived. Still need to think about how we can wait for the new config file or make the process faster (e.g. why does the config file arrive with such a noticeable delay? Do we have any other option to deliver the config file to the service container?)
>
> 2) Failing unit test: