charmed-lma / charm-k8s-prometheus

Kubernetes Operator for Prometheus
Apache License 2.0

Reload configuration file when changed #20

Closed: exceptorr closed this 4 years ago

exceptorr commented 4 years ago

Two comments to address (@relaxdiego - any ideas will be very welcome and appreciated!):

1) There is some delay (~1 min) between the ConfigMap change (read: after Juju applies the new pod spec, so the ConfigMap has actually changed) and the new config file actually being rendered on the filesystem. Therefore, calling /-/reload in the config-changed hook immediately makes no sense, since the new config has not arrived yet. I still need to think about how we can wait for the new config file, or make the process faster (e.g., why does the config file arrive with such a noticeable delay? Do we have any other option for delivering the config file to the service container?)

2) Failing unit test:

______________________________________________ OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready ______________________________________________

self = <charm_test.OnConfigChangedHandlerTest testMethod=test__it_blocks_until_pod_is_ready>
mock_stored_state_cls = <MagicMock name='StoredState' spec_set='StoredState' id='4358979408'>, mock_pod_spec = <function set_juju_pod_spec at 0x103d3b560>
mock_juju_pod_spec = <function build_juju_pod_spec at 0x103d5ae60>, mock_time = <NonCallableMagicMock name='time' spec_set='module' id='4358998224'>
mock_k8s_mod = <NonCallableMagicMock name='k8s' spec_set='module' id='4359109776'>
mock_build_juju_unit_status_func = <function build_juju_unit_status at 0x103d5add0>

    @patch('charm.build_juju_unit_status', spec_set=True, autospec=True)
    @patch('charm.k8s', spec_set=True, autospec=True)
    @patch('charm.time', spec_set=True, autospec=True)
    @patch('charm.build_juju_pod_spec', spec_set=True, autospec=True)
    @patch('charm.set_juju_pod_spec', spec_set=True, autospec=True)
    @patch('charm.StoredState', spec_set=True, autospec=True)
    def test__it_blocks_until_pod_is_ready(
            self,
            mock_stored_state_cls,
            mock_pod_spec,
            mock_juju_pod_spec,
            mock_time,
            mock_k8s_mod,
            mock_build_juju_unit_status_func):
        # Setup
        mock_fw_adapter_cls = \
            create_autospec(framework.FrameworkAdapter, spec_set=True)
        mock_fw_adapter = mock_fw_adapter_cls.return_value

        mock_juju_unit_states = [
            MaintenanceStatus(str(uuid4())),
            MaintenanceStatus(str(uuid4())),
            ActiveStatus(str(uuid4())),
        ]
        mock_build_juju_unit_status_func.side_effect = mock_juju_unit_states

        mock_event_cls = create_autospec(EventBase, spec_set=True)
        mock_event = mock_event_cls.return_value

        mock_state = mock_stored_state_cls.return_value
>       mock_state.set_default(is_started=False)

test/charm_test.py:113:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <NonCallableMagicMock name='StoredState()' spec_set='StoredState' id='4355493072'>, name = 'set_default'

    def __getattr__(self, name):
        if name in {'_mock_methods', '_mock_unsafe'}:
            raise AttributeError(name)
        elif self._mock_methods is not None:
            if name not in self._mock_methods or name in _all_magics:
>               raise AttributeError("Mock object has no attribute %r" % name)
E               AttributeError: Mock object has no attribute 'set_default'

/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/mock.py:598: AttributeError
relaxdiego commented 4 years ago
1. There is some delay (~1 min) between the ConfigMap change (read: after Juju applies the new pod spec, so the ConfigMap has actually changed) and the new config file actually being rendered on the filesystem. Therefore, calling `/-/reload` in the `config-changed` hook immediately makes no sense, since the new config has not arrived yet. I still need to think about how we can wait for the new config file, or make the process faster (e.g., why does the config file arrive with such a noticeable delay? Do we have any other option for delivering the config file to the service container?)

I wonder if it has something to do with this: https://kubernetes.io/docs/concepts/configuration/configmap/#mounted-configmaps-are-updated-automatically

If the delay is caused by some setting in the underlying k8s cluster, then we'll have to take that into consideration and maybe keep polling to find out when the file has finally been updated before telling Prometheus to reload.
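
For reference, here is a minimal polling sketch of that idea (the helper name, timeout, and interval are assumptions for illustration, not the charm's actual code):

import time
import yaml

def wait_for_rendered_config(path, expected, timeout=300, interval=5):
    """Poll the mounted config file until its parsed content matches
    `expected`, giving up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                if yaml.safe_load(f) == expected:
                    return True
        except FileNotFoundError:
            pass  # the kubelet may not have rendered the file yet
        time.sleep(interval)
    return False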

2. Failing unit test:
(full traceback quoted above, ending in `AttributeError: Mock object has no attribute 'set_default'`)

This is because framework.StoredState doesn't statically define a set_default method. Rather, it uses some metaprogramming to set it, which @patch doesn't see, so it reports that the method doesn't exist on the mocked object. You could set spec_set=False, but that runs the risk of your mock's method signatures going out of sync with the methods of the real object.
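
One alternative is to swap the autospec'd StoredState for a hand-rolled stub whose attributes are plain instance attributes, so dynamically-set names resolve normally. A minimal sketch (the stub class is hypothetical, not part of this repo):

class StoredStateStub:
    """Stand-in for the framework's StoredState in unit tests."""

    def set_default(self, **kwargs):
        # Mirror set_default's contract: only set keys that are absent
        for key, value in kwargs.items():
            if not hasattr(self, key):
                setattr(self, key, value)

mock_state = StoredStateStub()
mock_state.set_default(is_started=False)
assert mock_state.is_started is False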

exceptorr commented 4 years ago

Still in progress and missing some unit tests, but at least it now waits until the ConfigMap is applied to the pod.

@relaxdiego mind reviewing while I'm working on the tests?

relaxdiego commented 4 years ago

I modified your code slightly to use event.defer() as follows:

if not state.is_started:
    state.is_started = True
else:
    logging.debug("Call HTTP config reload")
    try:
        fw_adapter.set_unit_status(MaintenanceStatus(
            "Waiting for config parameters propagation"
        ))
        expected_config = build_prometheus_config(fw_adapter.get_config())
        logging.debug(
            "Awaiting Prom config to become: {0}".format(expected_config)
        )

        # Wait until the ConfigMap changes have propagated
        prometheus_config_reload(juju_model, juju_app)
        new_config_applied = prometheus_ensure_config_propagated(
            juju_model, juju_app, expected_config
        )
        if not new_config_applied:
            logger.debug("**** Config not propagated. DEFERRING")
            logger.debug("**** Config not propagated. DEFERRING")
            logger.debug("**** Config not propagated. DEFERRING")
            event.defer()
        else:
            logging.debug("Config reloaded")
            fw_adapter.set_unit_status(ActiveStatus())
    except CharmError as e:
        error_msg = "Config reload error: {0}".format(e)
        logging.error(error_msg)
        fw_adapter.set_unit_status(BlockedStatus(error_msg))

It works BUT there is (AT LEAST) a 3-minute delay:

application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
application-prometheus: 22:47:10 DEBUG unit.prometheus/0.juju-log **** Config not propagated. DEFERRING
application-prometheus: 22:50:34 DEBUG unit.prometheus/0.juju-log Building Juju pod spec
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Build prom config: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Rendered CLI args: --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.console.templates=/usr/share/prometheus/consoles --web.console.libraries=/usr/share/prometheus/console_libraries --log.level=info --web.page-title="Charmed LMA - Prometheus TSDB, Collection & Processing" --web.max-connections=512 --storage.tsdb.retention.time=15d --alertmanager.notification-queue-capacity=10000 --alertmanager.timeout=10s
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Configuring pod: set PodSpec to: {'containers': [{'name': 'prometheus', 'imageDetails': {'imagePath': 'prom/prometheus:v2.18.1', 'username': '', 'password': ''}, 'args': ['--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.path=/prometheus', '--web.enable-lifecycle', '--web.console.templates=/usr/share/prometheus/consoles', '--web.console.libraries=/usr/share/prometheus/console_libraries', '--log.level=info', '--web.page-title="Charmed LMA - Prometheus TSDB, Collection & Processing"', '--web.max-connections=512', '--storage.tsdb.retention.time=15d', '--alertmanager.notification-queue-capacity=10000', '--alertmanager.timeout=10s'], 'ports': [{'containerPort': 9090, 'protocol': 'TCP'}], 'readinessProbe': {'httpGet': {'path': '/-/ready', 'port': 9090}, 'initialDelaySeconds': 10, 'timeoutSeconds': 30}, 'livenessProbe': {'httpGet': {'path': '/-/healthy', 'port': 9090}, 'initialDelaySeconds': 30, 'timeoutSeconds': 30}, 'files': [{'name': 'config', 'mountPath': '/etc/prometheus', 'files': {'prometheus.yml': "alerting: {}\nglobal:\n  evaluation_interval: 1m\n  external_labels: {}\n  scrape_interval: 100s\n  scrape_timeout: 10s\nscrape_configs:\n- honor_timestamps: true\n  job_name: prometheus\n  metrics_path: /metrics\n  scheme: http\n  scrape_interval: 5s\n  scrape_timeout: 5s\n  static_configs:\n  - targets: ['localhost:9090']\n"}}]}]}
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Checking k8s pod readiness
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log GET kubernetes.default.svc//api/v1/namespaces/lma/pods?labelSelector=juju-app=prometheus
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Received k8s pod status: <adapters.k8s.PodStatus object at 0x7fed2628a4a8>
application-prometheus: 22:50:35 DEBUG unit.prometheus/0.juju-log Built unit status: ActiveStatus('')
application-prometheus: 22:50:36 DEBUG unit.prometheus/0.juju-log Call HTTP config reload
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Build prom config: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Awaiting Prom config to become: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Calling Prom API: POST /-/reload
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Calling Prom API: GET /api/v1/status/config
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Expected: {'global': {'external_labels': {}, 'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log No external_labels received, adding empty key
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log No alerting received, adding empty key
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Received from API: {'global': {'scrape_interval': '100s', 'scrape_timeout': '10s', 'evaluation_interval': '1m', 'external_labels': {}}, 'scrape_configs': [{'job_name': 'prometheus', 'honor_timestamps': True, 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {}}
application-prometheus: 22:50:37 DEBUG unit.prometheus/0.juju-log Config reloaded

3 minutes is not ideal, so I will try the sidecar container approach for the rest of today.

relaxdiego commented 4 years ago

Referencing a blocker to this PR: https://bugs.launchpad.net/juju/+bug/1882976

relaxdiego commented 4 years ago

Status so far: I tried using JujuPodSpec V2 as described here because it lets you define a ConfigMap resource independently. I was hoping to mount that in both the main container and the sidecar container. HOWEVER, there seems to be no way to mount that independent ConfigMap into any of the containers. At least, that's what I can tell from the V2 spec's code. Trying a different approach.

relaxdiego commented 4 years ago

The sidecar container approach is not worth the effort and is too much of a hack. Abandoning that. I would suggest just adding an action that tells Prometheus to reload the config.
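
Roughly, such an action could reuse the prometheus_config_reload helper seen in the snippet above. A hedged sketch (the handler name and signature are assumptions; CharmError is the exception type used earlier in this thread):

def on_reload_config_action(event, juju_model, juju_app):
    """Let an operator trigger Prometheus's /-/reload endpoint on demand."""
    try:
        prometheus_config_reload(juju_model, juju_app)
        event.set_results({'reloaded': True})
    except CharmError as e:
        event.fail("Config reload error: {0}".format(e))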

relaxdiego commented 4 years ago

@exceptorr just so you know, I was able to get a k8s Job to call `/-/reload`. Below is the manifest. For now I've just been applying it to k8s using `microk8s.kubectl apply -f <path-to-below-manifest>`, but a tiny modification to the k8s adapter could replicate that manual process.

apiVersion: v1
kind: ConfigMap
metadata:
  name: expected-config-yaml
data:
  expected-config.yml: |
    ---
    global:
      scrape_interval: '1m'
      scrape_timeout: '10s'
      evaluation_interval: '1m'
    scrape_configs:
      - job_name: 'prometheus'
        honor_timestamps: True
        scrape_interval: '5s'
        scrape_timeout: '5s'
        metrics_path: '/metrics'
        scheme: 'http'
        static_configs:
          - targets:
              - 'localhost:9090'
      - job_name: 'some-other-service'
        honor_timestamps: True
        scrape_interval: '5s'
        scrape_timeout: '5s'
        metrics_path: '/metrics'
        scheme: 'http'
        static_configs:
          - targets:
              - 'alertmanager:9093'
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: reload-config-script
data:
  reload-config.py: |
    import http.client
    import json
    import logging
    import sys
    import yaml
    import time

    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    logger = logging.getLogger()

    def get_current_config():
      logger.debug("Getting current Prometheus config")
      conn = http.client.HTTPConnection('prometheus:9090')
      conn.request('GET', '/api/v1/status/config')
      response = conn.getresponse()

      logger.debug("Get response.status={}".format(response.status))

      if response.status < 200 or response.status >= 300:
          logger.error("Failed to get config from Prometheus")
          sys.exit(1)

      payload = json.loads(response.read())
      return yaml.safe_load(payload['data']['yaml'])

    current_config = get_current_config()
    logger.debug(current_config)

    logger.debug("Loading expected config")

    with open('/opt/etc/expected-config.yml', 'r') as file:
      expected_config = yaml.safe_load(file)

    logger.debug(expected_config)

    current_config_is_synced = current_config == expected_config

    while not current_config_is_synced:
      logger.info("Current config does not match expected config")
      logger.debug("Reloading prometheus config")
      conn = http.client.HTTPConnection('prometheus:9090')
      conn.request('POST', '/-/reload')
      response = conn.getresponse()

      if response.status < 200 or response.status >= 300:
        logger.error("Got unexpected response from Prometheus API")
        logger.debug(response.getheaders())

      time.sleep(5)
      current_config_is_synced = get_current_config() == expected_config

    logger.info("Current config matches expected config. Exiting")
    sys.exit(0)
---
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    job-name: reload-prometheus-config
  name: reload-prometheus-config
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  ttlSecondsAfterFinished: 3600
  # Note: a manually-specified Job selector requires manualSelector,
  # otherwise the API server rejects it as not auto-generated
  manualSelector: true
  selector:
    matchLabels:
      job-name: reload-prometheus-config
  template:
    metadata:
      labels:
        job-name: reload-prometheus-config
    spec:
      containers:
      - image: jujusolutions/jujud-operator:2.7.6
        volumeMounts:
          - name: reload-config-script
            mountPath: /opt/bin
          - name: expected-config-yaml
            mountPath: /opt/etc
        command:
          - /usr/bin/python3
        args:
          - '/opt/bin/reload-config.py'
        imagePullPolicy: IfNotPresent
        name: reload-prometheus-config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      volumes:
        - name: reload-config-script
          configMap:
            name: reload-config-script
        - name: expected-config-yaml
          configMap:
            name: expected-config-yaml
      restartPolicy: Never
      terminationGracePeriodSeconds: 300
exceptorr commented 4 years ago

Found a quicker way to reload the config:

ubuntu@node10:~$ kubectl rollout restart statefulset/prometheus -n lma16
statefulset.apps/prometheus restarted

That basically restarts every application pod, so it reloads the ConfigMap. The only thing left to verify is whether this can work in the scope of config-changed (i.e., whether the rollout restart can be invoked in config-changed, exiting the latter immediately after the invocation).
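
For what it's worth, kubectl rollout restart just patches the StatefulSet's pod-template annotations, so the charm's k8s adapter could replicate it with an in-cluster API call along these lines (a sketch only: the function name is made up, and the token/CA paths are the standard in-pod service-account locations):

import datetime
import http.client
import json
import ssl

def rollout_restart_statefulset(namespace, name):
    """Mimic `kubectl rollout restart` by patching the pod template's
    restartedAt annotation, forcing the StatefulSet to roll its pods."""
    with open('/var/run/secrets/kubernetes.io/serviceaccount/token') as f:
        token = f.read()
    patch = {'spec': {'template': {'metadata': {'annotations': {
        'kubectl.kubernetes.io/restartedAt':
            datetime.datetime.utcnow().isoformat() + 'Z'}}}}}
    ctx = ssl.create_default_context(
        cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt')
    conn = http.client.HTTPSConnection('kubernetes.default.svc', context=ctx)
    conn.request(
        'PATCH',
        '/apis/apps/v1/namespaces/{0}/statefulsets/{1}'.format(namespace, name),
        body=json.dumps(patch),
        headers={'Authorization': 'Bearer ' + token,
                 'Content-Type': 'application/strategic-merge-patch+json'})
    return conn.getresponse().status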

relaxdiego commented 4 years ago

@exceptorr I just realized: if you run this within the context of config-changed, doesn't that mean that the Juju pod spec, and hence the ConfigMap too, has not been updated yet?

exceptorr commented 4 years ago

@relaxdiego You are right, but here's a plan:

1) The charm issues a k8s API call to initiate the rollout restart (in scope of config-changed).
2) The charm sets a StoredState.reload-verify flag and calls .defer() (in scope of config-changed).
3) config-changed exits -> the ConfigMap updates.
4) The restarted pods fetch the new ConfigMap.
5) If anything was missed, the deferred event will reload the config from the updated ConfigMap, since we are no longer in the config-changed context and the ConfigMap has been updated.

Moreover, the rollout restart is detected by Juju itself (e.g., app units will not be in the active/idle state), so the operator will be aware of the re-configuration process in progress.

What do you think?

relaxdiego commented 4 years ago

@exceptorr there's a potential race condition between the time the pods restart and the time config-changed exits. That means there's still a probability that the Prometheus configuration is not updated until 3~5 minutes later, when the deferred event is re-triggered.

exceptorr commented 4 years ago

That's why the reload-verify flag is needed: after the pod is restarted (basically, re-created from the Juju perspective), it will re-invoke the config-changed handler, which will tell Prometheus to reload its config from the ConfigMap one more time if this flag is set.

relaxdiego commented 4 years ago

I'm not familiar with reload-verify at this point. Let's see it in code then! :-)

exceptorr commented 4 years ago

This is meant to be a StoredState property, like is_started in src/charm.py:125:

if not state.is_started:
    state.is_started = True

Just another flag to improve the control and execution flow. I will work on that on Monday so you can review it soon. :-)
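
For illustration, a rough sketch of how that flag could slot into the handler shown earlier (hypothetical code reusing names from the snippets above; the real implementation was still to be written at this point):

state.set_default(is_started=False, reload_verify=False)

if not state.is_started:
    state.is_started = True
elif state.reload_verify:
    # The pod has been re-created with the updated ConfigMap;
    # reload once more and clear the flag so later events skip this branch.
    prometheus_config_reload(juju_model, juju_app)
    state.reload_verify = False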

exceptorr commented 4 years ago

OK, so leaving rollout restart aside, let's stick with the original plan of using .defer(). @relaxdiego, one more review?

Known issue: one of the tests, OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready, fails:


        # Exercise
        charm.on_config_changed_handler(mock_event,
                                        mock_fw_adapter,
>                                       mock_state)

test/charm_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/charm.py:137: in on_config_changed_handler
    if not state.is_started:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <NonCallableMagicMock name='StoredState()' spec_set='StoredState' id='4521847824'>, name = 'is_started'

    def __getattr__(self, name):
        if name in {'_mock_methods', '_mock_unsafe'}:
            raise AttributeError(name)
        elif self._mock_methods is not None:
            if name not in self._mock_methods or name in _all_magics:
>               raise AttributeError("Mock object has no attribute %r" % name)
E               AttributeError: Mock object has no attribute 'is_started'

To be addressed.

relaxdiego commented 4 years ago

Known issue: one of the tests, OnConfigChangedHandlerTest.test__it_blocks_until_pod_is_ready, fails:

I've addressed it via https://github.com/charmed-lma/charm-k8s-prometheus/pull/20/commits/2e64824df245fafe749db594843d35fe0ea0d365. Now all that's needed is additional unit tests covering the new logic you added to charm.py and domain.py.

exceptorr commented 4 years ago

Refactored, added more unit tests, deployed, verified, ready for the final review. @relaxdiego your turn! ;-)