canonical / hotsos

Software analysis toolkit. Define checks in high-level language and leverage library to perform analysis of common Cloud applications.
Apache License 2.0

check for "experiencing BlueFS spillover" in ceph #258

Closed nkshirsagar closed 2 years ago

nkshirsagar commented 2 years ago

As described in https://tracker.ceph.com/issues/38745 and case 00326782, hotsos should check for the message "BlueFS spillover detected" to catch the RocksDB level-sizing issue.

$ ceph -s
  cluster:
    id:     815ea021-7839-4a63-9dc1-14f8c5feecc6
    health: HEALTH_WARN
            BlueFS spillover detected on 1 OSD(s) <--
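
A minimal sketch of the string check, assuming the status text is already available as a Python string. The function name `spillover_osd_count` and the sample text are illustrative only and are not part of hotsos.

```python
import re

# Matches the health warning quoted above and captures the OSD count.
SPILLOVER_RE = re.compile(r"BlueFS spillover detected on (\d+) OSD\(s\)")


def spillover_osd_count(status_text):
    """Return the number of OSDs reported with spillover, or 0 if none."""
    match = SPILLOVER_RE.search(status_text)
    return int(match.group(1)) if match else 0


sample = """\
  cluster:
    id:     815ea021-7839-4a63-9dc1-14f8c5feecc6
    health: HEALTH_WARN
            BlueFS spillover detected on 1 OSD(s)
"""
print(spillover_osd_count(sample))
```

Capturing the count rather than returning a boolean keeps the number of affected OSDs available for the report message.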

Also, even if this situation has not been hit, for ceph-osd versions 15.2.6 through 15.2.10 we should check whether the option bluestore_volume_selection_policy is set to "use_some_extra"; if it is not, the vulnerability still exists.

So the second check looks for the vulnerability when the ceph packages fall between 15.2.6 and 15.2.10:

$ sudo ceph daemon osd.1 config show | grep -i policy
    "bluestore_volume_selection_policy": "use_some_extra",
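
The second check can be sketched roughly as below. This assumes plain dotted numeric versions such as "15.2.8"; real package versions usually carry distro suffixes that this parser does not handle. The helper names `parse_version` and `is_vulnerable` are hypothetical.

```python
def parse_version(version):
    """Turn '15.2.6' into (15, 2, 6). Assumes a plain dotted numeric version."""
    return tuple(int(part) for part in version.split("."))


def is_vulnerable(ceph_osd_version, policy,
                  min_affected="15.2.6", max_affected="15.2.10"):
    """True if the version is in the affected range and the policy is unsafe."""
    in_range = (parse_version(min_affected)
                <= parse_version(ceph_osd_version)
                <= parse_version(max_affected))
    return in_range and policy != "use_some_extra"


# Any policy value other than use_some_extra counts as unsafe here.
print(is_vulnerable("15.2.8", "rocksdb_original"))
print(is_vulnerable("15.2.8", "use_some_extra"))
```

Comparing integer tuples rather than strings matters here, because as strings "15.2.10" sorts before "15.2.6".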

dosaboy commented 2 years ago

The following can be used as a guide on how to implement this: https://github.com/canonical/hotsos/blob/master/defs/scenarios/storage/ceph/auth_insecure_global_id_reclaim_allowed.yaml

nkshirsagar commented 2 years ago

There are two parts to this check.

a) simply check if we see BlueFS spillover detected in the ceph status, or the detailed status, and report the bug if that string is found.

b) Check between ceph-osd versions 15.2.6->15.2.10 to see if bluestore_volume_selection_policy is not set to use_some_extra and if that is the case, report the vulnerability.

I can see that a) is easily implemented using the approach shown in https://github.com/canonical/hotsos/blob/master/defs/scenarios/storage/ceph/auth_insecure_global_id_reclaim_allowed.yaml, but I am not sure whether scenarios can also take min-version, fixed-version, etc. the way bugs do, or how I'd check the value of bluestore_volume_selection_policy. Would I need to add a property to the class CephMon(CephDaemonBase) that returns that value, just like we do for osdmaps_count?

class CephMon(CephDaemonBase):

    def __init__(self):
        super().__init__('mon')

    @property
    def osdmaps_count(self):
        report = self.cli.ceph_report_json_decoded()
        if not report:
            return 0

        try:
            return len(report['osdmap_manifest']['pinned_maps'])
        except (ValueError, KeyError):
            return 0
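
For comparison, a property exposing the policy value might look roughly like the sketch below. `OSDConfigSketch` is a hypothetical stand-in that takes the JSON text printed by `ceph daemon osd.N config show` directly; it is not an existing hotsos class and does not use the hotsos CLI helpers.

```python
import json


class OSDConfigSketch:
    """Hypothetical stand-in; not a class that exists in hotsos."""

    def __init__(self, config_show_output):
        # Raw text of `ceph daemon osd.N config show`, which prints JSON.
        self._raw = config_show_output

    @property
    def bluestore_volume_selection_policy(self):
        """Return the configured policy, or None if it cannot be read."""
        try:
            config = json.loads(self._raw)
            return config.get("bluestore_volume_selection_policy")
        except (ValueError, TypeError, AttributeError):
            # Empty, missing, or unexpectedly shaped output.
            return None


osd = OSDConfigSketch('{"bluestore_volume_selection_policy": "use_some_extra"}')
print(osd.bluestore_volume_selection_policy)
```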

dosaboy commented 2 years ago

@nkshirsagar so you can implement this in one of two ways: either using bugs (I have added support for the requires property to bugchecks [1]) or as a scenario. If you use a scenario you will need to implement all the requirements as properties (which you can then call using the requires property). It's up to you which one fits best.

[1] https://github.com/canonical/hotsos/tree/master/defs#bugs

nkshirsagar commented 2 years ago

thank you for https://github.com/canonical/hotsos/pull/273/files @dosaboy++

I am trying to consume it this way, as a command, with both the a) and b) parts defined in bugs/storage/ceph.yaml: one bug check under 38745 and the vulnerability under vulnerable_to_38745.

38745:
  settings:
    package: 'ceph-osd'
  input:
    command: ceph_health_detail_json_decoded
  expr: '.+BlueFS spillover detected on'
  hint: 'spillover'
  raises:
    message: >-
      installed package '{package_name}' with version {version_current} has a
      known bug https://tracker.ceph.com/issues/38745.
vulnerable_to_38745:
  input:
    command: ceph_daemon_config_show
    options:
      kwargs:
        name: 'bluestore_volume_selection_policy'
    value: 'use_some_extra'
    operator: ne
  settings:
    package: 'ceph-osd'
    versions-affected:
      - min-broken: 15.2.6
        min-fixed: 15.2.11
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra

dosaboy commented 2 years ago

posting info from discussion with @nkshirsagar

now that https://github.com/canonical/hotsos/commit/21ad59ed5878c03afc9a1de5553181c93713cb64 is landed you can do:

requires:
  or:

And lose the input: property. This way you get to check config for all osds.

nkshirsagar commented 2 years ago

Thank you for the help @dosaboy

But I'm still having a build failure. I've simplified the YAML to this:

38745:
  input:
    command: ceph_health_detail_json_decoded
  expr: '.+BlueFS spillover detected on'
  hint: 'spillover'
  raises:
    message: >-
      installed package '{package_name}' with version {version_current} has a
      known bug https://tracker.ceph.com/issues/38745.
38745:
  settings:
    package: 'ceph-osd'
    versions-affected:
      - min-broken: 15.2.6
        min-fixed: 15.2.11
  requires:
    or:
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: [use_some_extra]
        operator: ne
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: []
        operator: ne
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra

The build fails with:

======================================================================
ERROR: test_bug_checks (test_storage.TestStorageBugChecks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/.tox/py3/lib/python3.8/site-packages/mock/mock.py", line 1346, in patched
    return func(*newargs, **newkeywargs)
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/tests/unit/test_storage.py", line 641, in test_bug_checks
    YBugChecker()()
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/__init__.py", line 630, in __call__
    return self.run_checks()
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/__init__.py", line 620, in run_checks
    ret = self.run(self.searchobj.search())
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/bugs.py", line 133, in run
    pkg_ver = bugsearch['context'].apt_all.get(pkg)
AttributeError: 'NoneType' object has no attribute 'apt_all'
-------------------- >> begin captured logging << --------------------
root: DEBUG: loading bugs definitions for plugin=storage
root: DEBUG: loaded plugin 'storage' bugs - sections=1, events=1
root: DEBUG: bug=38745 path=None
root: DEBUG: no search terms registered so nothing to do
root: DEBUG: requirements provided as groups
root: DEBUG: op=or has 2 requirement(s)
root: DEBUG: calling property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
root: DEBUG: creating filesearcher with max=2 processes
root: DEBUG: path=/tmp/tmp7wfcztu4/tmp9ryz5_2r
root: DEBUG: files=1 searches=1
root: DEBUG: completed searches (results=3)
root: DEBUG: requirement check: property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy <built-in function eq> ['use_some_extra'] (result=True)
root: DEBUG: calling property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
root: DEBUG: creating filesearcher with max=2 processes
root: DEBUG: path=/tmp/tmp7wfcztu4/tmpye383pma
root: DEBUG: files=1 searches=1
root: DEBUG: completed searches (results=3)
root: DEBUG: requirement check: property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy <built-in function eq> [] (result=False)
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 210 tests in 12.082s

FAILED (errors=1)
ERROR: InvocationError for command /home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/.tox/py3/bin/nosetests --verbose tests/unit (exited with code 1)

(I've used the same bug number for both the sections.)

It seems like hotsos expects a context section in the YAML. Looking at the code that's failing in core/ycheck/bugs.py:

            settings = bugsearch['settings']
            if settings and settings.versions_affected and settings.package:
                pkg = settings.package
                pkg_ver = bugsearch['context'].apt_all.get(pkg) <===
                if pkg_ver:
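
To illustrate the failure mode: the traceback indicates that bugsearch['context'] is None when no context section is defined, so the attribute lookup fails. Below is a rough sketch of a guarded lookup, assuming bugsearch behaves like a dict. `lookup_pkg_version` is a hypothetical helper, not hotsos code, and not necessarily the right fix.

```python
from types import SimpleNamespace


def lookup_pkg_version(bugsearch, pkg):
    """Hypothetical helper: return the package version, or None without a context."""
    context = bugsearch.get("context")
    if context is None:
        # No context section in the YAML, so no package data to consult.
        return None
    return context.apt_all.get(pkg)


# SimpleNamespace stands in for whatever object the context section provides.
with_context = {"context": SimpleNamespace(apt_all={"ceph-osd": "15.2.8"})}
print(lookup_pkg_version(with_context, "ceph-osd"))
print(lookup_pkg_version({"context": None}, "ceph-osd"))
```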

I see that in the openstack folder, openstack.yaml contains:

# This file is used to define overrides applicable to contents of this
# directory including subdirectories.
requires:
  property: core.plugins.openstack.OpenstackChecksBase.plugin_runnable
context:
  apt-all: core.plugins.openstack.OpenstackBase.apt_packages_all

and the OpenstackBase class contains:

class OpenstackBase(object):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ost_projects = OSTProjectCatalog()
        other_pkgs = self.ost_projects.package_dependencies
        self.apt_check = checks.APTPackageChecksBase(
                                  core_pkgs=self.ost_projects.packages_core,
                                  other_pkgs=other_pkgs)
        self.nova = NovaBase()
        self.neutron = NeutronBase()
        self.octavia = OctaviaBase()

    @property
    def apt_packages_all(self):   <--
        return self.apt_check.all

Would I need something equivalent in a new defs/bugs/storage/ceph.yaml file, like the openstack.yaml? There is no apt_packages_all defined in StorageBase or CephChecksBase that I can see, though.

dosaboy commented 2 years ago

OK, so following our latest conversation I have (hopefully) simplified things even further. See https://github.com/canonical/hotsos/commit/9f80f5928ac4ece39ff81494f8208c858a0c73ac for details, but in short: I have removed the context property entirely (so if you added it to fix the issues above, you can remove it again), removed the bugs handler's dependency on the settings property, and extended the apt type in the requires property to support version ranges. As an example, you could write your (second) bug check as follows:

38745:
  requires:
    and:
      - apt:
          ceph-osd:
            - min: 15.2.6
              max: 15.2.10
    or:
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: [use_some_extra]
        operator: ne
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: []
        operator: ne
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra

Note that in your YAML above you have two checks keyed with ID 38745, and since this translates to a dictionary, the second one will overwrite the first. So basically you can't define two checks with the same bug ID. Now that we support checking package versions in a requires property, you could consider implementing the check as a scenario: have a bug check for the first check (the one that looks for a log file line) and a scenario check for the one above. Or implement both as scenarios; it's up to you.
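
The overwrite behaviour can be seen with a small sketch. It uses the standard-library JSON parser as a stand-in for the YAML loader, since both end up building a Python dictionary. `load_rejecting_duplicates` is a hypothetical helper showing one way to catch the mistake early.

```python
import json


def load_rejecting_duplicates(text):
    """Parse JSON text, raising ValueError if any object repeats a key."""
    def check_pairs(pairs):
        keys = [key for key, _ in pairs]
        dupes = {key for key in keys if keys.count(key) > 1}
        if dupes:
            raise ValueError(f"duplicate keys: {sorted(dupes)}")
        return dict(pairs)
    return json.loads(text, object_pairs_hook=check_pairs)


doc = '{"38745": {"check": "spillover"}, "38745": {"check": "policy"}}'

# The default parser silently keeps only the last definition.
print(json.loads(doc))

try:
    load_rejecting_duplicates(doc)
except ValueError as exc:
    print(exc)
```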

nkshirsagar commented 2 years ago

Hi @dosaboy, I've implemented the vulnerability as a bug and the detection of the actual log string as a scenario. I left the vulnerability in the bug since I figured I'd need to implement the requirements as properties if I moved it to a scenario. The actual string check in the logs did not need to check any properties, so I moved it to scenarios. There's a problem though: as far as I can see there is no Launchpad bug opened for this, so I want to rely on the upstream bug. However, the "id" field in the hotsos output always appends the bug number to the Launchpad URL.

So when I test it on a sosreport, I see:

  bugs-detected:
    - desc: Vulnerability to a known bug https://tracker.ceph.com/issues/38745. Please
        set bluestore_volume_selection_policy of all OSDs to use_some_extra
      id: https://bugs.launchpad.net/bugs/38745
      origin: storage.auto_bug_check

I also see the scenario check show up properly in potential-issues. It needs some justification or line wrapping to avoid the one long line at the end; I shall figure that out.

  potential-issues:
    CephWarnings:
      - 'This node has Ceph OSDs running on it but is not using cpufreq scaling_governor
        in "performance" mode (actual=ondemand). This is not recommended and can result
        in performance degradation. To fix this you can install cpufrequtils and set
        "GOVERNOR=performance" in /etc/default/cpufrequtils. You will also need to
        disable the ondemand systemd service in order for changes to persist. NOTE:
        requires node reboot to take effect. (origin=storage.auto_scenario_check)'
      - Known ceph bug https://tracker.ceph.com/issues/38745 detected. See https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
        (origin=storage.auto_scenario_check)

I've sent this initial PR just to show you the way I'm doing it. If it is OK to do it this way, I can start work on the unit test.