Closed nkshirsagar closed 2 years ago
The following can be used as a guide on how to implement this: https://github.com/canonical/hotsos/blob/master/defs/scenarios/storage/ceph/auth_insecure_global_id_reclaim_allowed.yaml
There are two parts to this check.
a) Check whether "BlueFS spillover detected" appears in the ceph status (or the detailed status), and report the bug if that string is found.
b) For ceph-osd versions 15.2.6 through 15.2.10, check whether bluestore_volume_selection_policy is set to something other than use_some_extra and, if so, report the vulnerability.
I can see that a) is easily implemented using the approach shown in https://github.com/canonical/hotsos/blob/master/defs/scenarios/storage/ceph/auth_insecure_global_id_reclaim_allowed.yaml but I am not sure whether scenarios can also take min-version, fixed-version, etc. the way bugs do, or how I'd check the value of bluestore_volume_selection_policy. Would I need to add a property to the class CephMon(CephDaemonBase) and return that value, just like we do for osdmaps_count?
class CephMon(CephDaemonBase):

    def __init__(self):
        super().__init__('mon')

    @property
    def osdmaps_count(self):
        report = self.cli.ceph_report_json_decoded()
        if not report:
            return 0

        try:
            return len(report['osdmap_manifest']['pinned_maps'])
        except (ValueError, KeyError):
            return 0
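For illustration, a property for bluestore_volume_selection_policy could follow the same pattern. This is only a rough sketch: the class name OSDConfigSketch and the way the per-OSD output is passed in are hypothetical and not part of hotsos, and it assumes each OSD's `ceph daemon osd.N config show` output is available as JSON text.

```python
import json


class OSDConfigSketch:
    """Hypothetical helper for illustration only; not a hotsos class."""

    def __init__(self, config_show_by_osd):
        # Maps an OSD id to the raw JSON text of its `config show` output
        # (assumed input format).
        self.config_show_by_osd = config_show_by_osd

    @property
    def bluestore_volume_selection_policy(self):
        """Return the sorted list of distinct policy values across all OSDs."""
        values = set()
        for raw in self.config_show_by_osd.values():
            try:
                config = json.loads(raw)
            except ValueError:
                # Skip output that could not be decoded as JSON.
                continue

            if not isinstance(config, dict):
                continue

            value = config.get('bluestore_volume_selection_policy')
            if value is not None:
                values.add(value)

        return sorted(values)
```

Returning a list of distinct values means a single comparison can tell whether every OSD reports the same setting; an empty list means no OSD config could be read.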
@nkshirsagar so you can implement this in one of two ways: either using bugs (I have added support for the requires property to bugchecks [1]) or as a scenario. If you use a scenario you will need to implement all the requirements as properties (which you can then call using the requires property). It's up to you which one fits best.
[1] https://github.com/canonical/hotsos/tree/master/defs#bugs
thank you for https://github.com/canonical/hotsos/pull/273/files @dosaboy++
I am trying to consume it this way, as a command, with both the A and B parts defined in bugs/storage/ceph.yaml: the bug check under 38745 and the vulnerability check under vulnerable_to_38745.
38745:
  settings:
    package: 'ceph-osd'
  input:
    command: ceph_health_detail_json_decoded
  expr: '.+BlueFS spillover detected on'
  hint: 'spillover'
  raises:
    message: >-
      installed package '{package_name}' with version {version_current} has a
      known bug https://tracker.ceph.com/issues/38745.
vulnerable_to_38745:
  input:
    command: ceph_daemon_config_show
    options:
      kwargs:
        name: 'bluestore_volume_selection_policy'
        value: 'use_some_extra'
        operator: ne
  settings:
    package: 'ceph-osd'
    versions-affected:
      - min-broken: 15.2.6
        min-fixed: 15.2.11
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra
Posting info from discussion with @nkshirsagar.
Now that https://github.com/canonical/hotsos/commit/21ad59ed5878c03afc9a1de5553181c93713cb64 has landed you can do:
requires:
  or:
and drop the input: property. This way you get to check the config for all OSDs.
Thank you for the help @dosaboy
But I'm still having a build failure. I've simplified the YAML to this:
38745:
  input:
    command: ceph_health_detail_json_decoded
  expr: '.+BlueFS spillover detected on'
  hint: 'spillover'
  raises:
    message: >-
      installed package '{package_name}' with version {version_current} has a
      known bug https://tracker.ceph.com/issues/38745.
38745:
  settings:
    package: 'ceph-osd'
    versions-affected:
      - min-broken: 15.2.6
        min-fixed: 15.2.11
  requires:
    or:
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: [use_some_extra]
        operator: ne
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: []
        operator: ne
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra
The build fails with:
======================================================================
ERROR: test_bug_checks (test_storage.TestStorageBugChecks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/.tox/py3/lib/python3.8/site-packages/mock/mock.py", line 1346, in patched
    return func(*newargs, **newkeywargs)
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/tests/unit/test_storage.py", line 641, in test_bug_checks
    YBugChecker()()
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/__init__.py", line 630, in __call__
    return self.run_checks()
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/__init__.py", line 620, in run_checks
    ret = self.run(self.searchobj.search())
  File "/home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/core/ycheck/bugs.py", line 133, in run
    pkg_ver = bugsearch['context'].apt_all.get(pkg)
AttributeError: 'NoneType' object has no attribute 'apt_all'
-------------------- >> begin captured logging << --------------------
root: DEBUG: loading bugs definitions for plugin=storage
root: DEBUG: loaded plugin 'storage' bugs - sections=1, events=1
root: DEBUG: bug=38745 path=None
root: DEBUG: no search terms registered so nothing to do
root: DEBUG: requirements provided as groups
root: DEBUG: op=or has 2 requirement(s)
root: DEBUG: calling property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
root: DEBUG: creating filesearcher with max=2 processes
root: DEBUG: path=/tmp/tmp7wfcztu4/tmp9ryz5_2r
root: DEBUG: files=1 searches=1
root: DEBUG: completed searches (results=3)
root: DEBUG: requirement check: property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy <built-in function eq> ['use_some_extra'] (result=True)
root: DEBUG: calling property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
root: DEBUG: creating filesearcher with max=2 processes
root: DEBUG: path=/tmp/tmp7wfcztu4/tmpye383pma
root: DEBUG: files=1 searches=1
root: DEBUG: completed searches (results=3)
root: DEBUG: requirement check: property core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy <built-in function eq> [] (result=False)
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 210 tests in 12.082s
FAILED (errors=1)
ERROR: InvocationError for command /home/nikhil/HDD_MOUNT/Downloads/hotsos_fork/hotsos/.tox/py3/bin/nosetests --verbose tests/unit (exited with code 1)
(I've used the same bug number for both the sections.)
It seems like hotsos expects a context section in the YAML. Looking at the code that's failing in core/ycheck/bugs.py:
settings = bugsearch['settings']
if settings and settings.versions_affected and settings.package:
    pkg = settings.package
    pkg_ver = bugsearch['context'].apt_all.get(pkg)  # <===
    if pkg_ver:
I see in the openstack folder, openstack.yaml contains,
# This file is used to define overrides applicable to contents of this
# directory including subdirectories.
requires:
  property: core.plugins.openstack.OpenstackChecksBase.plugin_runnable
context:
  apt-all: core.plugins.openstack.OpenstackBase.apt_packages_all
and the OpenstackBase class contains,
class OpenstackBase(object):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.ost_projects = OSTProjectCatalog()
        other_pkgs = self.ost_projects.package_dependencies
        self.apt_check = checks.APTPackageChecksBase(
                                    core_pkgs=self.ost_projects.packages_core,
                                    other_pkgs=other_pkgs)
        self.nova = NovaBase()
        self.neutron = NeutronBase()
        self.octavia = OctaviaBase()

    @property
    def apt_packages_all(self):  # <--
        return self.apt_check.all
Would I need something equivalent in a new defs/bugs/storage/ceph.yaml file, like openstack.yaml? That said, there is no apt_packages_all defined in Storagebase or CephChecksBase that I can see.
OK, so following our latest conversation I have (hopefully) simplified things even further. See https://github.com/canonical/hotsos/commit/9f80f5928ac4ece39ff81494f8208c858a0c73ac but in short: I have removed the context property entirely (so if you added it to fix your issues above you can remove it again), removed the dependency on the settings property from the bugs handler, and extended the apt type in the requires property to support version ranges. As an example, you could write your (second) bug check as follows:
38745:
  requires:
    and:
      - apt:
          ceph-osd:
            - min: 15.2.6
              max: 15.2.10
    or:
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: [use_some_extra]
        operator: ne
      - property: core.plugins.storage.ceph.CephDaemonConfigShowAllOSDs.bluestore_volume_selection_policy
        value: []
        operator: ne
  raises:
    message: >-
      Vulnerability to a known bug https://tracker.ceph.com/issues/38745.
      Please set bluestore_volume_selection_policy of all OSDs to use_some_extra
Note that in your YAML above you have two checks keyed with ID 38745, and since this translates to a dictionary the second one will overwrite the first. So basically you can't define two checks with the same bug ID. Now that we have support for checking package versions in a requires property, you could consider implementing the check as a scenario: have a bug check for the first check (the one that looks for a log file line) and a scenario check for the one above. Or implement both as scenarios, up to you.
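The overwrite behaviour is the same thing that happens when a Python dictionary is built from repeated keys. A minimal sketch, using plain (key, value) pairs as a stand-in for the loaded definitions; the duplicate_ids helper is hypothetical and not part of hotsos:

```python
from collections import Counter


def duplicate_ids(pairs):
    """Return the check IDs that appear more than once, sorted."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)


# Two checks keyed with the same ID, in file order (illustrative values).
checks = [
    ('38745', {'kind': 'log line check'}),
    ('38745', {'kind': 'config check'}),
]

# Building a dict keeps only the last value stored under a repeated key.
loaded = dict(checks)
print(loaded)
print(duplicate_ids(checks))
```

A check like this, run over the definitions before they are merged, would flag the clash instead of silently dropping the first check.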
Hi @dosaboy, I've implemented the vulnerability as a bug and the detection of the actual log string as a scenario. I left the vulnerability in the bug since I figured I'd need to implement the requirements as properties if I moved it to a scenario. The string check in the logs did not need to check any properties, so I moved it to scenarios. There's a problem though: from what I can see there is no Launchpad bug open for this, so I want to rely on the upstream bug. However, the "id" field in the hotsos output always appends the bug number to the Launchpad URL...
So when I test it on a sosreport, I see,
bugs-detected:
  - desc: Vulnerability to a known bug https://tracker.ceph.com/issues/38745. Please
      set bluestore_volume_selection_policy of all OSDs to use_some_extra
    id: https://bugs.launchpad.net/bugs/38745
    origin: storage.auto_bug_check
I also see the scenario check show up properly in potential-issues. It needs some justification or line wrapping to avoid the one long line at the end; I shall figure that out.
potential-issues:
  CephWarnings:
    - 'This node has Ceph OSDs running on it but is not using cpufreq scaling_governor
      in "performance" mode (actual=ondemand). This is not recommended and can result
      in performance degradation. To fix this you can install cpufrequtils and set
      "GOVERNOR=performance" in /etc/default/cpufrequtils. You will also need to
      disable the ondemand systemd service in order for changes to persist. NOTE:
      requires node reboot to take effect. (origin=storage.auto_scenario_check)'
    - Known ceph bug https://tracker.ceph.com/issues/38745 detected. See https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
      (origin=storage.auto_scenario_check)
I've sent this initial PR just to show you the way I'm doing it.. if it is OK to do it this way then I can start to work on the unit test.
As described in https://tracker.ceph.com/issues/38745 and case 00326782, hotsos should check for the message "BlueFS spillover detected" to detect the rocksdb level-sizing issue.
$ ceph -s
  cluster:
    id:     815ea021-7839-4a63-9dc1-14f8c5feecc6
    health: HEALTH_WARN
            BlueFS spillover detected on 1 OSD(s) <--
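A minimal sketch of this first check, assuming the status output is available as text; the has_spillover function is hypothetical, and the pattern is the one used for `expr` elsewhere in this thread:

```python
import re

# Same pattern as the `expr` used in the bug definition in this thread.
SPILLOVER_EXPR = re.compile(r'.+BlueFS spillover detected on')


def has_spillover(status_text):
    """Return True if any line of the status output matches the pattern."""
    return any(SPILLOVER_EXPR.search(line)
               for line in status_text.splitlines())
```

Because the pattern starts with `.+`, at least one character (leading whitespace is enough) must precede "BlueFS" on the line; a line that begins exactly with "BlueFS" would not match.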
Also, even if this situation has not been hit, for ceph-osd versions 15.2.6 through 15.2.10 we should check whether the option bluestore_volume_selection_policy is set to something other than "use_some_extra", because in that case the vulnerability still exists.
So the second check to be made is for the vulnerability, when the ceph packages fall between 15.2.6 and 15.2.10:
$ sudo ceph daemon osd.1 config show | grep -i policy
    "bluestore_volume_selection_policy": "use_some_extra",
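The second check could be sketched roughly as follows. The function names are hypothetical, and the version handling is a simplification that only compares the leading dotted numeric part of the package version; it is not a full Debian version comparison.

```python
import re


def upstream_version(pkg_version):
    """Return the leading dotted numeric part as a tuple of ints.

    An optional epoch prefix such as '1:' is skipped; anything after the
    numeric part (e.g. '-0ubuntu1') is ignored.
    """
    match = re.match(r'(?:\d+:)?(\d+(?:\.\d+)*)', pkg_version)
    if not match:
        raise ValueError(f"unrecognised version: {pkg_version!r}")

    return tuple(int(part) for part in match.group(1).split('.'))


def is_vulnerable(pkg_version, policy, low='15.2.6', high='15.2.10'):
    """True if the version is in [low, high] and the policy is not set
    to use_some_extra."""
    in_range = (upstream_version(low)
                <= upstream_version(pkg_version)
                <= upstream_version(high))
    return in_range and policy != 'use_some_extra'
```

Comparing tuples of integers rather than strings matters here, since "15.2.10" sorts before "15.2.6" as a string but after it numerically.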