SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

runners/upgrade: disable time_server in ceph_salt_config (bsc#1177607) #1855

Closed: tserong closed this 4 years ago

tserong commented 4 years ago

This commit changes the exported ceph-salt config to set time_server disabled, which means ceph-salt apply will not make any changes to whatever existing time server configuration is in place.
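As a rough illustration of the effect, here is a minimal sketch. It assumes the exported config is a nested dictionary with a `time_server` section holding an `enabled` flag. The key names, the sample values and the helper `disable_time_server` are assumptions made for this example; they are not taken from the DeepSea runner itself.

```python
import copy


def disable_time_server(config):
    """Return a copy of `config` with time server management switched off.

    Assumption: the config is a plain dict and the time server settings
    live under a 'time_server' key with an 'enabled' flag.
    """
    result = copy.deepcopy(config)
    # Replace the whole section so no server settings are carried over;
    # with the section disabled, the existing setup is meant to stay as-is.
    result['time_server'] = {'enabled': False}
    return result


# Hypothetical input, for illustration only.
original = {'time_server': {'enabled': True, 'server_host': 'admin.ceph'}}
exported = disable_time_server(original)
print(exported['time_server'])
```

The helper copies the input rather than mutating it, so the caller's original dictionary stays unchanged.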

Signed-off-by: Tim Serong <tserong@suse.com>

Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1177607

tserong commented 4 years ago

This should be right, but I'm currently halfway through an upgrade, so I will confirm shortly.

tserong commented 4 years ago

Wait a minute... I finally got through a manual test and ended up with this:

admin:~ # ceph-salt apply
Syncing minions with the master...
Checking if minions respond to ping...
Pinging 9 minions...
Checking if ceph-salt formula is available...
Checking if minions have functioning DNS...
Running DNS lookups on 9 minions...
Checking if there is an existing Ceph cluster...
Ceph cluster already exists
/time_server is disabled. Will check if minions have a time_sync service enabled and running...
Checking time sync service on 9 minions...
Time sync issues detected on host(s) admin.ceph
/time_server is disabled. In that case, a time sync service must be enabled and running on all minions. Please fix this issue and try again.

tserong commented 4 years ago

/var/log/ceph-salt.log says

2020-10-13 11:48:54,021 [INFO] [ceph_salt.execute] probe_time_sync returned: {'data3.ceph': True, 'mon2.ceph': True, 'data2.ceph': True, 'data4.ceph': True, 'data5.ceph': True, 'mon3.ceph': True, 'data1.ceph': True, 'mon1.ceph': True, 'admin.ceph': False}
2020-10-13 11:48:54,021 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host data3.ceph
2020-10-13 11:48:54,021 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host mon2.ceph
2020-10-13 11:48:54,021 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host data2.ceph
2020-10-13 11:48:54,021 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host data4.ceph
2020-10-13 11:48:54,022 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host data5.ceph
2020-10-13 11:48:54,022 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host mon3.ceph
2020-10-13 11:48:54,022 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host data1.ceph
2020-10-13 11:48:54,022 [INFO] [ceph_salt.execute] Time sync service is enabled and running on host mon1.ceph
2020-10-13 11:48:54,022 [ERROR] [ceph_salt.execute] Time sync service is NOT enabled and running on host admin.ceph

...but maybe that makes sense? In my case, admin.ceph actually is the time server for this test cluster, so, um, does that mean the check in ceph-salt is incorrect? Or did DeepSea misconfigure admin.ceph? Or is there some other problem? @ricardoasmarques @swiftgist

smithfarm commented 4 years ago

@tserong Here is the Salt module that is run to determine whether time sync is enabled and running on a given minion:

def probe_time_sync():
    units = [
        'chrony.service',  # 18.04 (at least)
        'chronyd.service',  # el / opensuse
        'systemd-timesyncd.service',
        'ntpd.service',  # el7 (at least)
        'ntp.service',  # 18.04 (at least)
    ]
    if not _check_units(units):
        # The '{}' placeholder is needed so the unit list appears in the message
        log_msg = ('No time sync service is running; checked for: {}'
                   .format(', '.join(units)))
        log.warning(log_msg)
        return False
    return True

What you saw indicates that this module returned "False" when it ran on admin.ceph. You should be able to see more details in the admin.ceph minion log.
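The helper `_check_units` is not shown in this thread. The following is only a sketch of what such a check could look like, assuming it passes as soon as one of the listed units is both enabled and active according to `systemctl`. The names `check_units` and `_systemctl_ok` and the injectable `probe` parameter are made up for this example and are not the ceph-salt implementation.

```python
import subprocess


def _systemctl_ok(verb, unit):
    """Return True if `systemctl <verb> <unit>` exits with status 0."""
    try:
        proc = subprocess.run(
            ['systemctl', verb, unit],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # systemctl is not available on this host
        return False
    return proc.returncode == 0


def check_units(units, probe=_systemctl_ok):
    """Return True if any unit in `units` is both enabled and active."""
    for unit in units:
        if probe('is-enabled', unit) and probe('is-active', unit):
            return True
    return False
```

Under this assumption, a host whose time service is installed but neither enabled nor started would fail the check. Passing a fake `probe` makes the logic testable without a systemd host.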

The motivation for having this check is that mgr/cephadm itself will refuse to manage a node that does not pass this check.

smithfarm commented 4 years ago

Whether the check is "correct" or not is a good question. If admin.ceph is NOT going to be managed by cephadm, then one could make an argument that this check does not apply to it. But right now the check is applied to all the minions that have the ceph-salt:member grain.

Here are two ways this problem could be averted:

(1) add options to ceph-salt apply to make it skip certain checks if we have reason to fear that they might be too strict.

(2) have ceph-salt apply skip the time sync check if it detects a running Ceph cluster.
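Both options can be sketched as a small decision function. This is a hypothetical illustration: the function name, the `skip_checks` set and the `cluster_exists` flag are invented here and do not correspond to existing ceph-salt options. The input mirrors the per-host dictionary that `probe_time_sync` results are logged as.

```python
def hosts_failing_time_sync(probe_results, cluster_exists=False,
                            skip_checks=frozenset()):
    """Return the sorted hosts that fail the time sync check.

    An empty list is returned when the check is skipped.
    """
    # Option (1): the operator explicitly asked to skip this check.
    if 'time_sync' in skip_checks:
        return []
    # Option (2): a running cluster is assumed to have time sync set up.
    if cluster_exists:
        return []
    return sorted(host for host, ok in probe_results.items() if not ok)


results = {'mon1.ceph': True, 'data1.ceph': True, 'admin.ceph': False}
print(hosts_failing_time_sync(results))                       # ['admin.ceph']
print(hosts_failing_time_sync(results, cluster_exists=True))  # []
```

Option (2) would also have hidden the missing time service on admin.ceph reported above, which is one argument for making the skip explicit as in option (1).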

smithfarm commented 4 years ago

At any rate, I think the error @tserong encountered is not related to this PR. I figure we don't want ceph-salt to re-do the battle-tested production time sync setup from SES6 in any case.

tserong commented 4 years ago

> At any rate, I think the error @tserong encountered is not related to this PR. I figure we don't want ceph-salt to re-do the battle-tested production time sync setup from SES6 in any case.

Agreed. In my case, admin.ceph simply wasn't running chrony, even though it's meant to be the time server for my cluster. That turns out to be my own misconfiguration :-/ The SES6 deployment guide pretty clearly states that one needs to "Verify that the time synchronization service is enabled on each system start-up" (step 11 under https://documentation.suse.com/ses/6/html/ses-all/ceph-install-saltstack.html#ceph-install-stack).

smithfarm commented 4 years ago

@susebot run teuthology

susebot commented 4 years ago

Commit c8b24b2867df0abcd2dcb504a8ab3d6c29f837da is OK for suite deepsea:tier2. Check the test results in the Jenkins job: https://storage-ci.suse.de/job/pr-deepsea/484/