SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0
160 stars 75 forks source link

SES5: Fixes no UCST response after 1 seconds #1879

Closed p-se closed 3 years ago

p-se commented 3 years ago

Although this code seems to have been working for about 5 years, it doesn't do so anymore for me. I'm not sure what happened, but a second is insufficient now, and that reproducibly.

When changing the time to wait for a response up to two seconds, it works.

When the -t <int> argument is removed, a default of 5 seconds is used, so I decided to go with that.

Signed-off-by: Patrick Seidensal pseidensal@suse.com


So even if it turns out to be a local issue and we won't merge this change (or possibly won't create another DeepSea package to be released for SES 5), it may serve me to look the issue up if I forgot how to circumvent it or might serve as a documentation for others, who might run into it.

Note that I'm currently unable to simply sesdev create ses5 --single-node an SES5 cluster. At stage 3 I've got to ssh into the machine and adapt DeepSeas' code as contained in this PR, just to be able to rerun stage 3 manually, followed by having to run stage 4 manually.


The problem did not occur when deploying (single-node) SES6 or SES7.

The command to deploy was:

sesdev create ses5 --single-node -n
[...]

    master: Stage initialization output:
    master: firewall                 : disabled
    master: apparmor                 : disabled
    master: DEV_ENV                  : True
    master: fsid                     : valid
    master: public_network           : valid
    master: cluster_network          : valid
    master: cluster_interface        : valid
    master: monitors                 : valid
    master: mgrs                     : valid
    master: storage                  : valid
    master: rgw                      : valid
    master: ganesha                  : valid
    master: master_role              : valid
    master: fqdn                     : valid
    master: time_server              : ['master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds']
    master: [init] Executing runner ready.check... 
    master: 
    master:          |_  cmd.shell(/usr/sbin/iptables -S)... 
    master:                in master.ses5-mini.test... ok
    master:          |_  cmd.shell(/usr/sbin/aa-status --enabled 2>/dev/null; echo $?)... 
    master:                in master.ses5-mini.test... ok
    master:          |_  state.apply(ceph.apparmor.profiles, test=true, __kwarg__=True)... 
    master:                in master.ses5-mini.test... ok
    master: [init] Executing runner ready.check... ok
    master: [init] Executing runner changed.mon... 
    master: ok
    master: [init] Executing runner changed.osd... 
    master: ok
    master: [init] Executing runner changed.mgr... 
    master: ok
    master: [init] Executing runner changed.config... 
    master: ok
    master: [init] Executing runner changed.client... 
    master: ok
    master: [init] Executing runner changed.global... 
    master: ok
    master: [init] Executing runner changed.mds... 
    master: ok
    master: [init] Executing runner changed.igw... 
    master: ok
    master: [init] Executing runner deepsea_minions.matches... 
    master: ok
    master: [init] Executing runner select.minions... 
    master: ok
    master: [init] Executing runner select.minions... 
    master: ok
    master: [init] Executing runner select.minions... 
    master: ok
    master: [init] Executing runner select.minions... 
    master: ok
    master: [init] Executing runner select.minions... 
    master: ok
    master: Stage execution failed: 
    master:   - validate failed
    master: Error in sesdev deployment script trapped!
    master: +++ err_report 645
    master: +++ set +x
    master: => hostname: master
    master: => script: /tmp/vagrant-shell
    master: => line number: 645
    master: Bailing out!
==> master: Configuring proxy for Git...
=== Deployment Finished ===

You can login into the cluster with:

  $ sesdev ssh ses5-mini

Or, access openATTIC with:

  $ sesdev tunnel ses5-mini openattic

Fixes #

Description:


Checklist:

smithfarm commented 3 years ago

Changing the sntp timeout from 1 second to 5 seconds doesn't seem like it should do any harm.

FWIW I do not reproduce this behavior here: sesdev create ses5 --single-node completes successfully for me.

tserong commented 3 years ago

WTF? I just tried sesdev create ses5 --single-node and had the same problem (['master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds']), but if I login to the VM later and run sntp -t 1 127.0.0.1 it returns immediately just fine. Uh... Maybe the NTPD startup/sync is somehow laggy...? But that doesn't explain why @smithfarm didn't hit this.

Let me just do another couple of tests...

tserong commented 3 years ago

OK, so this thing is just freakish.

In my testing, if I run /usr/sbin/sntp -t 1 127.0.0.1 from a shell, it always returns immediately, e.g.:

# time /usr/sbin/sntp -t 1 127.0.0.1
sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
2021-06-18 12:29:18.893530 (+0000) +0.000032 +/- 0.142808 127.0.0.1 s3 no-leap

real    0m0.008s
user    0m0.007s
sys 0m0.000s

If run from salt (as part of the validate runner in stage 3), it usually fails:

# for i in $(seq 1 10) ; do DEV_ENV=true salt-run validate.pillar cluster=ceph 2>&1 |grep -q '^True' && echo "OK" || echo "failed" ; done
failed
failed
failed
failed
OK
failed
failed
failed
failed
failed

journalctl -u run at the same time shows an obvious lag of a couple of seconds in sntp:

Jun 18 12:32:39 master sntp[5298]: sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
Jun 18 12:32:39 master sntp[5298]: Can't open KOD db file /var/lib/ntp/kod for writing: Permission denied
Jun 18 12:32:41 master sntp[5298]: master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds

Now, wait a minute... See how that says "master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds"? I was testing the wrong thing by hitting 127.0.0.1, I should have tried it by hostname:

# time /usr/sbin/sntp -t 1 master.ses5-mini.test
sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds

real    0m1.787s
user    0m0.006s
sys 0m0.000s

Sure enough, that takes way longer. I'd be willing to bet this is actually a DNS problem (name resolution lag?).

I'm still surprised we never hit this before, and also haven't hit it on SES6, but I'm still happy to take the fix, and will apply the same to SES6 for completeness.

smithfarm commented 3 years ago

Yes, that sounds very much like name-resolution lag.