Closed p-se closed 3 years ago
Changing the sntp timeout from 1 second to 5 seconds doesn't seem like it should do any harm.
FWIW I do not reproduce this behavior here: sesdev create ses5 --single-node
completes successfully for me.
WTF? I just tried sesdev create ses5 --single-node
and had the same problem (['master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds']
), but if I login to the VM later and run sntp -t 1 127.0.0.1
it returns immediately just fine. Uh... Maybe the NTPD startup/sync is somehow laggy...? But that doesn't explain why @smithfarm didn't hit this.
Let me just do another couple of tests...
OK, so this thing is just freakish.
In my testing, if I run /usr/sbin/sntp -t 1 127.0.0.1
from a shell, it always returns immediately, e.g.:
# time /usr/sbin/sntp -t 1 127.0.0.1
sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
2021-06-18 12:29:18.893530 (+0000) +0.000032 +/- 0.142808 127.0.0.1 s3 no-leap
real 0m0.008s
user 0m0.007s
sys 0m0.000s
If run from salt (as part of the validate runner in stage 3), it usually fails:
# for i in $(seq 1 10) ; do DEV_ENV=true salt-run validate.pillar cluster=ceph 2>&1 |grep -q '^True' && echo "OK" || echo "failed" ; done
failed
failed
failed
failed
OK
failed
failed
failed
failed
failed
journalctl -u
run at the same time shows an obvious lag of a couple of seconds in sntp:
Jun 18 12:32:39 master sntp[5298]: sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
Jun 18 12:32:39 master sntp[5298]: Can't open KOD db file /var/lib/ntp/kod for writing: Permission denied
Jun 18 12:32:41 master sntp[5298]: master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds
Now, wait a minute... See how that says "master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds"? I was testing the wrong thing by hitting 127.0.0.1, I should have tried it by hostname:
# time /usr/sbin/sntp -t 1 master.ses5-mini.test
sntp 4.2.8p15@1.3728-o Thu Jun 25 14:17:23 UTC 2020 (1)
master.ses5-mini.test 127.0.1.1 no UCST response after 1 seconds
real 0m1.787s
user 0m0.006s
sys 0m0.000s
Sure enough, that takes way longer. I'd be willing to bet this is actually a DNS problem (name resolution lag?).
I'm still surprised we never hit this before, and also haven't hit it on SES6, but I'm still happy to take the fix, and will apply the same to SES6 for completeness.
Yes, that sounds very much like name-resolution lag.
Although this code seems to have been working for about 5 years, it doesn't do so anymore for me. I'm not sure what happened, but a second is insufficient now, and that reproducibly.
When changing the time to wait for a response up to two seconds, it works.
When the
-t <int>
argument is removed, a default of 5 seconds is used, so I decided to go with that.Signed-off-by: Patrick Seidensal pseidensal@suse.com
So even if it turns out to be a local issue and we won't merge this change (or possibly won't create another DeepSea package to be released for SES 5), it may serve me to look the issue up if I forgot how to circumvent it or might serve as a documentation for others, who might run into it.
Note that I'm currently unable to simply
sesdev create ses5 --single-node
an SES5 cluster. At stage 3 I've got to ssh into the machine and adapt DeepSeas' code as contained in this PR, just to be able to rerun stage 3 manually, followed by having to run stage 4 manually.The problem did not occur when deploying (single-node) SES6 or SES7.
The command to deploy was:
Fixes #
Description:
Checklist: