Closed: swiftgist closed this issue 4 years ago.
May I ask why you think 10 minutes too long? The waitsync has been known to take several minutes to complete successfully.
The purpose of syncing the clocks is two-fold:
To me, 10 minutes does not seem like a very long time for a command that is expected to run only once. The argument for waiting up to 10 minutes is that we need to be fairly certain that there really is a problem with the external time server specified by the user before failing ceph-salt apply because we think it doesn't work.
Maybe we can do some sanity checks on the external time server(s) specified by the user, before triggering the clock sync script? Then if someone specifies e.g. "pool.npt.org" (i.e. a typo), we could fail much quicker.
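A pre-flight sanity check along those lines could catch a typo like "pool.npt.org" in seconds instead of minutes. The sketch below is a hypothetical helper, not part of ceph-salt; it only verifies that the hostname resolves at all, it does not confirm that an NTP service is actually listening there.

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves via DNS (or /etc/hosts).

    A cheap fail-fast check for typos such as "pool.npt.org";
    it does NOT verify that an NTP server is actually answering.
    """
    try:
        socket.getaddrinfo(hostname, 123)  # 123 is the NTP port
        return True
    except socket.gaierror:
        return False
```

Running this against each configured external server before triggering the clock-sync script would let ceph-salt apply fail immediately with a clear "hostname does not resolve" message. (The ".invalid" TLD is reserved by RFC 2606 and is guaranteed never to resolve, which makes it handy for testing the failure path.)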
May I ask why you think 10 minutes too long? The waitsync has been known to take several minutes to complete successfully.
For the general case, ten minutes is a long time to wait for a failure, when many users may simply rerun ceph-salt without investigating whether the problem is temporary. With further investigation and running the script directly, the additional runs add up.
Is your experience with the "known to take several minutes" only in the testing environment? I would be surprised if, in general, chronyc makestep were insufficient.
Maybe we can do some sanity checks on the external time server(s) specified by the user, before triggering the clock sync script? Then if someone specifies e.g. "pool.npt.org" (i.e. a typo), we could fail much quicker.
For the most part, the fail-fast is what's needed. I honestly haven't figured out why exactly one minion out of 10 fails consistently, even with restarts of chronyd on both that minion and the admin node. I don't know if you would consider a configurable number of tries, in case some systems require a while for chronyc waitsync. As you can see from the output, no correction or skew was happening.
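The "configurable tries" idea boils down to wrapping the sync check in a retry loop whose try count and interval come from configuration instead of being hard-coded. A minimal sketch, under my own assumptions and not ceph-salt code, where `check` stands in for whatever probes chronyc:

```python
import time

def wait_for_sync(check, max_tries=30, interval=10.0):
    """Call `check` up to `max_tries` times, `interval` seconds apart.

    Returns True as soon as `check()` succeeds, False once every try
    has failed. With chrony, `check` would wrap something like a
    `chronyc waitsync 1` invocation; here it is any zero-argument
    callable, so the retry policy stays testable on its own.
    """
    for attempt in range(1, max_tries + 1):
        if check():
            return True
        if attempt < max_tries:
            time.sleep(interval)
    return False
```

Exposing `max_tries` (and perhaps `interval`) as a ceph-salt config option would let environments that need the long wait keep it, while everyone else fails fast.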
Is your experience with the "known to take several minutes" only in the testing environment?
I'm confused by this question: ceph-salt is not used in production anywhere, yet, so I'd call all our environments "testing".
What I have seen is that the wait-sync can take quite a long time if run (as we are doing it here) soon after the chrony service is brought up for the first time. Once the chrony service has been running for a while, wait-sync typically becomes very fast. When it is in this very early state, the output from wait-sync is exactly what you posted:
try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
...
and, at some point (can even take minutes), it suddenly starts reporting real values instead of "0.000" and, soon after that, it syncs.
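That "all zeros" state can be detected mechanically: while chronyd has not yet selected a source, waitsync reports refid 00000000. A small parser (my sketch, not how ceph-salt actually does it) could watch the output and report whether chrony has started tracking yet:

```python
import re

# Matches lines like:
#   try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
WAITSYNC_LINE = re.compile(
    r"try:\s*(\d+),\s*refid:\s*([0-9A-Fa-f]+),"
    r"\s*correction:\s*([-\d.]+),\s*skew:\s*([\d.]+)"
)

def is_tracking(line: str) -> bool:
    """True once chronyd reports a real reference ID.

    Before chronyd selects a time source, waitsync prints
    refid 00000000 with zero correction and skew, exactly as
    in the output pasted above.
    """
    m = WAITSYNC_LINE.match(line.strip())
    return bool(m) and m.group(2) != "00000000"
```

Logging the transition from all-zero lines to real values would at least make the "still waiting on chronyd to pick a source" case distinguishable from a genuine sync failure.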
As to why your one minion out of ten is having trouble syncing, can you paste here the ceph-salt config commands you are using to configure time sync?
Since I'm using sesdev, the commands I'm using are:
ceph-salt config /time_server/server_hostname set FQDN_OF_MASTER_NODE
ceph-salt config /time_server/external_servers add pool.ntp.org
With this configuration, the master node syncs to pool.ntp.org first while all the other minions wait. Then all the remaining (non-master) minions sync to the master node, which goes very quickly since there is no network distance between them.
Originally I had "wait-sync 30" (5 minute timeout), but was forced to increase this to 60 (10 minutes) because the master node would still occasionally not sync within 5 minutes. After increasing the wait-sync timeout to 10 minutes, I have not seen a failure.
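For reference on those numbers: waitsync's first argument is a try count, not seconds, and chronyc polls every 10 seconds by default. Under that assumption (the default interval), the timeouts mentioned above work out as:

```python
def waitsync_timeout_seconds(max_tries: int, interval: float = 10.0) -> float:
    """Worst-case wall time for `chronyc waitsync <max_tries>`.

    chronyc checks once per `interval` seconds (10 s by default),
    so the effective timeout is simply tries x interval.
    """
    return max_tries * interval

# waitsync 30 -> 300 s (the original 5-minute timeout)
# waitsync 60 -> 600 s (the current 10-minute timeout)
```

This is also why the issue's suggested "waitsync 6" would cap the wait at roughly one minute.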
Is your experience with the "known to take several minutes" only in the testing environment?
I'm confused by this question: ceph-salt is not used in production anywhere, yet, so I'd call all our environments "testing".
My point is that this will be production code and that the priority should be production environments. Telling customers to sit out timeouts because of testing environments will not end well.
What I have seen is that the wait-sync can take quite a long time if run (as we are doing it here) soon after the chrony service is brought up for the first time. Once the chrony service has been running for a while, wait-sync typically becomes very fast. When it is in this very early state, the output from wait-sync is exactly what you posted:
try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
...
and, at some point (can even take minutes), it suddenly starts reporting real values instead of "0.000" and, soon after that, it syncs.
As to why your one minion out of ten is having trouble syncing, can you paste here the ceph-salt config commands you are using to configure time sync?

Since I'm using sesdev, the commands I'm using are:

ceph-salt config /time_server/server_hostname set FQDN_OF_MASTER_NODE
ceph-salt config /time_server/external_servers add pool.ntp.org
Mine is similar:
ceph-salt config /time_server/server_hostname set admin.ceph
ceph-salt config /time_server/external_servers add 0.pool.ntp.org
With this configuration, the master node syncs to pool.ntp.org first while all the other minions wait. Then all the remaining (non-master) minions sync to the master node, which goes very quickly since there is no network distance between them.
Originally I had "wait-sync 30" (5 minute timeout), but was forced to increase this to 60 (10 minutes) because the master node would still occasionally not sync within 5 minutes. After increasing the wait-sync timeout to 10 minutes, I have not seen a failure.
I am able to work around this in my environment. My point in creating the issue is that failures can happen where the waitsync still times out. When that happens, a user will just see the message that the shell script failed. What do you consider the workaround to be if a user or customer cannot get waitsync to succeed?
My point is that this will be production code and that the priority should be production environments.
I completely agree, and that's why we're implementing the code carefully and testing it intensively to maximize the probability that it won't fail in production. Though I wrote it once, I guess it's worth repeating:
"What I have seen is that the wait-sync can take quite a long time if run (as we are doing it here) soon after the chrony service is brought up for the first time."
Now, am I understanding correctly that you have a node where wait-sync never succeeds, no matter what you do? Is this the time server node itself (which is configured to get time from 0.pool.ntp.org) or one of the nodes that is configured to get its time from the time server node?
My point in creating the issue is that failures can happen where the waitsync still times out. When that happens, a user will just see the message that the shell script failed.
I'd like to learn more about these failures. Can they be reproduced, analyzed? If so, then we can code around them.
What do you consider the workaround to be if a user or customer cannot get waitsync to succeed?
Well, the time server configuration in ceph-salt is optional, so if -- despite our best efforts -- it turns out to be hopelessly broken in the user's environment, I'd say the workaround is to unconfigure it.
I wonder if time sync is taking so long because a "misconfigured" chronyd was already running before the ceph-salt apply execution, and ceph-salt apply didn't restart the service to reload the new /etc/chrony.conf written by ceph-salt.
If this was the case, the following PR fixes the issue: https://github.com/ceph/ceph-salt/pull/407
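The general shape of that fix, restarting chronyd only when the config ceph-salt writes actually differs from what is already on disk, could be sketched like this (my illustration, not the PR's code; the function name is hypothetical):

```python
import hashlib
from pathlib import Path

def config_changed(path: str, new_contents: str) -> bool:
    """True if `new_contents` differs from the file at `path`.

    A deployment tool can use this to decide whether chronyd needs
    a restart after /etc/chrony.conf is rewritten: if the file is
    unchanged, the running daemon is already using the current
    configuration and a restart would only delay resyncing.
    """
    p = Path(path)
    if not p.exists():
        return True  # first write: the service must (re)load it
    old = hashlib.sha256(p.read_bytes()).hexdigest()
    new = hashlib.sha256(new_contents.encode()).hexdigest()
    return old != new
```

Without such a check, a pre-existing chronyd keeps polling the old (possibly unreachable) servers while waitsync counts down, which matches the all-zero output seen above.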
Closing the issue for now, feel free to reopen if problem persists.
Ten minutes is too long for a failure
I suggest setting the command to chronyc waitsync 6 0.04, or removing the waitsync entirely, since the preceding chronyc makestep can already fail on its own.