ceph / ceph-salt

Deploy Ceph clusters using cephadm
MIT License

Excessive timeout on failed time sync #241

Closed: swiftgist closed this issue 4 years ago

swiftgist commented 4 years ago

Ten minutes is too long to wait for a failure:

# time /tmp/chrony_sync_clock.sh 
200 OK
200 OK
try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
...
try: 59, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 60, refid: 00000000, correction: 0.000000000, skew: 0.000

real    10m5.624s
user    0m0.011s
sys     0m0.029s

I suggest changing the command to chronyc waitsync 6 0.04, or removing the waitsync entirely, since the preceding chronyc makestep can already fail on its own.
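For illustration, a minimal sketch of the suggested change as it might look inside /tmp/chrony_sync_clock.sh (the surrounding script contents are an assumption; only the waitsync arguments come from this suggestion):

# step the clock immediately; this can already fail if no source is reachable
chronyc makestep
# wait at most 6 tries (~1 minute at the default 10 s interval) for the
# remaining correction to drop below 0.04 seconds
chronyc waitsync 6 0.04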

smithfarm commented 4 years ago

May I ask why you think 10 minutes is too long? The waitsync has been known to take several minutes to complete successfully.

The purpose of syncing the clocks is two-fold:

  1. to ensure that the cluster does not come up in HEALTH_WARN due to clock skew
  2. to flag time sync issues before the cluster gets deployed

To me, 10 minutes does not seem like a very long time for a command that is expected to run only once. The argument for waiting up to 10 minutes is that we need to be fairly certain there really is a problem with the external time server specified by the user before failing ceph-salt apply because we think it doesn't work.

smithfarm commented 4 years ago

Maybe we can do some sanity checks on the external time server(s) specified by the user before triggering the clock sync script? Then if someone specifies e.g. "pool.npt.org" (i.e. a typo), we could fail much more quickly.
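A minimal sketch of such a pre-flight check, assuming the configured server names are available to the script (the server list and variable name here are hypothetical):

# fail fast if a configured time server does not even resolve in DNS
for server in pool.ntp.org; do          # placeholder list
    if ! getent hosts "$server" > /dev/null; then
        echo "ERROR: time server '$server' does not resolve" >&2
        exit 1
    fi
done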

swiftgist commented 4 years ago

May I ask why you think 10 minutes is too long? The waitsync has been known to take several minutes to complete successfully.

For the general case, ten minutes is a long time to wait for a failure, since many users will simply rerun ceph-salt, without investigating, to see whether the problem is temporary. Between the reruns, further investigation, and running the script directly, the additional time adds up.

Is your experience with the "known to take several minutes" only in the testing environment? In general, I would be surprised if chronyc makestep were insufficient.

swiftgist commented 4 years ago

Maybe we can do some sanity checks on the external time server(s) specified by the user before triggering the clock sync script? Then if someone specifies e.g. "pool.npt.org" (i.e. a typo), we could fail much more quickly.

For the most part, failing fast is what is needed. I honestly haven't figured out why exactly one minion out of 10 fails consistently, even with restarts of chronyd on both that minion and the admin node.

I don't know whether you would consider making the number of tries configurable, in case some systems genuinely need a while for chronyc waitsync. As you can see from the output above, no correction or skew was happening.
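To illustrate, a configurable try count could look something like this inside the sync script (the environment variable name is hypothetical):

# allow the deployment to override the number of waitsync tries
TRIES="${CHRONY_WAITSYNC_TRIES:-60}"    # default: 60 tries * 10 s interval = 10 minutes
chronyc waitsync "$TRIES" 0.04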

smithfarm commented 4 years ago

Is your experience with the "known to take several minutes" only in the testing environment?

I'm confused by this question: ceph-salt is not used in production anywhere, yet, so I'd call all our environments "testing".

What I have seen is that the wait-sync can take quite a long time if run (as we are doing here) soon after the chrony service is brought up for the first time. Once the chrony service has been running for a while, wait-sync typically becomes very fast. When chrony is in this very early state, the output from wait-sync is exactly what you posted:

try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
...

and, at some point (which can take minutes), it suddenly starts reporting real values instead of "0.000" and, soon after that, it syncs.

As to why your one minion out of ten is having trouble syncing, can you paste here the ceph-salt config commands you are using to configure time sync?

Since I'm using sesdev, the commands I'm using are:

ceph-salt config /time_server/server_hostname set FQDN_OF_MASTER_NODE
ceph-salt config /time_server/external_servers add pool.ntp.org

With this configuration, the master node syncs to pool.ntp.org first while all the other minions wait. Then all the remaining (non-master) minions sync to the master node, which goes very quickly since there is no network distance between them.

Originally I had "wait-sync 30" (a 5-minute timeout), but was forced to increase this to 60 (10 minutes) because the master node would still occasionally not sync within 5 minutes. After increasing the wait-sync timeout to 10 minutes, I have not seen a failure.
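For reference, chronyc waitsync takes the maximum number of tries as its first argument and by default waits 10 seconds between tries, so the two settings work out as:

chronyc waitsync 30    # up to 30 tries * 10 s = 5 minutes
chronyc waitsync 60    # up to 60 tries * 10 s = 10 minutes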

swiftgist commented 4 years ago

Is your experience with the "known to take several minutes" only in the testing environment?

I'm confused by this question: ceph-salt is not used in production anywhere, yet, so I'd call all our environments "testing".

My point is that this will be production code and that the priority should be production environments. Telling customers to sit out timeouts because of testing environments will not end well.

What I have seen is that the wait-sync can take quite a long time if run (as we are doing here) soon after the chrony service is brought up for the first time. Once the chrony service has been running for a while, wait-sync typically becomes very fast. When chrony is in this very early state, the output from wait-sync is exactly what you posted:

try: 1, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 2, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 3, refid: 00000000, correction: 0.000000000, skew: 0.000
try: 4, refid: 00000000, correction: 0.000000000, skew: 0.000
...

and, at some point (which can take minutes), it suddenly starts reporting real values instead of "0.000" and, soon after that, it syncs.

As to why your one minion out of ten is having trouble syncing, can you paste here the ceph-salt config commands you are using to configure time sync?

Since I'm using sesdev, the commands I'm using are:

ceph-salt config /time_server/server_hostname set FQDN_OF_MASTER_NODE
ceph-salt config /time_server/external_servers add pool.ntp.org

Mine is similar:

ceph-salt config /time_server/server_hostname set admin.ceph
ceph-salt config /time_server/external_servers add 0.pool.ntp.org

With this configuration, the master node syncs to pool.ntp.org first while all the other minions wait. Then all the remaining (non-master) minions sync to the master node, which goes very quickly since there is no network distance between them.

Originally I had "wait-sync 30" (a 5-minute timeout), but was forced to increase this to 60 (10 minutes) because the master node would still occasionally not sync within 5 minutes. After increasing the wait-sync timeout to 10 minutes, I have not seen a failure.

I am able to work around this in my environment. My point in creating the issue is that failures can happen where the waitsync still times out. When that happens, a user will just see the message that the shell script failed. What do you consider the workaround to be if a user or customer cannot get waitsync to succeed?

smithfarm commented 4 years ago

My point is that this will be production code and that the priority should be production environments.

I completely agree, and that's why we're implementing the code carefully and testing it intensively to maximize the probability that it won't fail in production. Though I wrote it once, I guess it's worth repeating:

"What I have seen is that the wait-sync can take quite a long time if run (as we are doing it here) soon after the chrony service is brought up for the first time."

Now, am I understanding correctly that you have a node where wait-sync never succeeds, no matter what you do? Is this the time server node itself (which is configured to get time from 0.pool.ntp.org) or one of the nodes that is configured to get its time from the time server node?

My point in creating the issue is that failures can happen where the waitsync still times out. When that happens, a user will just see the message that the shell script failed.

I'd like to learn more about these failures. Can they be reproduced, analyzed? If so, then we can code around them.

What do you consider the workaround to be if a user or customer cannot get waitsync to succeed?

Well, the time server configuration in ceph-salt is optional, so if -- despite our best efforts -- it turns out to be hopelessly broken in the user's environment, I'd say the workaround is to unconfigure it.
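For completeness, unconfiguring it should amount to something like the following in the config shell (I believe the /time_server node supports disable, but treat the exact verb as an assumption):

ceph-salt config /time_server disable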

ricardoasmarques commented 4 years ago

I wonder if time sync is taking so long because a "misconfigured" chronyd was already running before the ceph-salt apply execution, and ceph-salt apply didn't restart the service to reload the new /etc/chrony.conf written by ceph-salt.

If this was the case, the following PR fixes the issue: https://github.com/ceph/ceph-salt/pull/407
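In other words, the fix presumably boils down to restarting chronyd after ceph-salt writes its configuration, before the sync script runs; a minimal sketch of that ordering (assuming systemd manages the service):

# restart chronyd so it picks up the /etc/chrony.conf written by ceph-salt
systemctl restart chronyd.service
# then sync as before
chronyc makestep
chronyc waitsync 60 0.04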

Closing the issue for now; feel free to reopen if the problem persists.