bcpierce00 / unison

Unison file synchronizer

repeat mode lacks backoff #1009

Open · gdt opened 6 months ago

gdt commented 6 months ago

When there is a failure, -repeat retries. That's fine, but retrying every 10s forever is not ok. Just like mail delivery, repeat should have some kind of backoff. The exact scheme is not critical, but the rate of traffic and of remote log entries should approach a very small number per day.

I suggest either

I'm omitting as out of scope an enhancement of "trigger a single retry on notification that network has transitioned from not working to working". That's hard in the first place, and even harder to hook up portably.
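
For concreteness, here is a minimal sketch of the kind of capped exponential backoff being suggested, written in OCaml since that is Unison's implementation language. Everything here (the names, the parameter values, the `sync` callback) is illustrative, not actual Unison code:

```ocaml
(* Capped exponential backoff: double the delay after each consecutive
   failure, never exceed the cap, and reset to the base interval on
   success. [base_delay], [max_delay] and [sync] are hypothetical. *)
let base_delay = 10.   (* seconds; the current -repeat retry interval *)
let max_delay = 3600.  (* cap, so recovery is never delayed by more than an hour *)

let rec repeat_with_backoff attempt sync =
  match sync () with
  | Ok () -> repeat_with_backoff 0 sync   (* success resets the backoff *)
  | Error _ ->
      let delay = min max_delay (base_delay *. (2. ** float_of_int attempt)) in
      Unix.sleepf delay;
      repeat_with_backoff (attempt + 1) sync
```

With these numbers, a run of consecutive failures retries after 10s, 20s, 40s, and so on, topping out at one attempt per hour, and a single success snaps the interval back to 10s.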

tleedjarv commented 6 months ago

For the record, I don't think this is the right way to go. Why would backoff be needed, what problem is it solving?

I don't think there is a problem that requires backoff and I think introducing backoff in this scenario makes the software behave unacceptably and against user expectation.

What did the user want to achieve when running in repeat mode? I don't think it is anything like mail delivery, where the action is inherently asynchronous and:

a) you are possibly bombarding someone else's system, as one of hundreds or thousands of other senders;
b) you have no control over the remote end, and potentially none over the source either;
c) you are attempting hundreds or even thousands of separate actions (vs. only one);
d) the retries are possibly heavy on your resources and affect normal operations.

None of this applies to -repeat mode.

Here we're assuming the user is in full control of both the local and the remote end (I doubt there are many users syncing to a third-party remote service) and that there is a very small number of clients against one server. If a user set a specific sync interval, then that's presumably because they tolerate being out of sync within that timeframe. If that time suddenly (and silently) becomes hours or days longer due to a temporarily flaky connection, that is not right and not what the user would want.

It's even worse in fsmonitor mode. There the user has indicated that they want to sync constantly, as soon as changes happen. If a temporary connection drop suddenly causes an 8h pause in syncing, then I'd say the software has failed completely. Likewise, if the remote end was rebooted, intentionally or not, it could end up out of sync for 8h.

Thinking about constant near-immediate sync scenarios like "cloud" disk services, such as Dropbox, or Windows functionality such as Offline Files, I think it is absolutely unforgivable if temporary connection drops cause hours-long periods of being out of sync.

> it's important for software to be a good network citizen, and silently hammering a remote system with retries is ungood.

Except there is no hammering of anything if the connection itself fails (and I'd argue it's not "hammering" even if the connection succeeds). Also, it is not just any remote system, and definitely not a third-party system.

The current retry interval of 10 seconds is arbitrary. It could very well be 2 seconds or 5 seconds or 15 seconds. I don't think it can be much longer than that, perhaps 30 seconds max.
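
To make the numbers concrete: much of the disagreement reduces to the value of the cap. Here is a small illustrative snippet (assumed values, not anything in Unison) showing how a doubling schedule behaves under a tight cap versus a loose one:

```ocaml
(* Doubling from a 10s base: a 30s cap stays close to the current fixed
   interval, while a 3600s cap grows to hour-long gaps. Values are
   illustrative only. *)
let schedule ~cap n =
  List.init n (fun i -> min cap (10. *. (2. ** float_of_int i)))

let () =
  List.iter (fun d -> Printf.printf "%.0fs " d) (schedule ~cap:30. 6);
  print_newline ();   (* 10s 20s 30s 30s 30s 30s *)
  List.iter (fun d -> Printf.printf "%.0fs " d) (schedule ~cap:3600. 10);
  print_newline ()    (* 10s 20s 40s 80s 160s 320s 640s 1280s 2560s 3600s *)
```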

gdt commented 6 months ago

I meant backoff for when trying to initiate a connection and failing. I did not mean it to apply during operation where local unison is talking to remote unison.

When a connection is successful, the backoff state is reset, so a single drop leads to just a 10s pause. I mean backoff only for the case of repeated failures in a row.

If one is getting permission denied, or unison not found, or ICMP admin prohibited, then it isn't reasonable to keep re-attempting every 10s forever, even if each attempt just results in a log entry on the remote system.

Who the sysadmin of the 3rd-party system is, is unclear. It seems reasonably likely that one system is the user's computer at home and the other a computer at a university or workplace, which might have sysadmins who might be obligated to review access-control logs (e.g. under NIST 800-171).

Are you really saying that if 50 times in a row the 10s timer fires and every single "ssh remote unison" fails completely, this should continue every 10s?

tleedjarv commented 6 months ago

> Are you really saying that if 50 times in a row the 10s timer fires and every single "ssh remote unison" fails completely, this should continue every 10s?

Yes. But not because it is a smart thing to do; it's because I don't see how else to do it if we can't classify errors. A server restart can take several minutes, and during that time every single "ssh remote unison" will fail completely. A network outage could last several minutes, or even hours, yet I'd expect the sync to continue as soon as the network is back up.

If there is a better way then I'm all for it. I just think that backoff is contrary to user expectation and in extreme cases may make matters even worse.
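
In case that middle path is viable, here is a hedged sketch of the error classification mentioned above: retry failures that look transient (server restart, network outage) at the short fixed interval, and back off only on failures that look permanent. The categories and names are assumptions, not Unison's actual error handling:

```ocaml
(* Hypothetical classification: which failures deserve backoff? *)
type kind = Transient | Permanent

let classify = function
  | `Connection_refused | `Timeout | `Host_unreachable -> Transient
  | `Permission_denied | `Command_not_found -> Permanent

let next_delay ~attempt err =
  match classify err with
  | Transient -> 10.   (* keep the short interval: sync resumes promptly *)
  | Permanent -> min 3600. (10. *. (2. ** float_of_int attempt))
```

The hard part, as noted above, is that a connection failure during a server restart can look identical to one caused by a permanent misconfiguration, so any such classification is necessarily heuristic.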

gdt commented 6 months ago

In my experience, all network protocols have backoff on error; this is fundamental to the Internet's design and necessary to avoid congestion collapse. IMHO, as long as the recovery is more or less no longer than the outage, it's ok. Email delivery, as an example of something with a similar user expectation of immediacy, works exactly this way.
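
The "recovery no longer than the outage" criterion has a simple bound for a doubling schedule. With notation introduced here just for illustration (not from the thread): let the delay before retry $n{+}1$ be $d_n = d_0\,2^n$ with base interval $d_0$, so retry $n{+}1$ fires at

$$T_n = \sum_{k=0}^{n} d_k = d_0\,(2^{n+1} - 1).$$

If the outage ends at time $t$ with $T_{n-1} \le t < T_n$, sync resumes at $T_n$, and the extra lag is

$$T_n - t \;\le\; d_n = T_{n-1} + d_0 \;\le\; t + d_0,$$

i.e. with pure doubling the recovery delay never exceeds the outage duration plus one base interval, and adding a cap only tightens this.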