Excessive failures logged when using `-repeat`

acsysch commented 4 months ago

When syncing continuously using a cmd such as "Unison profile -repeat watch" and a fatal error occurs, unison 2.53.3 issues "restarting in 10 seconds" message and loops back to the fatal error. According to the documentation, one should then issue CTRL+C to exit, while previous versions used to exit in case of an error returning an exit code. This new behaviour isn't usable when using this cmd in a background batch (on Windows in my case). Could you either restore the previous auto-exit on error OR provide an additional switch on the cmdline in order to exit on error?

tleedjarv commented 4 months ago

This is by design. Repeat mode is not intended for use in scripts/batch. What would that even mean?

What is your intention with running in repeat mode (which by definition is non-exiting) in a batch script?

acsysch commented 4 months ago

Well, the intention of such batch is to keep in sync a couple of roots in the background. Of course, the sync should stop upon encountering an error.

In the case we use it for, the 2 roots are located on 2 different servers. The aforementioned batch monitors the connection and, upon sync-error, returns to a waiting loop for the remote server to be back online.

I don’t see any contradiction in using repeat-mode (i.e. a loop) in a batch as long as I can exit that mode on error. And that worked with version 2.53.0. Its manual even stated in chapter 6.10 "The exit status could be useful when Unison is invoked from a script"!

gdt commented 4 months ago

This seems better suited for the mailinglist.

tleedjarv commented 4 months ago

You are welcome to clarify the request in the mailing list but I really don't think you need one.

The aforementioned batch monitors the connection and, upon sync-error, returns to a waiting loop for the remote server to be back online.

It sounds to me that you are using the script to achieve something that is exactly what the repeat mode will do for you. It will continue automatically when the remote end is back online.

I don’t see any contradiction in using repeat-mode (i.e. a loop) in a batch as long as I can exit that mode on error.

Repeat mode is a never-ending process. You may run it under some process monitor (or init system) that monitors if the process itself is alive and restarts it when necessary.

What you're asking for is exactly like running a Dropbox client (or any other cloud drive client) in a script when in fact the client is start-once never-exit process that will be resilient to connection drops. Or any other permanently running client for that matter. You don't need to monitor the connection yourself, you may only want to monitor if the process has died and restart it.

Its manual even stated in chapter 6.10 "The exit status could be useful when Unison is invoked from a script"!

Right, invoke it such that you want it to exit immediately, which repeat mode is not intended to do (but it will, of course, in case it can't continue for some reason).

acsysch commented 4 months ago

Ok, I can agree with you on that matter. But while the watcher runs using the –log parameter and the remote server being offline, unison fills the log with unnecessary information about a fatal error (whereas version 2.53.0 stopped the watch with error code 3):

Unison 2.53.3 (ocaml 4.14.0) log started at 2024-02-09 at 11:16:54

Roots:

//Terre/downloads

//saturne/downloads

Fatal error: Cannot find canonical name of root directory //saturne/downloads

(//saturne/downloads: No such file or directory)

Unison 2.53.3 (ocaml 4.14.0) log started at 2024-02-09 at 11:17:04

Roots:

//Terre/downloads

//saturne/downloads

Fatal error: Cannot find canonical name of root directory //saturne/downloads

(//saturne/downloads: No such file or directory)

Unison 2.53.3 (ocaml 4.14.0) log started at 2024-02-09 at 11:17:33

Roots:

//Terre/downloads

//saturne/downloads

Fatal error: Cannot find canonical name of root directory //saturne/downloads

(//saturne/downloads: No such file or directory)

So there should be either a way to stop the watch on error from a script OR to stop unison from filling the log

Regards

Dominique

De : Tõivo Leedjärv @.*** Envoyé : jeudi, 15 février 2024 15:34 À : bcpierce00/unison Cc : acsysch; Author Objet : Re: [bcpierce00/unison] Switch required to exit textual interface on error when syncing continuously (Issue #996)

You are welcome to clarify the request in the mailing list but I really don't think you need one.

The aforementioned batch monitors the connection and, upon sync-error, returns to a waiting loop for the remote server to be back online.

It sounds to me that you are using the script to achieve something that is exactly what the repeat mode will do for you. It will continue automatically when the remote end is back online.

I don’t see any contradiction in using repeat-mode (i.e. a loop) in a batch as long as I can exit that mode on error.

Repeat mode is a never-ending process. You may run it under some process monitor (or init system) that monitors if the process itself is alive and restarts it when necessary.

What you're asking for is exactly like running a Dropbox client (or any other cloud drive client) in a script when in fact the client is start-once never-exit process that will be resilient to connection drops. Or any other permanently running client for that matter. You don't need to monitor the connection yourself, you may only want to monitor if the process has died and restart it.

Its manual even stated in chapter 6.10 "The exit status could be useful when Unison is invoked from a script"!

Right, invoke it such that you want it to exit immediately, which repeat mode is not intended to do (but it will, of course, in case it can't continue for some reason).

— Reply to this email directly, view it on GitHub https://github.com/bcpierce00/unison/issues/996#issuecomment-1946219999 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AXFY6VK2TEA7E6RHMZN6Q7DYTYMGNAVCNFSM6AAAAABDIWZOPKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBWGIYTSOJZHE . You are receiving this because you authored the thread. https://github.com/notifications/beacon/AXFY6VMBMDQNUYG6UV2VPG3YTYMGNA5CNFSM6AAAAABDIWZOPKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTUAD256.gif Message ID: @.***>

tleedjarv commented 4 months ago

unison fills the log with unnecessary information about a fatal error

I agree, this is something that should be fixed.

gdt commented 4 months ago

Sounds like we should have a new issue for exiting on fatal errors.

tleedjarv commented 4 months ago

Sounds like we should have a new issue for exiting on fatal errors.

Ah, no, that's not it. It's about spamming the log about "fatal" errors when in fact the error is not fatal and the process continues. For example, when the connection drops, in repeat mode it just keeps trying to reconnect.

acsysch commented 4 months ago

That's correct: when a connection drops, the error is marked as fatal.

Earlier v2.53.0 used to exit. V2.53.3 loops on such fatal error instead, resulting in filling the log.

I see 3 possible solutions:

continue to exit on fatal error as in V2.53.0
identify a connection drop as non-fatal and continue watch loop silently without logging
log error on 1st occurrence only

De : Tõivo Leedjärv @.*** Envoyé : lundi, 19 février 2024 21:05 À : bcpierce00/unison Cc : acsysch; Author Objet : Re: [bcpierce00/unison] Switch required to exit textual interface on error when syncing continuously (Issue #996)

Sounds like we should have a new issue for exiting on fatal errors.

Ah, no, that's not it. It's about spamming the log about "fatal" errors when in fact the error is not fatal and the process continues. For example, when the connection drops, in repeat mode it just keeps trying to reconnect.

— Reply to this email directly, view it on GitHub https://github.com/bcpierce00/unison/issues/996#issuecomment-1953095263 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AXFY6VLRK2QJTR7IZXJPKLTYUOV5RAVCNFSM6AAAAABDIWZOPKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJTGA4TKMRWGM . You are receiving this because you authored the thread. https://github.com/notifications/beacon/AXFY6VJHMPWHNWZJFJJFCPTYUOV5RA5CNFSM6AAAAABDIWZOPKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTTUNHPF6.gif Message ID: @.***>

gdt commented 4 months ago

What is fatal in a single sync may not be in repeat. This needs some thought about language. I don't want to suppress diagnostics because you don't want them in the log. failures are failures. If there is too much, we can be terser. More likely, we should have a backoff so that the reconnect attempts become less frequent, up to some reasonable limit, to avoid spamming the remote system. Retrying every 10s and just not logging is not IMHO reasonable.

tleedjarv commented 4 months ago

I'm currently thinking about a solution which would silence repeat errors and be explicit about it. So the user knows that the error condition persists until something else is logged (or the process exits).

tleedjarv commented 3 months ago

@acsysch if you are able to, you may want to try a build from https://github.com/tleedjarv/unison/actions/runs/8377886894 (scroll to bottom) which is 2.53.4 + some patches. The patches are intended to make the logging more reasonable in case of automatic continuous retrying. The log should no longer be spammed with repeated errors, but the UI will still print out all errors on stdout. It would be great if you can test this build and give feedback on its usefulness in your situation.

gdt commented 3 months ago

@acsysch Please don't paste in github notify mails. The quoting is unreadable and there is junk. As it is, github is web only and the emails are just to let you know something happened. I don't really like it, but that's how it is.

I think there are two issues lurking here, related to "what's fatal for any given connection is not fatal for a repeating sync":

we need exponential backoff (roughly) for reconnect attempts, because retrying every 10s is not reasonable to the network and the remote system, in case of persistent problems or misconfiguration
when things fail, we should cache the error and if the next sync fails with that error without ever having connected, suppress it, after logging that errors are being suppressed. Probably, log one very 4h anyway.

I very much lean to fixing the retry behavior before merging logging reduction; it's important for software to be a good network citizen, and silently hammering a remote system with retries is ungood.

What was the actual problem that led to the errors? I think it's helpful to understand that.

gdt commented 3 months ago

I have split the "exponential backoff is missing" to #1009

tleedjarv commented 3 months ago

we need exponential backoff (roughly) for reconnect attempts, because retrying every 10s is not reasonable to the network and the remote system, in case of persistent problems or misconfiguration

Yes, it is not reasonable in case of persistent problems or misconfiguration. But we can't really detect if something is persistent or a misconfiguration. The problem could be at the local end (network issue, ssh (temporarily) fails, disk full, ...) or at the remote end (network issue, sshd not (yet) started, unison binary (temporarily) not found, disk full, ...) and there is no way of saying "we can just stop retrying now, maybe continue tomorrow".

when things fail, we should cache the error and if the next sync fails with that error without ever having connected, suppress it, after logging that errors are being suppressed.

This is what my proposed patch does.

acsysch commented 3 months ago

Hello, I just tested V2.53.4 patched (https://github.com/tleedjarv/unison/actions/runs/8377886894) which looks good as far as the log file is concerned. I agree that retrying every 10 secs in case of failure should at least be configurable via an additional parameter. But I would even go further by adding the option to exit in such case, allowing to identify potential causes by some external means, e.g. within a script. This was in fact the way v2.53.2 worked and was fine for me.

tleedjarv commented 3 months ago

Thank you for testing, I will go ahead an open a PR with this patch.

But I would even go further by adding the option to exit in such case, allowing to identify potential causes by some external means, e.g. within a script. This was in fact the way v2.53.2 worked and was fine for me.

How would this be any different or improve anything?

First, the vast-vast majority of users would not/could not/should not write their own script. They expect the software to do the right thing. In your case, you only had to write the script because the software wasn't doing the right thing.

Second, how would the external means be any better equipped to deal with errors? These means have the exact same problem (likely even worse) of detecting what the error is, understanding what the error condition means and how to resolve it. Instead of Unison retrying every 10 seconds, you would have the script retry every 10 seconds.

Third, you don't have to use the -repeat mode. You can implement the repeating and retry logic by the same external means (which you have already done anyway, as I understand). You have all the choices already today.

To be clear, I'm not completely against making changes in the implementation. Just, before doing that, I would like to know how this would be an improvement and I've yet to hear a single compelling argument.

acsysch commented 3 months ago

You're probably right in saying "users expect the software to do the right thing". Yet some others expect to have a better control over what their script is doing (when using a script). In my case, I know in advance that the remote system isn't always online, so I check its status every 10 minutes before restarting unison which -repeats while the remote system is online. With the new version, unison never ends when using -repeat and, in fact, there's no way to end it except by killing the process (e.g. in case of a lock) which in my view is not the right way. That's why an option to exit in case of a failure looks as an improvement to me, still letting the majority of users not using this feature.

Alternatively, in my use-case, having a parameter to set the retry period in case of failure would have an equivalent value, provided the log would be more explicit with some error code other that just

Retrying continuously. Silencing repeating error messages

V2.53.2 returned error codes 1 to 3 to specify the error type. This error code could be in the log.

So the choice is yours which way you want to go: option to either set the retry period (and adding an error code to the log) OR option to exit in case of a failure. You could of course also to implement neither of the 2 options.

gdt commented 3 months ago

Arguably if unison is being run as a daemon (which -repeat when not invoked by a human at a shell is), then it should be able to write its pid someplace and take a SIGTERM to gracefully exit, and you should be doing that to stop, just as you started.

(IMHO the real problem is that every 10s is way too aggressive to be a reasonable network citizen and I think we need to fix that first; everything else is far less important.)

tleedjarv commented 3 months ago

Correcting some factual errors that have been stated: SIGTERM works, as does SIGINT. SIGUSR2 can be used for even more graceful shutdown.

then it should be able to write its pid someplace

Isn't that the job of your service manager?

tleedjarv commented 3 months ago

@acsysch I obviously don't understand your usage of -repeat. Could you please provide some more details of how your script works? Do you use repeat with a time interval or with the fsmonitor (-repeat watch)? Does the script do anything else or is it just checking the exit code and then checking if the remote end is alive? What does it do with the different exit code values? (note that if you use -repeat then there can realistically only ever be one exit code value)

the log would be more explicit with some error code other that just

Retrying continuously. Silencing repeating error messages

The error message is in the log (that's the last message you see before the silencing). All unique error messages are still logged. There are no error codes, but I think you meant exit code? But there is one single exit code for all errors anyway, so the exit code is completely useless in this scenario.

gdt commented 3 months ago

In the BSD tradition, daemon programs write their pid to /var/run/foo.pid and unlink that on exiting. There isn't really a "service manager" which sounds like a euphemism for systemed. But a service manager needs to get the pid from the daemon somehow; using ps is exactly what is avoided by storing it. However, this is not really the problem as I think that @acsysch could send a signal.

It seems SIGTERM is appropriate if the network has failed, vs waiting for the last sync to finish.

tleedjarv commented 3 months ago

Yes, SIGTERM is the most appropriate when the retry loop is executing. Even SIGKILL would be safe (but there's a small risk that the connection was just restored by the time the signal is sent).

No, "service manager" was not meant as a reference to systemd specifically, but yes, that is one example. The other example could be SMF from Solaris, and there is something similar in Windows-land but I'm not familiar with that.

acsysch commented 3 months ago

I'm using a windows system and the only way to stop a running program is to get its PID and kill the process.

acsysch commented 3 months ago

@acsysch Do you use repeat with a time interval or with the fsmonitor (-repeat watch)? Does the script do anything else or is it just checking the exit code and then checking if the remote end is alive? What does it do with the different exit code values? (note that if you use -repeat then there can realistically only ever be one exit code value)

Yes, I'm using -repeat watch (using v2.53.2). In either case, I enter my loop waiting for the remote system's availability after checking whether there's a filelock (existence of a lk*-file).

gdt commented 3 months ago

I'm declaring this issue as an enhancement request to somehow reduce the amount of log messages -- while also not hiding from users that failures are happening at a rate I consider excessive. That likely means it is blocked until backoff is implemented. Neither of the people actively contributing to unison intend to spend time on this, so I've marked it wontfix.

(Talking about how to start/stop unison on network coming and going seems best suited for the user's list.)

tleedjarv commented 3 months ago

Heh, I just opened a PR :)

gdt commented 2 months ago

After a bunch of discussion, I think we've arrived at:

repeat mode needs backoff (#1009)
repeat mode raises a lot of really hard issues, in that one can't tell if a failure is likely to persist or not and I have arrived at
I am not comfortable hiding failures from users while continuing to make them happen

so for now, I don't want to merge any "reduce duplicate logs", and instead would like to see progress on fundamentals (dealing with errors better) and rate control (exponential backoff). We can then see what makes sense for logging, but I am guessing I am going to think it's ok.

We've had a good discussion that's been a bit far ranging - but it has really helped clarify thinking. I'm feeling like it's time to close this issue, essentially saying that the bug is the rate of failed attempts, not that they are logged.

gdt commented 2 months ago

2.53.5 stops retry, and #1009 remains open. It's unclear if we will have any of a fix for #1009, some kind of sorting errors into persistent/not, and re-enabling. But I've decided that I don't want to suppress errors, standing by my comment from 26 April.

bcpierce00 / unison

Excessive failures logged when using `-repeat` #996