Open AdamWill opened 1 month ago
For context, see https://bugzilla.redhat.com/show_bug.cgi?id=2316066 . In Fedora 41 validation testing we found a problem where live images sometimes weren't working properly for the first minute or so after boot. We worked out that this is associated with a system time jump that often happens during boot of live images on systems with the hardware clock set to UTC. Our live images assume the hardware clock is set to localtime and the system timezone is US Eastern, so they set the system clock accordingly; when chrony kicks in during boot and finds the real current time, this can cause a system clock jump of several hours. If we adjust things so no clock jump occurs, the bug doesn't happen.
Steve tracked it down to the org.gnome.SettingsDaemon.PrintNotifications.service user service timing out on startup, which blocked some other things from starting up properly too, especially xdg-desktop-portal-gnome.service, which is likely the source of the observed problems. I then noticed the backtrace of that service (after systemd kills it for exceeding its startup timeout) shows it apparently waiting for a response from cupsGetDests2, which calls cups_enum_dests, so I went to look at that codepath and found this, which looks a lot like it could cause the problem we're observing.
I built a live image with a scratch build of our cups package with this patch applied (backported to 2.4.11), and in some initial testing by Steve it seems to have resolved the problem.
I do note there are a lot of other calls to gettimeofday in CUPS, some of which do seem to be used for 'how much time has elapsed' comparisons like this one, so there may be other bugs like this lurking. But I didn't want to spread this change out too far initially.
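To make the failure mode concrete, here is a minimal sketch of the 'how much time has elapsed' pattern built on wall-clock samples. The helper names (wall_elapsed_ms, wall_timed_out) are hypothetical and are not taken from the CUPS source; the four-hour step is only an illustrative value for the kind of jump described above.

```c
#include <sys/time.h>

/* Milliseconds between two wall-clock samples, as gettimeofday() would
 * provide them. Hypothetical helper, not CUPS code. */
static long wall_elapsed_ms(const struct timeval *start,
                            const struct timeval *now)
{
  return ((long)(now->tv_sec - start->tv_sec) * 1000L +
          (long)(now->tv_usec - start->tv_usec) / 1000L);
}

/* Typical "have we waited long enough?" check built on that difference. */
static int wall_timed_out(const struct timeval *start,
                          const struct timeval *now, long timeout_ms)
{
  return (wall_elapsed_ms(start, now) >= timeout_ms);
}
```

If the system clock is stepped backwards by several hours between the two samples, wall_elapsed_ms goes negative and wall_timed_out keeps returning 0 until the wall clock catches up with the original start time. A forward step has the opposite effect: the timeout fires immediately.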
Yes, this needs cleanup and we need to determine the proper functions to use on Windows for the same purpose (if the current code isn't suitable). I also want to check on how long this has been supported as some of these POSIX functions are "standard" but with loads of optional bits - I have a vague memory of Solaris not supporting the monotonic clock, among others, so we'll need to research this a bit more before we "pull the trigger"...
Roger, those all sound like excellent improvements. I'll try and get back and work on this once we're through Fedora 41 final freeze (which is where this came from). FWIW though I'd say this is a bug fix, not an enhancement. To be clear, I found this as the cause of a fairly visible bug in Fedora - https://bugzilla.redhat.com/show_bug.cgi?id=2316066 . It was causing GNOME's print notifications daemon to get blocked on startup until it timed out and systemd killed it; during that time, various other things didn't work, because some other necessary stuff doesn't get started up till the print notifications daemon is running...
From https://github.com/fish-shell/fish-shell/issues/6440 it seems like macOS is probably OK since 10.12, and that's a long way out of support lifetime now.
https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-gettickcount64 seems like possibly the best option on Windows.
Googling around suggests CLOCK_MONOTONIC does exist on Solaris (as an alias to CLOCK_HIGHRES - see https://github.com/vim/vim/issues/10647 ), but does not exist on HP-UX - https://github.com/vim/vim/issues/10873 .
https://github.com/nmap/nmap/issues/180 has even more icky details, including some differences in CLOCK_MONOTONIC behaviour across platforms.
Various people who've run into this mess before suggest things like falling back to gettimeofday or setitimer/getitimer. There's a library at https://github.com/ThomasHabets/monotonic_clock which tries to handle various platforms but only has tested implementations for Linux, OpenBSD and Solaris, plus apparently untested mach and Windows support.
In summary I don't know what to do! Maybe get/setitimer would be viable here? I dunno.
@AdamWill What I've pushed thus far uses GetTickCount64 on Windows, clock_gettime(CLOCK_MONOTONIC, ...) when supported, with gettimeofday as a backup. Since modern platforms do support CLOCK_MONOTONIC and we don't have testing/integration resources for legacy Unix platforms, using interval timers or other platform-specific code would be problematic.
The new proposed API is a simple "elapsed time in seconds" API with a defined resolution of at least 10ms:
double cupsGetClock(void);
The use of "double" allows for 53 bits of precision and effectively unlimited run time support. The implementation bases the return values on the first call to cupsGetClock, so the first call returns 0.0, calling 42 seconds later would return approximately 42.0, etc.
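For readers who want to see the shape of such a function, here is a rough sketch of the behaviour described in this comment. It is not the code from the commit: sketch_get_clock and raw_seconds are hypothetical names, the clock source is picked at compile time only (an assumption), and the static basis is not thread-safe.

```c
#ifdef _WIN32
#  include <windows.h>
#else
#  include <sys/time.h>
#  include <time.h>
#endif

/* Raw seconds from the best clock selected at compile time. */
static double raw_seconds(void)
{
#ifdef _WIN32
  /* Milliseconds since system start; not affected by wall-clock changes. */
  return ((double)GetTickCount64() / 1000.0);
#elif defined(CLOCK_MONOTONIC)
  struct timespec ts;

  /* Return value ignored for brevity; real code should check it. */
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ((double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec);
#else
  /* Backup only: this clock can still jump when the time is reset. */
  struct timeval tv;

  gettimeofday(&tv, NULL);
  return ((double)tv.tv_sec + 1e-6 * (double)tv.tv_usec);
#endif
}

/* Hypothetical stand-in for cupsGetClock(): seconds since the first call.
 * Not thread-safe as written. */
static double sketch_get_clock(void)
{
  static int    have_basis = 0; /* Has the first call happened yet? */
  static double basis = 0.0;    /* Raw clock value at the first call */
  double        now = raw_seconds();

  if (!have_basis)
  {
    basis      = now;
    have_basis = 1;
  }

  return (now - basis);
}
```

With this sketch the first call returns exactly 0.0, and later calls return the number of seconds elapsed since that first call.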
[master 339544359] Add cupsGetClock API.
With the new API, it should be fairly straight-forward to refactor the dest and other timing-specific code and eliminate a bunch of the platform-specific code that gets sprinkled throughout.
I will also note, if this gets back-ported to 2.4.x we will name it _cupsGetClock to avoid introducing a new public API in a stable release.
Looks good to me!
Some of the references I found suggested CLOCK_MONOTONIC_RAW might be preferred over CLOCK_MONOTONIC for this purpose if it's present, but not 100% sure.
It looks like the difference is CLOCK_MONOTONIC_RAW ignores frequency adjustments, so one second is always actually one second, even if an NTP daemon has decided to make seconds shorter or longer in order to slowly adjust the system clock to the correct time. So I guess CLOCK_MONOTONIC_RAW is probably almost always what you really want. But it's only available on Linux.
I had never heard of it until last week, when I noticed dbus-broker was using it. GLib uses vanilla CLOCK_MONOTONIC though, and that affects basically the entire desktop. Maybe GLib should switch, but it seems like a very small problem, so whatever.
Using gettimeofday here is not safe against clock jumps (e.g. a manual setting of the clock). Use clock_gettime instead.