HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

MAX_WAIT_TIME for rescan should be a config option #348

Closed mattjala closed 2 months ago

mattjala commented 2 months ago

The value is currently hardcoded. Making it configurable would allow the runners to wait longer and avoid failures like this.

mattjala commented 2 months ago

This issue isn't due to scans taking a long time. The domain scans are actually getting stuck in an infinite wait due to inaccurate timestamps after #346.

Sometimes, a scan would record a completion timestamp that was slightly BEFORE the recorded time at which the rescan request was sent out. Because the check to stop waiting for the scan requires a scan-finished timestamp later than the scan request time, the wait condition would never be satisfied, and the request would eventually return a 503. The inaccuracy occurs because the node that records the scan completion time is a different node from the one that records the request time.
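
To illustrate the failure mode, here is a minimal sketch of a wait loop of this shape. It is not the actual HSDS code: `wait_for_scan`, `get_scan_time`, and the parameter names are hypothetical, and the timeout stands in for MAX_WAIT_TIME. It only shows that when the completion timestamp comes out slightly earlier than the request timestamp, the loop can exit only by timing out.

```python
import time


def wait_for_scan(get_scan_time, request_time, max_wait_time=1.0, poll_interval=0.01):
    """Poll until a scan timestamp later than request_time is seen.

    Hypothetical helper for illustration only.
    get_scan_time: callable returning the last recorded scan completion
                   time (UNIX seconds), or None if no scan has finished.
    Returns True if a newer scan was observed, False on timeout.
    """
    deadline = time.monotonic() + max_wait_time
    while time.monotonic() < deadline:
        scan_time = get_scan_time()
        # The stop condition requires the scan to have finished strictly
        # after the rescan request was sent.
        if scan_time is not None and scan_time > request_time:
            return True
        time.sleep(poll_interval)
    return False  # the caller would report an error such as a 503


request_time = 1000.0

# Consistent clocks: the completion time is after the request time.
ok = wait_for_scan(lambda: 1000.5, request_time, max_wait_time=0.05)

# Skewed clocks: the scan really did finish, but its recorded time is
# slightly before the request time, so the loop can only time out.
stuck = wait_for_scan(lambda: 999.999, request_time, max_wait_time=0.05)
```

In the skewed case, raising the timeout would not help, since the recorded completion time never moves past the request time.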

I'm not sure why getNow() is more inconsistent between nodes than time.time(). Even when nodes start at different times, time.perf_counter() - app["start_time_relative"] should be a precise measure of how long the node has been online, and app["start_time"] should be an OS-precision UNIX timestamp. Adding them should produce a UNIX timestamp for the current time that is no less accurate than time.time(). It shouldn't be a problem with async operations, since perf_counter continues to count during sleep and is system-wide.
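
For reference, the computation described above can be sketched as follows. This is an assumption-laden reconstruction from the description in this comment, not a copy of the HSDS implementation: `make_app` and `get_now_sketch` are hypothetical names, and only the two `app` keys come from the text.

```python
import time


def make_app():
    """Record both clocks once at startup (hypothetical helper)."""
    return {
        "start_time": time.time(),                   # UNIX timestamp at startup
        "start_time_relative": time.perf_counter(),  # perf_counter reading at startup
    }


def get_now_sketch(app):
    """Startup wall-clock time plus elapsed perf_counter time."""
    elapsed = time.perf_counter() - app["start_time_relative"]
    return app["start_time"] + elapsed


app = make_app()
first = get_now_sketch(app)
second = get_now_sketch(app)
drift = abs(second - time.time())  # difference from the wall clock right now
```

One property of this construction is that the wall clock is sampled only once per node, at startup, so any later adjustment to a node's system clock is not reflected in the values it produces. Whether that accounts for the inconsistency between nodes is not established here.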