Closed OllieJones closed 1 year ago
The outliers are extreme. Some measurements show only a tiny number of outlier timings among thousands, and some "max" times are much bigger than the "p99" (99th-percentile) times.
I don't understand why. I've confirmed all the file systems involved are on locally attached disks, so we don't have some sort of NFS / SMB / CIFS access delay.
Without knowing why these delays happen, and without some way to predict them, there's not much to do except increase the timeout. I set it to 5 seconds.
It's possible some of the timeouts and slowness are NFS / CIFS related. https://wordpress.org/support/topic/uncaught-exception-unable-to-execute-statement-database-is-locked/#post-16434734
I wonder if there's a way to detect a network-attached file system via stat? If so, we could warn the user.
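A plain stat() / statvfs() call doesn't expose the filesystem *type*, so a detection check has to read the mount table instead. Here's a hedged sketch (my own helper names, not anything in the plugin) that scans /proc/self/mounts on Linux and flags common network filesystem types; on non-Linux systems it just returns None rather than guessing:

```python
import os

# Filesystem types we'd treat as network-attached (illustrative, not exhaustive).
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "fuse.sshfs"}

def filesystem_type(path):
    """Return the fs type of the mount containing path, or None if unknown."""
    try:
        with open("/proc/self/mounts") as f:
            mounts = [line.split() for line in f]
    except OSError:
        return None  # not Linux, or /proc unavailable
    real = os.path.realpath(path)
    best = None
    for fields in mounts:
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        if real == mount_point or real.startswith(mount_point.rstrip("/") + "/"):
            # Keep the longest (most specific) matching mount point.
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None

def looks_network_attached(path):
    """True if path appears to live on a network filesystem."""
    return filesystem_type(path) in NETWORK_FS_TYPES
```

This is Linux-specific; a portable version for the plugin would presumably need per-OS branches, which is part of why a simple stat-based check is appealing if one exists.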
I think some of this may have been due to the use of VACUUM blocking access. That's gone from v1.3.2.
Haven't seen this recur since realizing that VACUUM sucks and fixing the code.
Statistics are showing, on GreenGeeks, some extreme outliers in the times to save and load cache objects (many hundreds of milliseconds, compared to much smaller median values).
So, increase the timeout from 500 ms to 5 s.
Switch the journaling mode to WAL (from MEMORY).
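The two changes above boil down to two pragmas. A minimal sketch using Python's sqlite3 module (the plugin itself is PHP, but the pragma names are identical everywhere; the cache file name here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect("cache.sqlite")  # hypothetical cache file name

# Wait up to 5000 ms for a competing writer to finish before failing
# with "database is locked", instead of the old 500 ms.
conn.execute("PRAGMA busy_timeout = 5000")

# Write-ahead logging lets readers proceed while a writer is active,
# unlike MEMORY journaling or the default rollback journal.
mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
print(mode)  # "wal" on filesystems that support it
```

Note that journal_mode = WAL is persistent (stored in the database file), while busy_timeout must be set on every connection.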
Report statistics on p1 and p99 (the first and 99th percentiles) along with the median, p5, and p95, to try to get a handle on whether these large times are basically one-off problems or recurrent.
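As a sketch of what that report computes (the function name and shape are my own, not the plugin's): given a list of save/load timings, pull the cut points out of statistics.quantiles with n=100:

```python
import statistics

def timing_report(samples_ms):
    """Summarize timing samples (ms) at p1, p5, median, p95, p99."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p1": q[0],
        "p5": q[4],
        "median": statistics.median(samples_ms),
        "p95": q[94],
        # If "max" sits far above p99, the slow saves are rare one-offs
        # rather than a recurring plateau.
        "p99": q[98],
    }

report = timing_report(list(range(1, 101)))  # 1..100 ms, uniform spread
```

Comparing p99 against the observed max is the point: a handful of extreme outliers barely moves p99, which is exactly the pattern described at the top of this issue.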