Open etschannen opened 6 years ago
The only thing we need for 6.0 is the version lag.
Status can send a TLogQueuingMetricsRequest to a primary TLog (isLocal && hasBestLocation) and to a remote TLog (!isLocal && hasBestLocation) to get their versions, and then report the difference between the two. Dividing that difference by VERSIONS_PER_SECOND gives an approximate time lag.
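To make the arithmetic concrete, here is a minimal sketch of the conversion from a version difference to an approximate time lag. It is not the status implementation: the function names are hypothetical, the two versions are plain integers standing in for whatever the primary and remote TLogs report, and the value of 1,000,000 for VERSIONS_PER_SECOND is an assumption about the default knob setting.

```python
# Assumed default; the real value comes from the VERSIONS_PER_SECOND knob.
VERSIONS_PER_SECOND = 1_000_000


def version_lag(primary_version: int, remote_version: int) -> int:
    """Return how many versions the remote TLog is behind the primary.

    The two versions are sampled by separate requests, so the remote can
    briefly look ahead of the primary. Clamping to zero is a choice made
    for this sketch, not something the issue specifies.
    """
    return max(primary_version - remote_version, 0)


def lag_seconds(primary_version: int, remote_version: int,
                versions_per_second: int = VERSIONS_PER_SECOND) -> float:
    """Convert the version lag into an approximate time lag in seconds."""
    return version_lag(primary_version, remote_version) / versions_per_second


if __name__ == "__main__":
    # Hypothetical sample values, not real cluster output.
    print(lag_seconds(5_000_000, 2_500_000))
```

The result is only approximate because versions do not advance at exactly a fixed rate and the two TLogs are not queried at the same instant.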
The version lag has been added in https://github.com/apple/foundationdb/pull/492, so I am moving the milestone to 6.1.
The most basic information to monitor is the version lag between the primary DC and the remote DC.
Data distribution monitoring should probably also be broken down by DC, although this may be tricky to implement.
Some basic machine fault tolerance calculations should be updated, and a new DC fault tolerance should be added.
Network latencies between DCs would be nice to have.
Since satellites may not be in active use in some configurations, their failure monitoring may need to be done differently.