When OpenTSDB is not enabled, processing the metrics that would be sent to OpenTSDB is wasted work.
The underlying reason for this change is to make the scheduler run more accurately.
In production, it takes about 100-300 ms to process these metrics.
Suppose processing the metrics always takes 200 ms and one alert is scheduled to run
every minute. Each cycle then takes 60.2 s, so the actual number of alert executions per day becomes
60 × 60 × 24 / 60.2 ≈ 1435.2, fewer than the expected 1440.
Whether losing about 5 executions per day matters depends on the use case, and opinions
may differ.
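To make the arithmetic above concrete, here is a minimal sketch that computes the number of runs per day when every cycle is stretched by a fixed processing overhead. The function name `runs_per_day` and the constant-overhead model are illustrative assumptions, not Bosun code.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86400


def runs_per_day(interval_s, overhead_s):
    """Runs per day when every cycle takes interval_s + overhead_s seconds.

    Assumes the overhead is constant and added to every cycle, which is a
    simplification of the 100-300 ms range observed in production.
    """
    return SECONDS_PER_DAY / (interval_s + overhead_s)


if __name__ == "__main__":
    for overhead_ms in (0, 100, 200, 300):
        runs = runs_per_day(60, overhead_ms / 1000)
        print(f"{overhead_ms:>3} ms overhead: {runs:.1f} runs/day")
```

With no overhead this gives 1440 runs per day; with 200 ms it gives roughly 1435.2, matching the figure above.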
The real problem for us is that one important per-minute SLO metric, bosun_uptime, relies on the accuracy
of the scheduler. Currently, because of this extra processing time, the start time of the per-minute
alert slips by 1 s every few minutes, which leaves gaps in the metric.
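The accumulating delay can be illustrated with a small simulation. This is only a sketch under the same assumption of a constant overhead per cycle; `late_seconds` is a hypothetical helper, and integer milliseconds are used to avoid floating-point rounding.

```python
def late_seconds(interval_ms, overhead_ms, runs):
    """Whole seconds by which each run starts after its ideal slot.

    Run n ideally starts at n * interval_ms. With a constant overhead
    added to every cycle it actually starts at
    n * (interval_ms + overhead_ms).
    """
    offsets = []
    for n in range(runs):
        ideal_start = n * interval_ms
        actual_start = n * (interval_ms + overhead_ms)
        offsets.append((actual_start - ideal_start) // 1000)
    return offsets


if __name__ == "__main__":
    # One-minute alert with 200 ms of overhead per cycle.
    print(late_seconds(60_000, 200, 11))
```

For a one-minute interval and 200 ms of overhead, `late_seconds(60_000, 200, 11)` returns `[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2]`: the start time slips by another full second every five runs, and after 300 runs the alert is a whole minute behind.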
Ideally, we could introduce jitter to reduce the impact of the metrics processing time, or optimize the
processing itself, but both are tricky to implement. This change is not very elegant, but it is straightforward.
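The idea behind the change can be sketched as a simple guard. This is an illustration only: `Scheduler`, `opentsdb_enabled`, `run_alert`, and `process_metrics` are hypothetical names, the ordering of the two steps is an assumption, and Bosun's actual code is written in Go and structured differently.

```python
class Scheduler:
    """Hypothetical scheduler; all names are illustrative, not Bosun's API."""

    def __init__(self, opentsdb_enabled, run_alert, process_metrics):
        self.opentsdb_enabled = opentsdb_enabled
        self.run_alert = run_alert
        self.process_metrics = process_metrics

    def run_cycle(self):
        self.run_alert()
        # Without an OpenTSDB backend the processed metrics have nowhere
        # to go, so skip the work that would otherwise delay the next cycle.
        if self.opentsdb_enabled:
            self.process_metrics()


if __name__ == "__main__":
    calls = []
    scheduler = Scheduler(
        opentsdb_enabled=False,
        run_alert=lambda: calls.append("alert"),
        process_metrics=lambda: calls.append("metrics"),
    )
    scheduler.run_cycle()
    print(calls)
```

With OpenTSDB disabled, only the alert step runs, so the cycle no longer carries the processing overhead.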
Type of change
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
How has this been tested?
Tested in production.
Checklist: