DWR Monitoring, Alerting and Issue Resolution Strategy
Goals
Enable DWR to monitor, identify and rectify most, if not all, of the DWR GOES17 data issues.
Justification
Because GOES17 data is significantly larger and arrives more frequently than GOES15 data, Spatial CIMIS data processing carries a significantly higher probability of data corruption. For this reason, a premature promotion of the DWR GOES17 Spatial CIMIS processes to production / live status would unnecessarily put the team (DWR & UCD) on endless alert, potentially introducing delays in providing ETo data to customers.
What is the strategy?
Before the DWR GOES17 Spatial CIMIS processes are promoted to production / live status, DWR should demonstrate that it can go 2 weeks without a major processing issue while addressing live data delivery issues in a timely manner, so as to avoid data loss and any interruption to its ETo delivery responsibilities.
To accomplish this, the following strategy should be considered:
Designate DWR personnel for the alert team
Create an alert email list populated with the alert team
Set up basic monitoring of critical systems (ping, http, ssh, etc.)
Monitor the quality of real-time data that could produce erroneous output, as is done at UCD
Receive training on how to identify and resolve issues once an alert is sent out.
Currently there is documentation on how to resolve every known issue at UCD.
Work on better ways to handle existing issues is ongoing.
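As an illustration of how an alert could reach the alert email list, here is a minimal Python sketch. The addresses, the subject format and the SMTP relay host are placeholders assumed for the example; none of them come from the DWR or UCD setup.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical addresses; the real alert list is not named in this document.
ALERT_LIST = "cimis-alerts@example.org"
SENDER = "cimis-monitor@example.org"


def build_alert(host: str, problem: str) -> EmailMessage:
    """Compose a plain-text alert addressed to the alert email list."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = ALERT_LIST
    msg["Subject"] = f"[CIMIS alert] {host}: {problem}"
    msg.set_content(
        f"Host: {host}\n"
        f"Problem: {problem}\n"
        "See the issue resolution documentation for next steps.\n"
    )
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Send the alert through an SMTP relay (the relay host is an assumption)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Keeping message construction separate from sending lets the alert text be checked without a mail server.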
Specific resources to monitor
AppDynamics can monitor and provide basic host alert information such as general availability and CPU, RAM and disk usage. In addition to general availability alerts, these are the specific services that need monitoring with alerts:
CIMIS grb-box
Search for the keywords "dsp-box down" in the following file:
/home/cimis/logs/status
When the /grb filesystem reaches a certain usage percentage (80%?), send a warning alert
CIMIS processor - test
Search for the keywords "max = 0" and "no data" in the following file:
http://process-test/status/band-2 (located in /var/www/status/band-2)
When the /apps filesystem reaches a certain usage percentage (80%?), send a warning alert
CIMIS processor - prod
Search for the keywords "max = 0" and "no data" in the following file:
http://process-prod/status/band-2 (located in /var/www/status/band-2)
When the /apps filesystem reaches a certain usage percentage (80%?), send a warning alert
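The two kinds of checks listed above (keyword search in a status file and a filesystem usage threshold) could be sketched in Python roughly as follows. The function names are illustrative, the keyword grouping is an interpretation of the list above, and the 80% default mirrors the tentative figure, which has not been confirmed.

```python
from __future__ import annotations

import shutil

# Keywords taken from the checks listed above.
GRB_KEYWORDS = ("dsp-box down",)
PROCESSOR_KEYWORDS = ("max = 0", "no data")


def find_keywords(status_text: str, keywords: tuple[str, ...]) -> list[str]:
    """Return the status lines that contain any of the given keywords."""
    return [
        line
        for line in status_text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


def usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_warning(path: str, threshold: float = 80.0) -> bool:
    """True when usage of the filesystem at `path` is at or above the threshold."""
    return usage_percent(path) >= threshold
```

On the grb-box, the text passed to `find_keywords` would come from reading /home/cimis/logs/status; on the processors it would come from the band-2 status page or its file under /var/www/status. A non-empty result, or `disk_warning("/grb")` or `disk_warning("/apps")` returning True, would trigger an alert. The percentage computed here may differ slightly from the figure `df` reports, because `df` accounts for reserved blocks.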
Requested Strategy
This strategy requires DWR firewall rules that allow the remote monitoring service to access ports 22, 80 and 443 on the following Spatial CIMIS servers:
The public-facing IPs of all Spatial CIMIS servers must be known.
The remote monitoring service is Uptime Robot. The IPs to whitelist are listed here: https://uptimerobot.com/locations.php https://uptimerobot.com/inc/files/ips/IPv4.txt
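Once the firewall rules are in place, reachability of ports 22, 80 and 443 could be spot-checked with a small Python sketch like the one below. It only tests TCP connectivity from the machine it runs on, so it does not prove that Uptime Robot's addresses are allowed. The host passed in is a placeholder, since the servers' public IPs are not listed here.

```python
from __future__ import annotations

import socket

# Ports named in the firewall requirement above.
MONITORED_PORTS = (22, 80, 443)


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_host(host: str, ports: tuple[int, ...] = MONITORED_PORTS) -> dict[int, bool]:
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port) for port in ports}
```

Calling `check_host` with a server's public IP returns a dictionary keyed by port; any `False` value suggests a firewall rule or service that still needs attention.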