dsotirho-ucsc closed this issue 2 months ago.
Assignee to provide symptoms and solution in description.
~For demo, reboot a GL instance on day before demo while scan is ongoing (prepare proof). Show that alarm did not go off.~
I don't think we need an elaborate demo. We discovered that the attempted fixes from the first two PRs (#6155 and #6315) weren't effective before we even got to the demo. IOW, we will likely make the same discovery about PR #6374 organically, during normal operations.
@hannes-ucsc: "Rebooting the instance still results in a false alarm, for example:"
https://groups.google.com/a/ucsc.edu/g/azul-group/c/zkmzRVv_rec/m/WbNWhMsKBQAJ
You are receiving this email because your Amazon CloudWatch Alarm "azul-clamscan-dev.alarm" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Saturday 27 April, 2024 09:48:13 UTC".
View this alarm in the AWS Management Console: https://us-east-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=us-east-1#alarmsV2:alarm/azul-clamscan-dev.alarm
Alarm Details:
- Name: azul-clamscan-dev.alarm
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).
- Timestamp: Saturday 27 April, 2024 09:48:13 UTC
- AWS Account: 122796619775
- Alarm Arn: arn:aws:cloudwatch:us-east-1:122796619775:alarm:azul-clamscan-dev.alarm
Threshold:
- The alarm is in the ALARM state when the metric is LessThanThreshold 1.0 for at least 1 of the last 1 period(s) of 86400 seconds.
Monitored Metrics:
- MetricExpression: FILL(log_count_raw, 0)
- MetricLabel: No Label
State Change Actions:
- OK: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
- ALARM: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
- INSUFFICIENT_DATA:
Assignee to consider increasing the frequency of the cronjob to */18 hours.
There is a contradiction in the above comment: */18 would not be an increase. Assignee to formalize plan.
The alarm fires when a successful clamscan message wasn't logged within the last 24 hours. On average, a successful scan takes anywhere from 10 to 14 hours (quicker on anvildev and prod, slower on anvilprod and dev).
Currently clamscan is set up to run twice a day. This causes the alarm to fire if the scan following a reboot takes longer than the scan that completed just prior to the reboot.
For this example, assume an 11 hour scan starting at 00:00 and 12:00:
start 00:00
end 11:00
start 12:00
reboot 13:00
start 00:00
end 11:05 (alarm fired at 11:01)
Since systemd timers won't start a service that is still running from its last activation by a timer, I propose setting the clamscan timer to run 6 times a day, or every 4 hours (*-*-* */4:00:00).
Example: an 11 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):
start 00:00
end 11:00 *
start 12:00
reboot 13:00
start 16:00
end 03:00 *
11:00 to 03:00 = 16 hours
Example: a 14 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):
start 00:00
end 14:00 *
start 16:00
reboot 17:00
start 20:00
end 10:00 *
14:00 to 10:00 = 20 hours
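The two timelines above follow a simple pattern that can be sketched in Python (a hypothetical model, not project code; times are in hours, and it assumes, as in the examples, that the reboot happens one hour into the interrupted scan):

```python
import math

def gap_between_successes(duration, interval, reboot_delay=1):
    """Worst-case hours between successful scan completions when a
    reboot cancels a running scan. All arguments are in hours."""
    last_success = duration  # first scan starts at t=0
    # systemd timers skip activations while the unit is still running,
    # so the next scan starts at the first interval boundary afterwards
    next_start = math.ceil(last_success / interval) * interval
    reboot = next_start + reboot_delay  # the reboot cancels this scan
    restart = math.ceil(reboot / interval) * interval
    next_success = restart + duration
    return next_success - last_success

print(gap_between_successes(11, 4))   # 16 hours, as in the first example
print(gap_between_successes(14, 4))   # 20 hours, as in the second example
print(gap_between_successes(11, 12))  # 24 hours with the twice-daily timer
```

The last call shows why the twice-daily schedule is fragile: an 11 hour scan plus a reboot pushes the gap right up to the 24 hour threshold.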
@hannes-ucsc: "Let's just start the unit every hour. If the scan takes less than an hour, it's actually desirable to start it on the next full hour. Extra care to be taken to ensure that the scans aren't running in parallel or overlap."
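A minimal sketch of what an hourly schedule could look like as systemd units (unit names and paths are hypothetical, not taken from the repository; `Type=oneshot` keeps the unit in an active state until the scan finishes, and a timer never activates a unit that is still active, so scans cannot overlap):

```ini
# /etc/systemd/system/clamscan.service (hypothetical)
[Unit]
Description=ClamAV scan of the GitLab data volume

[Service]
# oneshot: the unit stays active until the scan finishes,
# so the timer cannot start a second, overlapping scan
Type=oneshot
ExecStart=/usr/bin/clamscan --recursive /mnt/gitlab

# /etc/systemd/system/clamscan.timer (hypothetical)
[Unit]
Description=Run the ClamAV scan on every full hour

[Timer]
OnCalendar=*-*-* *:00:00
Persistent=true

[Install]
WantedBy=timers.target
```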
This is still an issue.
Unfortunately, this did not solve the issue for dev, where the scan can take 12-14 hours to complete.
From: https://groups.google.com/a/ucsc.edu/g/azul-group/c/A6nfV6MFBEY/m/kS7hIwcTAwAJ
| Deployment | VolSize | Used |
|---|---|---|
| dev | 200 | 117 |
| anvildev | 150 | 86 |
| prod | 100 | 66 |
| anvilprod | 150 | 76 |
@hannes: It makes sense that dev scans take the longest because there is the most data. Assuming that we only reboot a GitLab instance once per day, if the scan takes longer than 12 hours and the reboot occurs close to the end of an ongoing scan, we will observe this alarm.
Spike for design.
A possible solution may be to make the clamscan alarm a composite alarm consisting of:
1) What the current clamscan alarm does: if no success message is available for the past 24H, this sub-alarm goes into the ALARM state, and it's OK otherwise.
2) Similar to the configuration of the 1st sub-alarm, but hooked on the log message Starting Reboot: if such a message is observed within the past 24H, this sub-alarm is in the OK state, and it goes into the ALARM (or INSUFFICIENT_DATA) state otherwise.
These two sub-alarms don't have alarm state change actions. The composite alarm is the one with state change actions (notifying the Amazon SNS monitoring topic). It only changes to the ALARM state when both sub-alarms are in the ALARM state, and it's OK otherwise. This also implies that the 2nd sub-alarm might be in the ALARM state most of the time, unless there's been a recent recreation of the GitLab instance for that deployment.
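To illustrate the proposed logic (a sketch only, with hypothetical sub-alarm names, not the actual implementation): a CloudWatch composite alarm is defined by a rule expression over its sub-alarms' states, and the small truth table below mirrors the behavior described above:

```python
def composite_rule(scan_alarm, reboot_alarm):
    """Build a CloudWatch AlarmRule expression: fire only if the scan
    has not succeeded in 24H AND no reboot message was seen in 24H."""
    return f"ALARM({scan_alarm}) AND ALARM({reboot_alarm})"

def composite_state(scan_state, reboot_state):
    # The composite is in ALARM only when both sub-alarms are; a recent
    # reboot (reboot sub-alarm in OK) suppresses the missing-scan alarm.
    both = scan_state == 'ALARM' and reboot_state == 'ALARM'
    return 'ALARM' if both else 'OK'

print(composite_rule('azul-clamscan-dev.scan', 'azul-clamscan-dev.reboot'))
# A reboot within 24H explains the missing success message: no alarm
print(composite_state('ALARM', 'OK'))      # OK
# No success message and no recent reboot: a real failure
print(composite_state('ALARM', 'ALARM'))   # ALARM
```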
@hannes-ucsc: "We walked through the scenario depicted in the screenshot below and determined that the proposed solution would address the issue at hand. Assignee to implement it."
@hannes-ucsc: "Also need to include the Starting Power-Off… message. And also check for other candidates like systemd: Starting."
Note my edits to the demo instructions.
The azul-clamscan-<deployment> alarm is triggered if a clamscan succeeded log message is not produced within an 18 hour period. Since the ClamAV scan is run twice daily and takes many hours to complete, it is possible for a reboot of the GitLab instance (due to an update, backup, or testing) to cancel an ongoing ClamAV scan, so that no scan completes successfully within 18 hours of the last successful completion. The recommended solution is to increase the alarm's period to 24 hours.
Note: 24 hours is the maximum allowed time period for an alarm with one evaluation period. (from: Common features of CloudWatch alarms)
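Under the recommended solution, the alarm parameters would look roughly like this (a sketch of boto3 `put_metric_alarm` keyword arguments; the metric namespace and metric name are hypothetical, while the alarm name, expression, threshold, and comparison operator come from the alarm email quoted above):

```python
# Hypothetical sketch of the adjusted alarm definition
alarm_kwargs = {
    'AlarmName': 'azul-clamscan-dev.alarm',
    'ComparisonOperator': 'LessThanThreshold',
    'Threshold': 1.0,
    'EvaluationPeriods': 1,  # 86400 s x 1 period = the 24 hour maximum
    'Metrics': [
        {
            'Id': 'filled',
            # Treat periods without a success message as a 0 datapoint
            'Expression': 'FILL(log_count_raw, 0)',
        },
        {
            'Id': 'log_count_raw',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'LogMetrics',           # hypothetical
                    'MetricName': 'clamscan_succeeded',  # hypothetical
                },
                'Period': 86400,  # 24 hours, up from 18
                'Stat': 'Sum',
            },
            'ReturnData': False,
        },
    ],
}
# import boto3
# boto3.client('cloudwatch').put_metric_alarm(**alarm_kwargs)
```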