dsotirho-ucsc closed this issue 2 months ago.
Assignee to provide symptoms and solution in description.
~For demo, reboot a GL instance on day before demo while scan is ongoing (prepare proof). Show that alarm did not go off.~
I don't think we need an elaborate demo. We discovered that the attempted fixes from the first two PRs (#6155 and #6315) weren't effective before we even got to the demo. IOW, we will likely make the same discovery about PR #6374 organically, during normal operations.
@hannes-ucsc: "Rebooting the instance still results in a false alarm, for example:"
https://groups.google.com/a/ucsc.edu/g/azul-group/c/zkmzRVv_rec/m/WbNWhMsKBQAJ
You are receiving this email because your Amazon CloudWatch Alarm "azul-clamscan-dev.alarm" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Saturday 27 April, 2024 09:48:13 UTC".
View this alarm in the AWS Management Console: https://us-east-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=us-east-1#alarmsV2:alarm/azul-clamscan-dev.alarm
Alarm Details:
- Name: azul-clamscan-dev.alarm
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).
- Timestamp: Saturday 27 April, 2024 09:48:13 UTC
- AWS Account: 122796619775
- Alarm Arn: arn:aws:cloudwatch:us-east-1:122796619775:alarm:azul-clamscan-dev.alarm
Threshold:
- The alarm is in the ALARM state when the metric is LessThanThreshold 1.0 for at least 1 of the last 1 period(s) of 86400 seconds.
Monitored Metrics:
- MetricExpression: FILL(log_count_raw, 0)
- MetricLabel: No Label
State Change Actions:
- OK: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
- ALARM: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
- INSUFFICIENT_DATA:
Assignee to consider increasing the frequency of the cronjob to */18 hours.
There is a contradiction in the above comment: */18 would not be an increase. Assignee to formalize plan.
The alarm fires when a successful clamscan message wasn't logged within the last 24 hours. On average, a successful scan takes anywhere from 10 to 14 hours (quicker on anvildev and prod, slower on anvilprod and dev).
Currently clamscan is set up to run twice a day. This causes the alarm to fire if the scan following a reboot takes longer than the scan that completed just prior to the reboot.
For this example, assume an 11 hour scan starting at 00:00 and 12:00:
start 00:00
end 11:00
start 12:00
reboot 13:00
start 00:00
end 11:05 (alarm fired at 11:01)
Since systemd timers won't start a service that is still running from its last activation by a timer, I propose setting the clamscan timer to run 6 times a day, or every 4 hours (*-*-* */4:00:00).
Example: an 11 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):
start 00:00
end 11:00 *
start 12:00
reboot 13:00
start 16:00
end 03:00 *
11:00 to 03:00 = 16 hours
Example: a 14 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):
start 00:00
end 14:00 *
start 16:00
reboot 17:00
start 20:00
end 10:00 *
14:00 to 10:00 = 20 hours
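The two timelines above follow a simple pattern that can be sketched in Python (a hypothetical model, not project code; times are in hours, and it assumes, as in the examples, that the reboot happens one hour into the interrupted scan):

```python
import math

def gap_between_successes(duration, interval, reboot_delay=1):
    """Worst-case hours between successful scan completions when a
    reboot cancels a running scan. All arguments are in hours."""
    last_success = duration  # first scan starts at t=0
    # systemd timers skip activations while the unit is still running,
    # so the next scan starts at the first interval boundary afterwards
    next_start = math.ceil(last_success / interval) * interval
    reboot = next_start + reboot_delay  # the reboot cancels this scan
    restart = math.ceil(reboot / interval) * interval
    next_success = restart + duration
    return next_success - last_success

print(gap_between_successes(11, 4))   # 16 hours, as in the first example
print(gap_between_successes(14, 4))   # 20 hours, as in the second example
print(gap_between_successes(11, 12))  # 24 hours with the twice-daily timer
```

The last call shows why the twice-daily schedule is fragile: an 11 hour scan plus a reboot pushes the gap right up to the 24 hour threshold.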
@hannes-ucsc: "Let's just start the unit every hour. If the scan takes less than an hour, it's actually desirable to start it on the next full hour. Extra care to be taken to ensure that the scans aren't running in parallel or overlap."
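A minimal sketch of what an hourly schedule could look like as systemd units (unit names and paths are hypothetical, not taken from the repository; `Type=oneshot` keeps the unit in an active state until the scan finishes, and a timer never activates a unit that is still active, so scans cannot overlap):

```ini
# /etc/systemd/system/clamscan.service (hypothetical)
[Unit]
Description=ClamAV scan of the GitLab data volume

[Service]
# oneshot: the unit stays active until the scan finishes,
# so the timer cannot start a second, overlapping scan
Type=oneshot
ExecStart=/usr/bin/clamscan --recursive /mnt/gitlab

# /etc/systemd/system/clamscan.timer (hypothetical)
[Unit]
Description=Run the ClamAV scan on every full hour

[Timer]
OnCalendar=*-*-* *:00:00
Persistent=true

[Install]
WantedBy=timers.target
```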
This is still an issue.
Unfortunately, this did not solve the issue for dev, where the scan can take 12-14 hours to complete.
From: https://groups.google.com/a/ucsc.edu/g/azul-group/c/A6nfV6MFBEY/m/kS7hIwcTAwAJ
| Deployment | VolSize | Used |
|---|---|---|
| dev | 200 | 117 |
| anvildev | 150 | 86 |
| prod | 100 | 66 |
| anvilprod | 150 | 76 |
@hannes: It makes sense that dev scans take the longest because there is the most data. Assuming that we only reboot a GitLab instance once per day, if the scan takes longer than 12 hours and the reboot occurs close to the end of an ongoing scan, we will observe this alarm.
Spike for design.
A possible solution may be to make the clamscan alarm a composite alarm consisting of:
1) What the current clamscan alarm does: if no success message is available for the past 24H, this sub-alarm goes into the ALARM state, and it's OK otherwise.
2) Similar to the configuration of the 1st sub-alarm, but hooked on the log message Starting Reboot: if such a message is observed within the past 24H, this sub-alarm is in the OK state, and it goes into the ALARM (or INSUFFICIENT_DATA) state otherwise.
These two sub-alarms don't have alarm state change actions. The composite alarm is the one with state change actions (notifying the Amazon SNS monitoring topic). It only changes to the ALARM state when both sub-alarms are in the ALARM state, and it's OK otherwise. This also implies that the 2nd sub-alarm might be in the ALARM state most of the time, unless there's been a recent recreation of the GitLab instance for that deployment.
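To illustrate the proposed logic (a sketch only, with hypothetical sub-alarm names, not the actual implementation): a CloudWatch composite alarm is defined by a rule expression over its sub-alarms' states, and the small truth table below mirrors the behavior described above:

```python
def composite_rule(scan_alarm, reboot_alarm):
    """Build a CloudWatch AlarmRule expression: fire only if the scan
    has not succeeded in 24H AND no reboot message was seen in 24H."""
    return f"ALARM({scan_alarm}) AND ALARM({reboot_alarm})"

def composite_state(scan_state, reboot_state):
    # The composite is in ALARM only when both sub-alarms are; a recent
    # reboot (reboot sub-alarm in OK) suppresses the missing-scan alarm.
    both = scan_state == 'ALARM' and reboot_state == 'ALARM'
    return 'ALARM' if both else 'OK'

print(composite_rule('azul-clamscan-dev.scan', 'azul-clamscan-dev.reboot'))
# A reboot within 24H explains the missing success message: no alarm
print(composite_state('ALARM', 'OK'))      # OK
# No success message and no recent reboot: a real failure
print(composite_state('ALARM', 'ALARM'))   # ALARM
```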
@hannes-ucsc: "We walked through the scenario depicted in the screenshot below and determined that the proposed solution would address the issue at hand. Assignee to implement it."
@hannes-ucsc: "Also need to include the Starting Power-Off… message. And also check for other candidates like systemd: Starting."
Note my edits to the demo instructions.
The azul-clamscan-<deployment> alarm is triggered if a clamscan succeeded log message is not produced within an 18 hour period. Since the ClamAV scan is run twice daily and takes many hours to complete, it is possible for a reboot of the GitLab instance (due to an update, backup, or testing) to cancel an ongoing ClamAV scan, so that no scan completes successfully within 18 hours of the last successful completion. The recommended solution is to increase the alarm's period to 24 hours.
Note: 24 hours is the maximum allowed time period for an alarm with one evaluation period. (from: Common features of CloudWatch alarms)
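Under the recommended solution, the alarm parameters would look roughly like this (a sketch of boto3 `put_metric_alarm` keyword arguments; the metric namespace and metric name are hypothetical, while the alarm name, expression, threshold, and comparison operator come from the alarm email quoted above):

```python
# Hypothetical sketch of the adjusted alarm definition
alarm_kwargs = {
    'AlarmName': 'azul-clamscan-dev.alarm',
    'ComparisonOperator': 'LessThanThreshold',
    'Threshold': 1.0,
    'EvaluationPeriods': 1,  # 86400 s x 1 period = the 24 hour maximum
    'Metrics': [
        {
            'Id': 'filled',
            # Treat periods without a success message as a 0 datapoint
            'Expression': 'FILL(log_count_raw, 0)',
        },
        {
            'Id': 'log_count_raw',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'LogMetrics',           # hypothetical
                    'MetricName': 'clamscan_succeeded',  # hypothetical
                },
                'Period': 86400,  # 24 hours, up from 18
                'Stat': 'Sum',
            },
            'ReturnData': False,
        },
    ],
}
# import boto3
# boto3.client('cloudwatch').put_metric_alarm(**alarm_kwargs)
```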