DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Rebooting GitLab may trigger ClamAV alarm #6114

Closed dsotirho-ucsc closed 2 months ago

dsotirho-ucsc commented 6 months ago

The azul-clamscan-<deployment> alarm is triggered if a clamscan succeeded log message is not produced within an 18 hour period. Since the ClamAV scan runs twice daily and takes many hours to complete, a reboot of the GitLab instance (due to an update, backup, or testing) can cancel an ongoing scan and thereby prevent the next successful completion from falling within 18 hours of the previous one.

The recommended solution is to increase the alarm's period to 24 hours.

Note: 24 hours is the maximum allowed time period for an alarm with one evaluation period. (from: Common features of CloudWatch alarms)

The number of evaluation periods for an alarm multiplied by the length of each evaluation period can't exceed one day.
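
For reference, a minimal boto3 sketch of such a widened alarm is shown below. This is illustrative only: it assumes a plain metric alarm on a hypothetical log_count_raw metric, whereas the real alarm uses a FILL() metric-math expression (see the alarm details quoted later in this issue).

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Illustrative sketch only: metric name and namespace are hypothetical, and
# the real alarm uses a FILL(log_count_raw, 0) metric-math expression
# rather than a raw metric.
cloudwatch.put_metric_alarm(
    AlarmName='azul-clamscan-dev.alarm',
    Namespace='azul',                        # hypothetical namespace
    MetricName='log_count_raw',              # count of 'clamscan succeeded' log messages
    Statistic='Sum',
    Period=86400,                            # 24 hours, the maximum length of a single period
    EvaluationPeriods=1,                     # 1 period x 86400 s stays within the one-day limit
    Threshold=1.0,
    ComparisonOperator='LessThanThreshold',  # alarm when fewer than one success in 24 hours
    TreatMissingData='breaching',            # treat missing data as a missed scan
    AlarmActions=['arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev'],
)
```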

dsotirho-ucsc commented 6 months ago

Assignee to provide symptoms and solution in description.

hannes-ucsc commented 5 months ago

~For demo, reboot a GL instance on day before demo while scan is ongoing (prepare proof). Show that alarm did not go off.~

I don't think we need an elaborate demo. We discovered, before we even got to the demo, that the attempted fixes from the first two PRs (#6155 and #6315) weren't effective. In other words, we will likely make the same discovery about PR #6374 organically, during normal operations.

dsotirho-ucsc commented 5 months ago

@hannes-ucsc: "Rebooting the instance still results in a false alarm, for example:"

https://groups.google.com/a/ucsc.edu/g/azul-group/c/zkmzRVv_rec/m/WbNWhMsKBQAJ

You are receiving this email because your Amazon CloudWatch Alarm "azul-clamscan-dev.alarm" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Saturday 27 April, 2024 09:48:13 UTC".

View this alarm in the AWS Management Console: https://us-east-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=us-east-1#alarmsV2:alarm/azul-clamscan-dev.alarm

Alarm Details:

  • Name: azul-clamscan-dev.alarm
  • Description:
  • State Change: OK -> ALARM
  • Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).
  • Timestamp: Saturday 27 April, 2024 09:48:13 UTC
  • AWS Account: 122796619775
  • Alarm Arn: arn:aws:cloudwatch:us-east-1:122796619775:alarm:azul-clamscan-dev.alarm

Threshold:

  • The alarm is in the ALARM state when the metric is LessThanThreshold 1.0 for at least 1 of the last 1 period(s) of 86400 seconds.

Monitored Metrics:

  • MetricExpression: FILL(log_count_raw, 0)
  • MetricLabel: No Label

State Change Actions:

  • OK: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • ALARM: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • INSUFFICIENT_DATA:

dsotirho-ucsc commented 5 months ago

Assignee to consider increasing the frequency of the cronjob to */18 hours

dsotirho-ucsc commented 4 months ago

There is a contradiction in the above comment: */18 would not be an increase. Assignee to formalize plan.

dsotirho-ucsc commented 4 months ago

The alarm fires when a successful clamscan message wasn't logged within the last 24 hours. On average, a successful scan takes anywhere from 10 to 14 hours (quicker on anvildev and prod, slower on anvilprod and dev).

Currently clamscan is set up to run twice a day. This causes the alarm to fire if the scan following a reboot takes longer than the scan that completed just prior to the reboot.

For this example, assume an 11 hour scan starting at 00:00 and 12:00:

start   00:00
end     11:00
start   12:00
reboot  13:00
start   00:00
end     11:05 (alarm fired at 11:01)

Since systemd timers won't start a service that is still running from its last activation by a timer, I propose setting the clamscan timer to run 6 times a day, i.e. every 4 hours (*-*-* */4:00:00).

Example, 11 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):

start   00:00
end     11:00 *
start   12:00
reboot  13:00
start   16:00
end     03:00 *

11:00 to 03:00 = 16 hours

Example, 14 hour scan starting every 4 hours (00, 04, 08, 12, 16, 20):

start   00:00
end     14:00 *
start   16:00
reboot  17:00
start   20:00
end     10:00 *

14:00 to 10:00 = 20 hours
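
For illustration, here is a small back-of-the-envelope Python sketch (not part of the codebase) that replays the two timelines above. It assumes timer activations at a fixed interval, that an activation is skipped while the unit is still running, and that a reboot kills the in-flight scan; the durations and reboot times are taken from the examples, everything else is hypothetical.

```python
def completion_gap(scan_h: float, interval_h: float, reboot_h: float,
                   horizon_h: float = 72.0) -> float:
    """Hours between the two successful scan completions surrounding one reboot."""
    ends = []          # times (in hours) at which a scan completed successfully
    busy_until = 0.0   # the scan unit is running until this time
    t = 0.0            # next timer activation
    while t < horizon_h and len(ends) < 2:
        if t >= busy_until:                  # systemd won't re-activate a running unit
            end = t + scan_h
            if t <= reboot_h < end:          # the reboot cancels this scan
                busy_until = reboot_h
            else:
                busy_until = end
                ends.append(end)
        t += interval_h
    return ends[1] - ends[0]

print(completion_gap(11, 4, reboot_h=13))    # 16.0, matches the first example
print(completion_gap(14, 4, reboot_h=17))    # 20.0, matches the second example
print(completion_gap(14, 12, reboot_h=27))   # 36.0, the old twice-daily schedule blows past 24 h
```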

dsotirho-ucsc commented 4 months ago

@hannes-ucsc: "Let's just start the unit every hour. If the scan takes less than an hour, it's actually desirable to start it on the next full hour. Extra care to be taken to ensure that the scans aren't running in parallel or overlap."

dsotirho-ucsc commented 3 months ago

This is still an issue.

Unfortunately, this did not solve the issue for dev, where the scan can take 12 to 14 hours to complete.

From: https://groups.google.com/a/ucsc.edu/g/azul-group/c/A6nfV6MFBEY/m/kS7hIwcTAwAJ

| Deployment | VolSize | Used |
| ---------- | ------- | ---- |
| dev        | 200     | 117  |
| anvildev   | 150     | 86   |
| prod       | 100     | 66   |
| anvilprod  | 150     | 76   |

@hannes: It makes sense that dev scans take the longest because there is the most data. Assuming that we only reboot a GitLab instance once per day, if the scan takes longer than 12 hours and the reboot occurs close to the end of an ongoing scan, we will observe this alarm.

dsotirho-ucsc commented 3 months ago

Spike for design.

achave11-ucsc commented 3 months ago

A possible solution may be to make the clamscan alarm a composite alarm consisting of:

1) What the current clamscan alarm does: if no success message has been logged within the past 24 hours, this sub-alarm goes into the ALARM state; otherwise it is OK.

2) Similar to the configuration of the first sub-alarm, but hooked on the log message Starting Reboot: if such a message has been observed within the past 24 hours, this sub-alarm is in the OK state; otherwise it goes into the ALARM state (or INSUFFICIENT_DATA).

These two sub-alarms don't have alarm state change actions. The composite alarm is the one with state change actions (notifying the Amazon SNS monitoring topic). It only changes to the ALARM state when both sub-alarms are in the ALARM state, and is OK otherwise. This also implies that the 2nd sub-alarm will be in the ALARM state most of the time, unless the GitLab instance for that deployment has recently been recreated.
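
A minimal boto3 sketch of the proposed composite alarm might look like the following. The sub-alarm names are hypothetical and the real configuration would live wherever the existing alarms are defined; the point is only that the two metric sub-alarms carry no actions of their own, while the composite notifies the monitoring SNS topic.

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Hypothetical sub-alarm names; only the composite alarm has actions.
cloudwatch.put_composite_alarm(
    AlarmName='azul-clamscan-dev.alarm',
    AlarmRule=(
        'ALARM("azul-clamscan-dev.no_success.alarm") '    # no success message in 24 h
        'AND ALARM("azul-clamscan-dev.no_reboot.alarm")'  # and no recent reboot either
    ),
    ActionsEnabled=True,
    AlarmActions=['arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev'],
    OKActions=['arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev'],
)
```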

dsotirho-ucsc commented 3 months ago

@hannes-ucsc: "We walked through the scenario depicted in the screenshot below and determined that the proposed solution would address the issue at hand. Assignee to implement it."

[Screenshot 2024-06-27 at 12:21:46 PM]

achave11-ucsc commented 2 months ago

@hannes-ucsc: "Also need to include the Starting Power-Off… message. And also check for other candidates like systemd: Starting." image

hannes-ucsc commented 2 months ago

Note my edits to the demo instructions.