awslabs / cdk-serverless-clamscan


possible efs exhaustion due to directories not being cleaned up #1308

Open davidmwhynot opened 2 days ago

davidmwhynot commented 2 days ago

Package version: cdk-serverless-clamscan@2.8.22

About two weeks ago, I encountered an issue with my deployment of the cdk-serverless-clamscan solution. The solution had been humming along without any problems for about a month before this happened. I'll say right out of the gate that the last stack update that would have updated the function images was roughly 7 days before the issue occurred; however, once I noticed the issue I re-ran cdk deploy and it did not resolve the problem.

The problem was as follows: around 2:20pm EST on 9/27, the lambda function that performs the virus scan on files uploaded to the scan bucket started to either time out or take an extremely long time relative to the size of the files being scanned (> 5 minutes to scan a file < 100 KB in size).

Fortunately, the usage of this solution in our application is feature flagged so, once the problem was identified, I was able to quickly disable uploads to the scan bucket and bypass the virus scanning process.

Once I'd done that, I started looking into the lambda function logs. Here's a sample log of an execution that timed out early on:

2024-09-27 18:27:14.653 START RequestId: cc9b94e4-c38d-4185-a97d-c7af4b1582cd Version: $LATEST
2024-09-27 18:27:14.654 {"level":"INFO","location":"decorate:448","message":{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"us-east-2","eventTime":"2024-09-27T18:27:12.440Z","eventName":"ObjectCreated:Copy","userIdentity":{"principalId":"AWS:<REDACTED>"},"requestParameters":{"sourceIPAddress":"<REDACTED>"},"responseElements":{"x-amz-request-id":"<REDACTED>","x-amz-id-2":"<REDACTED>"},"s3":{"s3SchemaVersion":"1.0","configurationId":"<REDACTED>","bucket":{"name":"<REDACTED>","ownerIdentity":{"principalId":"<REDACTED>"},"arn":"arn:aws:s3:::<REDACTED>"},"object":{"key":"<REDACTED>","size":9390,"eTag":"<REDACTED>","sequencer":"<REDACTED>"}}}]},"timestamp":"2024-09-27 18:27:14,653+0000","service":"virus-scan","cold_start":true,"function_name":"<REDACTED>","function_memory_size":"3072","function_arn":"arn:aws:lambda:us-east-2:<REDACTED>:function:<REDACTED>","function_request_id":"cc9b94e4-c38d-4185-a97d-c7af4b1582cd","xray_trace_id":"<REDACTED>"}
2024-09-27 18:27:14.654 {"level":"INFO","location":"lambda_handler:67","message":{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"us-east-2","eventTime":"2024-09-27T18:27:12.440Z","eventName":"ObjectCreated:Copy","userIdentity":{"principalId":"AWS:<REDACTED>"},"requestParameters":{"sourceIPAddress":"<REDACTED>"},"responseElements":{"x-amz-request-id":"<REDACTED>","x-amz-id-2":"<REDACTED>"},"s3":{"s3SchemaVersion":"1.0","configurationId":"<REDACTED>","bucket":{"name":"<REDACTED>","ownerIdentity":{"principalId":"<REDACTED>"},"arn":"arn:aws:s3:::<REDACTED>"},"object":{"key":"<REDACTED>","size":9390,"eTag":"<REDACTED>","sequencer":"<REDACTED>"}}}]},"timestamp":"2024-09-27 18:27:14,654+0000","service":"virus-scan","cold_start":true,"function_name":"<REDACTED>","function_memory_size":"3072","function_arn":"arn:aws:lambda:us-east-2:<REDACTED>:function:<REDACTED>","function_request_id":"cc9b94e4-c38d-4185-a97d-c7af4b1582cd","xray_trace_id":"<REDACTED>"}
2024-09-27 18:27:14.654 Attempting to set scan-status of <REDACTED> to IN PROGRESS
2024-09-27 18:27:14.780 Attempting to create the EFS filepath /mnt/lambda/cc9b94e4-c38d-4185-a97d-c7af4b1582cd/29d5ea40ff33170a9c567e9ec799955069dace7ccfeaa0236804c823766b015d if it does not already exist
2024-09-27 18:27:14.803 Attempting to create the EFS filepath /mnt/lambda/cc9b94e4-c38d-4185-a97d-c7af4b1582cd-tmp/29d5ea40ff33170a9c567e9ec799955069dace7ccfeaa0236804c823766b015d if it does not already exist
2024-09-27 18:27:14.826 Attempting to download of <REDACTED> from the <REDACTED> S3 Bucket to the EFS filepath /mnt/lambda/cc9b94e4-c38d-4185-a97d-c7af4b1582cd
2024-09-27 18:27:14.949 Attempting to create the EFS filepath /mnt/lambda/virus_database/29d5ea40ff33170a9c567e9ec799955069dace7ccfeaa0236804c823766b015d if it does not already exist
2024-09-27 18:27:14.957 Attempting to update the virus database. Last update was at 0 epoch time. Current time is 1727461634.9567144
2024-09-27 18:27:15.424 Attempting to scan the file <REDACTED> from the <REDACTED> S3 Bucket located at the EFS path /mnt/lambda/cc9b94e4-c38d-4185-a97d-c7af4b1582cd
2024-09-27 18:42:14.758 2024-09-27T18:42:14.758Z cc9b94e4-c38d-4185-a97d-c7af4b1582cd Task timed out after 900.11 seconds
2024-09-27 18:42:14.758 END RequestId: cc9b94e4-c38d-4185-a97d-c7af4b1582cd
2024-09-27 18:42:14.758 REPORT RequestId: cc9b94e4-c38d-4185-a97d-c7af4b1582cd  Duration: 900105.65 ms  Billed Duration: 900642 ms  Memory Size: 3072 MB    Max Memory Used: 1022 MB    Init Duration: 641.17 ms

The above log was obtained with the following logs insights query:

filter @message like "cc9b94e4-c38d-4185-a97d-c7af4b1582cd"
| display @timestamp, coalesce(message, @message)
| sort @timestamp asc
| limit 1000

The log above seems to indicate that the function is getting hung up while running clamscan. Most executions finish that part of the process in < 20 seconds. While most executions timed out, there were a few that finished before the function timeout, albeit after an enormously long time:

2024-09-27 18:21:24.431 START RequestId: 70a0e989-0fb2-4100-bffd-955aa24e55d5 Version: $LATEST
2024-09-27 18:21:24.431 {"level":"INFO","location":"decorate:448","message":{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"us-east-2","eventTime":"2024-09-27T18:21:21.477Z","eventName":"ObjectCreated:Copy","userIdentity":{"principalId":"<REDACTED>"},"requestParameters":{"sourceIPAddress":"<REDACTED>"},"responseElements":{"x-amz-request-id":"<REDACTED>","x-amz-id-2":"<REDACTED>"},"s3":{"s3SchemaVersion":"1.0","configurationId":"<REDACTED>","bucket":{"name":"<REDACTED>","ownerIdentity":{"principalId":"<REDACTED>"},"arn":"arn:aws:s3:::<REDACTED>"},"object":{"key":"<REDACTED>","size":22425,"eTag":"<REDACTED>","sequencer":"<REDACTED>"}}}]},"timestamp":"2024-09-27 18:21:24,431+0000","service":"virus-scan","cold_start":true,"function_name":"<REDACTED>","function_memory_size":"3072","function_arn":"arn:aws:lambda:us-east-2:<REDACTED>:function:<REDACTED>","function_request_id":"70a0e989-0fb2-4100-bffd-955aa24e55d5","xray_trace_id":"<REDACTED>"}
2024-09-27 18:21:24.432 {"level":"INFO","location":"lambda_handler:67","message":{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"us-east-2","eventTime":"2024-09-27T18:21:21.477Z","eventName":"ObjectCreated:Copy","userIdentity":{"principalId":"<REDACTED>"},"requestParameters":{"sourceIPAddress":"<REDACTED>"},"responseElements":{"x-amz-request-id":"<REDACTED>","x-amz-id-2":"<REDACTED>"},"s3":{"s3SchemaVersion":"1.0","configurationId":"<REDACTED>","bucket":{"name":"<REDACTED>","ownerIdentity":{"principalId":"<REDACTED>"},"arn":"arn:aws:s3:::<REDACTED>"},"object":{"key":"<REDACTED>","size":22425,"eTag":"<REDACTED>","sequencer":"<REDACTED>"}}}]},"timestamp":"2024-09-27 18:21:24,431+0000","service":"virus-scan","cold_start":true,"function_name":"<REDACTED>","function_memory_size":"3072","function_arn":"arn:aws:lambda:us-east-2:<REDACTED>:function:<REDACTED>","function_request_id":"70a0e989-0fb2-4100-bffd-955aa24e55d5","xray_trace_id":"<REDACTED>"}
2024-09-27 18:21:24.432 Attempting to set scan-status of <REDACTED> to IN PROGRESS
2024-09-27 18:21:24.543 Attempting to create the EFS filepath /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5/28446eed66e4fd856f40b845d9b4f5ea6dd0ee4efb00bd0ba0db95754e63a4d4 if it does not already exist
2024-09-27 18:21:24.561 Attempting to create the EFS filepath /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5-tmp/28446eed66e4fd856f40b845d9b4f5ea6dd0ee4efb00bd0ba0db95754e63a4d4 if it does not already exist
2024-09-27 18:21:24.578 Attempting to download of <REDACTED> from the <REDACTED> S3 Bucket to the EFS filepath /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5
2024-09-27 18:21:24.698 Attempting to create the EFS filepath /mnt/lambda/virus_database/28446eed66e4fd856f40b845d9b4f5ea6dd0ee4efb00bd0ba0db95754e63a4d4 if it does not already exist
2024-09-27 18:21:24.704 Attempting to update the virus database. Last update was at 0 epoch time. Current time is 1727461284.7040343
2024-09-27 18:21:25.330 Attempting to scan the file <REDACTED> from the <REDACTED> S3 Bucket located at the EFS path /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5
2024-09-27 18:28:54.606 Attempting to set scan-status of <REDACTED> to CLEAN
2024-09-27 18:28:54.704 Attempting to delete the file located at the EFS path /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5
2024-09-27 18:28:54.721 Attempting to delete the file located at the EFS path /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5-tmp
2024-09-27 18:28:54.738 {"level":"INFO","location":"lambda_handler:104","message":{"source":"serverless-clamscan","input_bucket":"<REDACTED>","input_key":"<REDACTED>","status":"CLEAN","message":"Scanning /mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5/<REDACTED>\n/mnt/lambda/70a0e989-0fb2-4100-bffd-955aa24e55d5/<REDACTED>: OK\n\n----------- SCAN SUMMARY -----------\nKnown viruses: 8698513\nEngine version: 0.103.11\nScanned directories: 2\nScanned files: 1\nInfected files: 0\nData scanned: 0.04 MB\nData read: 0.02 MB (ratio 2.00:1)\nTime: 449.270 sec (7 m 29 s)\nStart Date: 2024:09:27 18:21:25\nEnd Date:   2024:09:27 18:28:54\n"},"timestamp":"2024-09-27 18:28:54,737+0000","service":"virus-scan","cold_start":true,"function_name":"<REDACTED>","function_memory_size":"3072","function_arn":"arn:aws:lambda:us-east-2:<REDACTED>:function:<REDACTED>","function_request_id":"70a0e989-0fb2-4100-bffd-955aa24e55d5","xray_trace_id":"<REDACTED>"}
2024-09-27 18:28:54.739 END RequestId: 70a0e989-0fb2-4100-bffd-955aa24e55d5
2024-09-27 18:28:54.739 REPORT RequestId: 70a0e989-0fb2-4100-bffd-955aa24e55d5  Duration: 450297.65 ms  Billed Duration: 451014 ms  Memory Size: 3072 MB    Max Memory Used: 1666 MB    Init Duration: 716.17 ms

Note the Time: 449.270 sec (7 m 29 s) in the above log. That's more than 22x the ~20s average that clamscan had been taking up until then.

My next step was to look at metrics to see why clamscan might be slowing down.

Scanner function metrics around the time of the outage (screenshot attachment: Pasted image 20241011142903).

EFS metrics around the time of the outage (screenshot attachment: Pasted image 20241011132312).

The throughput utilization graph above doesn't even tell the full story. Here's a better view over a wider time range and without an upper limit on the y-axis (screenshot attachment: Pasted image 20241011131608).

The EFS metrics looked the most troubling to me, and the disk getting slammed would explain the long scan times. This wasn't a particularly high-traffic period; the number of incoming requests was on par with or lower than other days/weeks at the same time of day. There also wasn't anything unusual about the file in the second log above, which, again, is the first file that took longer than the ~20s average. The invocation durations didn't slowly work their way up to the timeout; they went straight from 20s to 7 minutes to the timeout (15 minutes). The filesystem utilization followed the same pattern; it quickly jumped from ~5-20% all the way up to > 100%.

One obvious explanation would be an extremely large file upload causing a spike in filesystem utilization. That, however, does not appear to be the case: running the following Logs Insights query over the two hours leading up to the issue shows that the largest file uploaded in that window was only 200 KB:

filter level = "INFO" AND location = "decorate:448"
| fields message.Records.0.s3.object.size / 1000 as size
| display @timestamp, function_request_id, size
| sort size desc
| limit 100

At this point, since the feature flag had been off for a while and there were thus no more files to scan, the filesystem utilization was back down and all retries on in-flight requests had been exhausted. I decided to manually upload some files to see if the system was functional again. Despite there being essentially zero load on the function and filesystem, these tests failed.

The next place I decided to look was at the filesystem directly. I made the configuration changes necessary to launch an EC2 instance in the VPC that the solution provisions, SSH'd into it, and mounted the filesystem. The first thing I noticed was that there were hundreds of thousands of (mostly empty) directories in the /lambda folder. Running ls -1 | wc -l took several minutes, and while it ran, the metadata utilization of the EFS was pegged at 100%. This was probably the source of the problem. It took several hours, but I was able to rm -rf everything in the lambda folder (except for the virus definitions folder), and once that was done my tests succeeded.
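
For anyone who ends up needing to do the same kind of manual cleanup, here is roughly what that sweep looks like as a minimal Python sketch. The /mnt/efs mount point and the script itself are assumptions on my part, not anything shipped with the solution; the only name taken from the logs above is the virus_database directory, which holds the shared virus definitions and must be left untouched.

#!/usr/bin/env python3
# Rough sketch of the manual cleanup, run from an EC2 instance with the EFS mounted.
# Assumes the filesystem is mounted at /mnt/efs, so the path the lambda sees as
# /mnt/lambda shows up here as /mnt/efs/lambda -- adjust MOUNT_ROOT to match your
# own mount point before running anything like this.
import shutil
from pathlib import Path

MOUNT_ROOT = Path("/mnt/efs/lambda")  # assumed EC2-side mount point
KEEP = {"virus_database"}             # shared virus definitions, never removed

removed = 0
# iterdir() streams entries one at a time instead of building the entire listing,
# which matters when the directory holds hundreds of thousands of entries and
# every metadata operation is slow.
for entry in MOUNT_ROOT.iterdir():
    if entry.name in KEEP or not entry.is_dir():
        continue
    shutil.rmtree(entry, ignore_errors=True)
    removed += 1
    if removed % 1000 == 0:
        print(f"removed {removed} directories so far...")
print(f"done, removed {removed} directories")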

After clearing out the directory and running my tests, I noticed that a large number of folders were once again hanging around in the lambda folder. It seems like this library needs to be updated to ensure those per-request directories get cleaned up, to prevent an issue like this from happening again. That is the primary reason I'm opening this issue.
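
I don't know what shape the maintainers would want a fix to take, but to make the request concrete, here is a hedged sketch of one possible mitigation: since the handler's own delete step never runs when an invocation times out, stale per-request directories could be swept up separately. The function name, the staleness threshold, and where to hook it in are all my own assumptions rather than the library's existing API; only the /mnt/lambda path and the virus_database directory come from the logs above.

import shutil
import time
from pathlib import Path

MOUNT_PATH = Path("/mnt/lambda")   # EFS mount path seen by the scan function
KEEP = {"virus_database"}          # shared virus definitions, never removed
STALE_AFTER_SECONDS = 2 * 60 * 60  # hypothetical threshold, comfortably above the 900s timeout

def sweep_stale_request_dirs(mount_path: Path = MOUNT_PATH) -> int:
    # Remove per-request directories (both <request_id> and <request_id>-tmp)
    # left behind by invocations that timed out before their own cleanup ran.
    # Only directories untouched for longer than STALE_AFTER_SECONDS are removed,
    # so in-flight scans are left alone.
    now = time.time()
    removed = 0
    for entry in mount_path.iterdir():
        if entry.name in KEEP or not entry.is_dir():
            continue
        try:
            if now - entry.stat().st_mtime > STALE_AFTER_SECONDS:
                shutil.rmtree(entry, ignore_errors=True)
                removed += 1
        except FileNotFoundError:
            pass  # a concurrent invocation may have already removed it
    return removed

Running something like this on every invocation would itself get expensive once the directory count has already exploded (the ls timing above shows how slow listing becomes), so a scheduled sweep outside the request path might be the better shape for a real fix.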

That being said, the reason I'm providing so much detail here is that this still doesn't quite add up in my head. Why would the performance of the lambda function drop off a cliff so suddenly? If this were a problem with the number of directories growing over time, wouldn't the performance have degraded gradually until it reached the timeout? Why did the utilization suddenly jump from relatively low to > 100%?

I'll also be opening a case with AWS support to determine if this could have been a fluke with EFS, but I think that's pretty unlikely. Still, maybe they will be able to provide some insights into the problem.