awslabs / cdk-serverless-clamscan


Intermittent Lambda Timeout #1259

Open jrb331 opened 1 month ago

jrb331 commented 1 month ago

I have had the CDK stack running in production for roughly 3 months without issue. Starting earlier this week, we are seeing an issue where the Lambdas time out after the 10 minute max. S3 objects are stuck with a scan-status tag of IN PROGRESS, or sometimes the scan hangs before they are even tagged. I saw this issue describing a similar problem, but updating the cdk-serverless-clamscan package to 2.6.230 did not seem to resolve the issue. I would appreciate any guidance you have here; I'm happy to provide any additional detail.

Screenshot 2024-07-18 at 3 11 27 PM

dontirun commented 1 month ago

Are there any error messages in the CloudWatch Logs?

jrb331 commented 1 month ago

All logs show a "Task timed out after 600 seconds" error message; a handful of them are attached below. Is there any way to increase the verbosity of the logs?

log-events-viewer-result.xlsx

jrb331 commented 1 month ago

Also potentially worth noting: some objects are stuck in an IN PROGRESS state (which occurs here) after Lambda failure, while others time out before the object gets tagged at all. That is the only indication of where the issue might be happening that I've been able to discern thus far.

jrb331 commented 1 month ago

Screenshot 2024-07-19 at 12 41 37 PM

This timeline of metrics illustrates the problem well, I think. I began with small numbers of concurrent uploads (75 every 5 minutes) and saw no issue. I then increased the number of concurrent uploads, first to 150 and then to 300. As you can see, there are still no durations anywhere near long enough to cause a timeout, and 100% of the executions were successful. Once I bumped the number up to 1000, however, the problem resurfaced. You can see the execution duration balloon to the maximum configured for the Lambda, the errors surface in bulk, the 3 retries get attempted (unsuccessfully), and finally the majority of events get dropped. You can also see that I'm maxing out at a concurrency of ~800. I don't have any other Lambdas in this account that fire frequently enough to push me over the 1000 concurrent executions limit. Hopefully this additional context is helpful.

jrb331 commented 1 month ago

Screenshot 2024-07-19 at 12 51 50 PM
Screenshot 2024-07-19 at 12 52 10 PM

It seems like cold starts may be related; the screenshots above capture the relevant observations. The questions I have are the following:

Why would an increased quantity of cold starts increase the amount of time each individual Lambda takes to execute?

The above screenshots show that cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of Lambdas require a cold start?

dontirun commented 1 month ago

The questions I have are the following: Why would an increased quantity of cold starts increase the amount of time each individual Lambda takes to execute? The above screenshots show that cold starts were required even for the batches that all executed successfully, so why does the task duration balloon when a large number of Lambdas require a cold start?

From the perspective of the Lambda application code, a cold start means that the ClamAV and Freshclam processes both need to start up, and both consume CPU/memory. In addition, Freshclam needs to update the local virus database from the EFS filesystem.
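To make that concrete, here is a minimal sketch (in Python, since the scan function is Python-based) of where that cold-start cost tends to land. This is not the construct's actual handler code; the EFS mount path, commands, and structure are illustrative assumptions.

```python
# Sketch only: illustrates why a cold start is expensive, not the real handler.
import subprocess
import time

DEFS_DIR = "/mnt/lambda"  # hypothetical EFS mount point for the virus definitions

# Module scope runs once per execution environment, i.e. on every cold start:
# each new environment pays for the definition refresh and ClamAV startup
# before it can serve its first event.
_t0 = time.monotonic()
subprocess.run(["freshclam", f"--datadir={DEFS_DIR}"], check=False)
print(f"cold start: freshclam refresh took {time.monotonic() - _t0:.1f}s")


def handler(event, context):
    # Warm invocations start here and skip the refresh above entirely, which is
    # why warm-heavy batches finish quickly while cold-heavy batches (e.g. 1000
    # concurrent uploads) stack the startup cost on every execution environment.
    subprocess.run(
        ["clamscan", f"--database={DEFS_DIR}", "/tmp/downloaded-object"],
        check=False,
    )
```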

As to why the cold starts are happening so frequently, that would need to be answered by AWS Support.

jrb331 commented 1 month ago

Is it possible to increase the verbosity of the logs? Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

dontirun commented 1 month ago

Is it possible to increase the verbosity of the logs

What specifically are you looking for? It's possible to add log statements as the Scanning Lambda function enters each "phase" of the process. The Lambda does log when a specific process throws an error, but timeouts wouldn't be logged in CloudWatch Logs as the execution halts.
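For reference, breadcrumb logging of that sort could look roughly like the sketch below. The phase names and structure are assumptions for illustration, not the library's actual code; the idea is simply that the last logged phase before a timeout tells you where the execution was stuck.

```python
# Sketch: phase breadcrumbs so a "Task timed out" execution still reveals
# which phase was in flight (phase names here are illustrative).
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_phase(name: str) -> None:
    # Emitted to CloudWatch Logs immediately, so it survives a later timeout.
    logger.info("phase=%s t=%.3f", name, time.monotonic())


def handler(event, context):
    log_phase("download_object")      # copy the S3 object to scratch storage
    # ... download ...
    log_phase("update_definitions")   # refresh the virus database from EFS
    # ... freshclam ...
    log_phase("scan")                 # run the ClamAV scan on the file
    # ... clamscan ...
    log_phase("tag_and_publish")      # tag the object and publish the result
```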

jrb331 commented 1 month ago

I'd like to see if I can narrow in on where the execution is timing out. Adding log statements at each phase sounds like it would help achieve this. Could you help me understand how to enable this?

dontirun commented 1 month ago

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.
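If you do want to experiment with larger configurations from your CDK app, something along these lines is one way to do it (shown in Python CDK; the TypeScript props are analogous). Reaching into the construct with an escape hatch is an assumption on my part about how your stack is wired; prefer a dedicated memory-size prop if the version you have installed exposes one.

```python
# Sketch, not official guidance: bump the scan function's memory via a CDK
# escape hatch when the installed construct version doesn't expose a prop for it.
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_s3 as s3
from cdk_serverless_clamscan import ServerlessClamscan


def add_clamscan(scope, bucket: s3.IBucket) -> None:
    scan = ServerlessClamscan(scope, "Clamscan", buckets=[bucket])

    # Walk the construct tree and override MemorySize on the underlying
    # CfnFunction resources. Note this hits every function inside the
    # construct, so you may want to filter by construct path. 10240 MB
    # matches the construct's documented default.
    for child in scan.node.find_all():
        if isinstance(child, lambda_.CfnFunction):
            child.add_property_override("MemorySize", 10240)
```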

jrb331 commented 1 month ago

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

dontirun commented 1 month ago

Amazing. Thank you for that. Is it possible to pull these changes into my project without a new version of the cdk-serverless-clamscan package?

You would have to manually edit your local files to reflect the changes. I cannot guarantee that manual changes like this won't result in a failed deployment.

jrb331 commented 1 month ago

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

jrb331 commented 1 month ago

Also, have you seen certain CPU/memory configurations function better than others? I know that 10,240 MB of memory is the default but is that recommended? I currently have 3008 MB provisioned.

This is going to be use case specific (file type, file size, scan frequency). I haven't performed testing that would provide any real guidance. I have personally seen that larger configurations result in faster executions both in cold starts and overall.

Also, FWIW, I tried increasing the Lambda memory to 10,240 MB and am still facing the same issue. I also validated that the files being sent to the construct have sizes on the order of KBs, which I've been able to handle in the past.

dontirun commented 1 month ago

Apologies if I'm misunderstanding your recommendation, but I don't host the files of this repo locally. I import them into my node project via the cdk-serverless-clamscan package and instantiate a ServerlessClamscan object in my cdk code. Would I have to host my own custom fork of this repo in order to achieve enhanced logging?

I recommend that you update to the latest version. If you don't want to (not recommended), you would need to directly modify the files in your node_modules folder and then deploy.