bluesentry / bucket-antivirus-function

Serverless antivirus for cloud storage.
Apache License 2.0
536 stars 388 forks

Clamav lambda timeouts #130

Open code-memento opened 4 years ago

code-memento commented 4 years ago

Hi,

We are seeing some weird behavior: for a year or more the Lambda functions worked without any issues.

Recently, though, the scan takes forever and is killed by the Lambda timeout.

Any ideas?

Thanks and regards

2020-07-06T16:55:28.822+01:00 Attempting to create directory /tmp/clamav_defs.
2020-07-06T16:55:28.963+01:00 Not downloading older file in series: daily.cvd
2020-07-06T16:55:29.006+01:00 Downloading definition file /tmp/clamav_defs/main.cvd from s3://clamav_defs/main.cvd
2020-07-06T16:55:30.817+01:00 Downloading definition file /tmp/clamav_defs/main.cvd complete!
2020-07-06T16:55:30.817+01:00 Downloading definition file /tmp/clamav_defs/daily.cld from s3://clamav_defs/daily.cld
2020-07-06T16:55:31.913+01:00 Downloading definition file /tmp/clamav_defs/daily.cld complete!
2020-07-06T16:55:31.913+01:00 Downloading definition file /tmp/clamav_defs/bytecode.cvd from s3://clamav_defs/bytecode.cvd
2020-07-06T16:55:31.979+01:00 Downloading definition file /tmp/clamav_defs/bytecode.cvd complete!
2020-07-06T16:55:31.979+01:00 Starting clamscan of /tmp/bucket/documents/file.png.
2020-07-06T17:00:28.513+01:00 END RequestId: cd1e0053-5477-459d-a3c2-7dee7e125378
2020-07-06T17:00:28.513+01:00 REPORT RequestId: cd1e0053-5477-459d-a3c2-7dee7e125378 Duration: 300085.43 ms Billed Duration: 300000 ms Memory Size: 1024 MB Max Memory Used: 1025 MB Init Duration: 500.47 ms
2020-07-06T17:00:28.513+01:00 2020-07-06T16:00:28.512Z cd1e0053-5477-459d-a3c2-7dee7e125378 Task timed out after 300.09 seconds
mogusbi commented 4 years ago

I've seen the same behaviour since last Friday (3rd July).

I don't know if it's related or not but on that same day I started having trouble when building the Docker container, with this error:

Trying other mirror.

One of the configured repositories failed (Extra Packages for Enterprise Linux 7 - x86_64),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:

    1. Contact the upstream for the repository and get them to fix the problem.

    2. Reconfigure the baseurl/etc. for the repository, to point to a working
       upstream. This is most often useful if you are using a newer
       distribution release than is supported by the repository (and the
       packages for the previous distribution release still work).

    3. Run the command with the repository temporarily disabled
           yum --disablerepo=epel ...

    4. Disable the repository permanently, so yum won't use it by default. Yum
       will then just ignore the repository until you permanently enable it
       again or use --enablerepo for temporary usage:

           yum-config-manager --disable epel
       or
           subscription-manager repos --disable=epel

    5. Configure the failing repository to be skipped, if it is unavailable.
       Note that yum will try to contact the repo. when it runs most commands,
       so will have to try and fail each time (and thus. yum will be be much
       slower). If it is a very temporary problem though, this is often a nice
       compromise:

           yum-config-manager --save --setopt=epel.skip_if_unavailable=true

failure: repodata/repomd.xml from epel: [Errno 256] No more mirrors to try.

Something changed last week and I can't work out what, why, or how to fix it so that everything starts working again.

code-memento commented 4 years ago

Hi @mogusbi

Indeed, something has changed; in my case it is in the execution of the Lambda.

Your issue is related to the build phase: it cannot pull from epel.

Maybe you should use another repository.

mogusbi commented 4 years ago

@code-memento Sorry, I should have made myself clearer: the issue pulling down epel is intermittent, so it does eventually work after a few retries. I only mentioned it because I started seeing it on the same day I subsequently started seeing problems with my Lambda.

When it does eventually build and deploy, I then see the same issue as you, with the Lambda function timing out when I try to scan a file.

code-memento commented 4 years ago

Hi @mogusbi

Okay, so we're in the same boat 😆.

In my case, we did not change the Lambda zip. It worked like a charm until a week or so ago. When I cleaned the _clamav_defs_ bucket it seemed to work for a moment, but then it started to time out again. Even with the timeout set to 15 min, it hangs until the end.

If the Lambda did not change, and the defs are not the cause, is it related to the AWS runtime :sweat_smile: ?

mogusbi commented 4 years ago

It could well be; the Amazon Linux OS was updated 8 days ago: https://hub.docker.com/_/amazonlinux?tab=tags

Although the release notes say it was updated last month: https://aws.amazon.com/amazon-linux-2/release-notes/

code-memento commented 4 years ago

@mogusbi Do you think building with the latest amazonlinux image could solve this issue? I think it's more related to the runtime. Moreover, I think the Lambda behaves differently when the container is reused (in that case the defs are not downloaded). Did you notice anything about this?

mogusbi commented 4 years ago

It hasn't fixed the problem for me

mogusbi commented 4 years ago

@code-memento yes, it looks like the issue only appears on a cold start. Subsequent requests to scan work fine once the Lambda is warm
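For reference, the warm/cold difference comes down to whether the definition files are already cached in /tmp when the handler runs; here is a minimal sketch of that kind of check (the paths and file names are illustrative, not the repo's actual code):

# Illustrative sketch only (not the repo's actual logic): skip the S3 download
# when the definition files already exist in /tmp, i.e. the container is warm.
import os

AV_DEFINITION_PATH = "/tmp/clamav_defs"  # where the defs are cached
DEFINITION_FILES = ["main.cvd", "daily.cld", "bytecode.cvd"]  # expected files


def defs_already_present(path=AV_DEFINITION_PATH, names=DEFINITION_FILES):
    """Return True if every expected definition file is already on disk."""
    return all(
        os.path.isfile(os.path.join(path, name))
        and os.path.getsize(os.path.join(path, name)) > 0
        for name in names
    )


if defs_already_present():
    print("Warm container: definitions already cached, skipping the S3 download.")
else:
    print("Cold start: definitions missing, download them from S3 before scanning.")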

mogusbi commented 4 years ago

@code-memento I've upped the memory of my functions from 1024 to 2048 and that appears to have fixed the issue (for now)
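For anyone who would rather script that change than click through the console, here is a minimal boto3 sketch (the function name is a placeholder, not the repo's):

# Sketch: raise the scan Lambda's memory (and, if needed, its timeout) via boto3.
# "bucket-antivirus-scan" is a placeholder; substitute your function's name.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="bucket-antivirus-scan",  # placeholder
    MemorySize=2048,  # MB, up from 1024
    Timeout=300,      # seconds
)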

code-memento commented 4 years ago

It seems to work (screenshot attached): the Lambda needs approx. 1290 MB. I'll do more tests to make sure that all the cases are covered. Thanks @mogusbi

code-memento commented 4 years ago

I ran many tests and it seems to do the trick; no problems so far. Thanks @mogusbi

mogusbi commented 4 years ago

That's good to hear!

I'm still slightly concerned as to why it suddenly needs more memory. It would be good to get to the bottom of that, as throwing more memory at it treats the symptom but not the disease.

code-memento commented 4 years ago

You can say that again! The only explanation I found is that the clamav_defs have been updated, so the Lambda needs more resources for the scan.

code-memento commented 4 years ago

I found this error in the update lambda; maybe it's related:

b"ClamAV update process started at Fri Jul 10 08:32:26 2020\ndaily database available for update (local version: 25863, remote version: 25868)\nERROR: buildcld: Can't add daily.hsb to new daily.cld - please check if there is enough disk space available\nERROR: buildcld: gzclose() failed for /tmp/clamav_defs/tmp.6bd07/clamav-e2595ffff6f8a72f6094fc40802f8921.tmp\nERROR: updatedb: Incremental update failed. Failed to build CLD.\nERROR: Unexpected error when attempting to update database: daily\nWARNING: fc_update_databases: fc_update_database failed: Failed to update database (14)\nERROR: Database update process failed: Failed to update database (14)\nERROR: Update failed.\n"
Muthuveerappanv commented 4 years ago

@code-memento You're right. The definition-update lambda is failing with this error and it is impacting the scan. Did you find a fix for this:

'b"ClamAV update process started at Fri Jul 10 08:32:26 2020\ndaily database available for update (local version: 25863, remote version: 25868)\nERROR: buildcld: Can't add daily.hsb to new daily.cld - please check if there is enough disk space available\nERROR: buildcld: gzclose() failed for /tmp/clamav_defs/tmp.6bd07/clamav-e2595ffff6f8a72f6094fc40802f8921.tmp\nERROR: updatedb: Incremental update failed. Failed to build CLD.\nERROR: Unexpected error when attempting to update database: daily\nWARNING: fc_update_databases: fc_update_database failed: Failed to update database (14)\nERROR: Database update process failed: Failed to update database (14)\nERROR: Update failed.\n"

code-memento commented 4 years ago

@Muthuveerappanv The error disappears if you delete clamav_defs. AFAIK the /tmp folder is limited to 512MB.

Muthuveerappanv commented 4 years ago

@Muthuveerappanv The error disappears if you delete clamav_defs. AFAIK the /tmp folder is limited to 512MB.

You mean delete the clamav_defs on the definitions S3 bucket?

code-memento commented 4 years ago

@Muthuveerappanv Yes, the definitions bucket.
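For anyone who wants to script that cleanup rather than delete objects by hand, here is a minimal boto3 sketch, assuming the bucket and prefix come from the same AV_DEFINITION_S3_BUCKET / AV_DEFINITION_S3_PREFIX environment variables the lambdas use:

# Sketch: delete everything under the definitions prefix so the update lambda
# rebuilds the defs from scratch. Bucket/prefix values are assumptions taken
# from the AV_DEFINITION_S3_BUCKET / AV_DEFINITION_S3_PREFIX env vars.
import os

import boto3

bucket = os.environ["AV_DEFINITION_S3_BUCKET"]
prefix = os.environ.get("AV_DEFINITION_S3_PREFIX", "clamav_defs")

s3 = boto3.resource("s3")
# objects.filter(...).delete() batches the deletions for us
s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
print("Deleted all objects under s3://%s/%s" % (bucket, prefix))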

DimitrijeManic commented 4 years ago

Just wondering if there are alternative solutions other than deleting clamav_defs in the s3 bucket?

culshaw commented 4 years ago

I would also like to know this; I dove deep into trying to figure it out last weekend.

I read somewhere that somebody mentioned putting the definitions directly into memory after download to free up /tmp, but I have no idea how to do this.

code-memento commented 4 years ago

@culshaw I don't see how that can be done, as the ClamAV scan is, in the end, a command-line execution with different parameters. @DimitrijeManic It's just speculation; the real issue is that the scan needs more memory (> 1024 MB). Since the code didn't change for any of us, I suspect it might be caused by the defs.

DimitrijeManic commented 4 years ago

@code-memento I have set my Lambda to 2048, but I believe the issue comes from the hard limit on the /tmp dir.

Possible solutions?

  1. Set up EFS as a storage solution
  2. Treat every update as if there isn't anything in the S3 clamav_defs bucket (comment out the fetch part)

Thoughts?

DimitrijeManic commented 4 years ago

This can be reproduced by adding a volume with a size limit in scripts/run-update-lambda:

#! /usr/bin/env bash

set -eu -o pipefail

#
# Run the update.lambda_handler locally in a docker container
#

rm -rf tmp/
unzip -qq -d ./tmp build/lambda.zip

NAME="antivirus-update"

# Simulate /tmp/ dir with a 512m size restriction
docker volume create --driver local --opt type=tmpfs --opt device=tmpfs --opt o=size=512m,uid=496 clamav_defs

docker run --rm \
  -v "$(pwd)/tmp/:/var/task" \
  -v clamav_defs:/tmp \
  -e AV_DEFINITION_PATH \
  -e AV_DEFINITION_S3_BUCKET \
  -e AV_DEFINITION_S3_PREFIX \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_DEFAULT_REGION \
  -e AWS_REGION \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  -e CLAMAVLIB_PATH \
  --memory="${MEM}" \
  --memory-swap="${MEM}" \
  --cpus="${CPUS}" \
  --name="${NAME}" \
  lambci/lambda:python3.7 update.lambda_handler

A hacky workaround to avoid re-downloading the existing ClamAV defs in update.py:

# -*- coding: utf-8 -*-
# Upside Travel, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import boto3

import clamav
from common import AV_DEFINITION_PATH
from common import AV_DEFINITION_S3_BUCKET
from common import AV_DEFINITION_S3_PREFIX
from common import CLAMAVLIB_PATH
from common import get_timestamp
import shutil

def lambda_handler(event, context):
    s3 = boto3.resource("s3")
    s3_client = boto3.client("s3")

    print("Script starting at %s\n" % (get_timestamp()))

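    # Wipe anything left in the definitions dir from a previous (warm) invocation
    # so freshclam has enough /tmp space to build the new daily.cld.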
    for root, dirs, files in os.walk(AV_DEFINITION_PATH):
        for f in files:
            os.unlink(os.path.join(root, f))
        for d in dirs:
            shutil.rmtree(os.path.join(root, d))

    to_download = clamav.update_defs_from_s3(
        s3_client, AV_DEFINITION_S3_BUCKET, AV_DEFINITION_S3_PREFIX
    )

    print("Skipping clamav definition download %s\n" % (get_timestamp()))
    # for download in to_download.values():
    #     s3_path = download["s3_path"]
    #     local_path = download["local_path"]
    #     print("Downloading definition file %s from s3://%s" % (local_path, s3_path))
    #     s3.Bucket(AV_DEFINITION_S3_BUCKET).download_file(s3_path, local_path)
    #     print("Downloading definition file %s complete!" % (local_path))

    clamav.update_defs_from_freshclam(AV_DEFINITION_PATH, CLAMAVLIB_PATH)
    # If main.cvd gets updated (very rare), we will need to force freshclam
    # to download the compressed version to keep file sizes down.
    # The existence of main.cud is the trigger to know this has happened.
    if os.path.exists(os.path.join(AV_DEFINITION_PATH, "main.cud")):
        os.remove(os.path.join(AV_DEFINITION_PATH, "main.cud"))
        if os.path.exists(os.path.join(AV_DEFINITION_PATH, "main.cvd")):
            os.remove(os.path.join(AV_DEFINITION_PATH, "main.cvd"))
        clamav.update_defs_from_freshclam(AV_DEFINITION_PATH, CLAMAVLIB_PATH)
    clamav.upload_defs_to_s3(
        s3_client, AV_DEFINITION_S3_BUCKET, AV_DEFINITION_S3_PREFIX, AV_DEFINITION_PATH
    )
    print("Script finished at %s\n" % get_timestamp())
code-memento commented 4 years ago

@DimitrijeManic does this solution fix the lambda timeout issue ?

wangcarlton commented 4 years ago

Have seen a similar issue recently. Try this:

File size: less than 1 MB
Case 1: Lambda MEM: 1024 MB, timeout: 10 minutes, result: timed out after 10 minutes.
Case 2: Lambda MEM: 2048 MB, timeout: 3 minutes, result: succeeded after 21 seconds with 1299 MB of memory used.

So I suggest using 2048 MB instead; it can reduce the Lambda timeout significantly.

code-memento commented 4 years ago

@wangcarlton After some digging, it seems that clamscan is a well-known memory beast. The recent issues are without doubt caused by the increase in the number of virus definitions.
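If anyone wants to sanity-check that theory against their own cached defs, here is a small sketch using sigtool (which ships with ClamAV and prints metadata, including signature counts, for each definition file); the /tmp path is an assumption:

# Sketch: print sigtool's metadata (including signature counts) for each
# locally cached definition file. Adjust the glob if your defs live elsewhere.
import glob
import subprocess

for path in sorted(glob.glob("/tmp/clamav_defs/*.c[lv]d")):
    info = subprocess.run(
        ["sigtool", "--info", path], capture_output=True, text=True, check=False
    )
    print(path)
    print(info.stdout)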

DimitrijeManic commented 4 years ago

Increasing the Lambda MEM to 2048 has resolved the timeout issue; however, the next problem is disk space in /tmp.

The Lambda will complete successfully, but this error message will be in the logs:

ClamAV update process started at Fri Jul 10 08:32:26 2020
daily database available for update (local version: 25863, remote version: 25868)
ERROR: buildcld: Can't add daily.hsb to new daily.cld - please check if there is enough disk space available
ERROR: buildcld: gzclose() failed for /tmp/clamav_defs/tmp.6bd07/clamav-e2595ffff6f8a72f6094fc40802f8921.tmp
ERROR: updatedb: Incremental update failed. Failed to build CLD.
ERROR: Unexpected error when attempting to update database: daily
WARNING: fc_update_databases: fc_update_database failed: Failed to update database (14)
ERROR: Database update process failed: Failed to update database (14)
ERROR: Update failed.

So maybe this issue is resolved and we can continue the discussion in https://github.com/upsidetravel/bucket-antivirus-function/issues/128 ?

wangcarlton commented 4 years ago

I am using this: https://github.com/upsidetravel/bucket-antivirus-function. I guess this is a new issue that appeared after the upgrade from ClamAV 0.102.2 to 0.102.3. I was trying to solve it today, but it seems that using another directory (such as /var/task) is prohibited by AWS. Lambda has a fixed 500 MB of storage which can't be changed: https://aws.amazon.com/lambda/faqs/

Q: What if I need scratch space on disk for my AWS Lambda function?
Each Lambda function receives 500MB of non-persistent disk space in its own /tmp directory.

It also took me the whole afternoon to figure out that some libs (such as libprelude, etc.) need to be installed and the env path needs to be updated to run freshclam after the upgrade from 0.102.2 to 0.102.3. I am going to migrate the Lambda to an EC2 instance (more stable and under my control) to update the definition files.
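If it helps with debugging, you can at least log how much of the scratch space is left before freshclam runs; here is a minimal sketch using only the standard library (nothing repo-specific):

# Sketch: log how much of the Lambda's /tmp scratch space is left before
# running freshclam, to confirm the "not enough disk space" theory.
import shutil

def log_tmp_usage(path="/tmp"):
    usage = shutil.disk_usage(path)
    print(
        "/tmp usage: %.0f MB used of %.0f MB (%.0f MB free)"
        % (usage.used / 1e6, usage.total / 1e6, usage.free / 1e6)
    )

log_tmp_usage()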

ostigley commented 4 years ago

Thanks @DimitrijeManic, your snippet fixed my update.py issues (running out of space).

anuj9196 commented 4 years ago

Increasing the memory worked for me

dalekurt commented 3 years ago

Same here, increasing the memory did the trick.

baartch commented 3 years ago

Increasing the Lambda MEM to 2048 has resolved the timeout issue; however, the next problem is disk space in /tmp.

Lambda has a fixed 500 MB of storage which can't be changed

Guys, I know it's been a long time since you wrote this. I just want to mention that it is possible to attach an EFS (Elastic File System) volume to a Lambda, and then you have nearly unlimited storage available.

https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/

just make sure to avoid this error: https://github.com/aws/serverless-application-model/issues/1631#issuecomment-648049879

And note: you have to delete the files yourself after scanning.
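A minimal sketch of that cleanup step, assuming a hypothetical /mnt/scan mount path (use whatever LocalMountPath you configured on the Lambda's file system config):

# Sketch: remove scanned files from the EFS mount after the scan finishes.
# "/mnt/scan" is a hypothetical mount point, not something from this repo.
import os
import shutil

EFS_SCAN_DIR = "/mnt/scan"  # hypothetical mount point

def cleanup_scan_dir(path=EFS_SCAN_DIR):
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)
        else:
            os.remove(full)

cleanup_scan_dir()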

abhinavbom commented 3 years ago

The best solution for this problem is to increase the memory to 2048 MB. Thanks, folks.