aws / aws-sam-cli

CLI tool to build, test, debug, and deploy Serverless applications using AWS SAM
https://aws.amazon.com/serverless/sam/
Apache License 2.0

Bug: sam build -u stuck on mounting #7387

Open BFlores16 opened 3 weeks ago

BFlores16 commented 3 weeks ago

Description:

I have a GitLab pipeline that builds and deploys my SAM application. The application contains about 30 Lambdas, mostly Python with some Node.js. I have never had an issue running sam build -u locally, but when the command runs in my pipeline, it hangs on the last function and gets stuck on "Mounting /builds// /tmp/samcli/source:ro,delegated, inside runtime container".

To work around this, I have to delete all artifacts in my repo and then clear the runner caches. The pipeline then works once with sam build, but gets stuck again on subsequent runs. I've tried modifying my gitlab-ci.yml in many ways with no success.

Here is my gitlab-ci.yml

variables:
  SAM_TEMPLATE: Lambdas/template.yaml
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"

services:
  - docker:23.0.6-dind

# Should always specify a specific version of the image. If using a tag like docker:stable,
# there will be no control over which version is used. Unpredictable behavior can result.
image: docker:23.0.6

before_script:
  - apk add --update python3 py-pip python3-dev build-base libffi-dev
  - pip install --upgrade pip
  - pip install awscli aws-sam-cli

stages:
  - preview
  - deploy 

preview:
  stage: preview
  script:
    - chmod 755 aws-variables.sh
    - ./aws-variables.sh
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - cd Lambdas
    - sam build -u
    - sam deploy --region $AWS_REGION --no-execute-changeset --no-fail-on-empty-changeset
    - cd ..
    - changeset_id=$(aws cloudformation describe-change-set --stack-name Lambdas --change-set-name $(aws cloudformation list-change-sets --stack-name Lambdas --query "sort_by(Summaries, &CreationTime)[-1].ChangeSetName" --output text) --query "ChangeSetId" --output text)
    - echo $changeset_id > changeset.txt
  artifacts:
    paths:
      - changeset.txt

deploy-prod:
  stage: deploy
  script:
    - chmod 755 aws-variables.sh
    - ./aws-variables.sh
    - changeset_id=$(cat changeset.txt)
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws cloudformation execute-change-set --change-set-name $changeset_id
  only:
    - main
    - develop
  when: manual
  environment:
    name: production
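
The changeset lookup buried in the preview job's one-liner can be hard to follow; broken into steps, a hedged equivalent (assuming the stack is named Lambdas, as in the job above) would be:

# Name of the most recently created change set for the Lambdas stack
latest_cs=$(aws cloudformation list-change-sets --stack-name Lambdas \
    --query "sort_by(Summaries, &CreationTime)[-1].ChangeSetName" --output text)

# Resolve its full ChangeSetId (an ARN); execute-change-set accepts the ARN
# directly, which is why deploy-prod only needs --change-set-name
changeset_id=$(aws cloudformation describe-change-set --stack-name Lambdas \
    --change-set-name "$latest_cs" --query "ChangeSetId" --output text)

echo "$changeset_id" > changeset.txt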

Here is some of the end output from sam build -u --debug

pip stderr: b"WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.\nPlease see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.\nTo avoid this problem you can invoke Python with '-m pip' instead of running pip directly.\n\n[notice] A new release of pip is available: 23.0.1 -> 24.2\n[notice] To update, run: pip install --upgrade pip\n"
Full dependency closure: {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), six==1.16.0(wheel), boto3==1.35.2(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
initial compatible: {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), boto3==1.35.2(wheel), six==1.16.0(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
initial incompatible: set()
Downloading missing wheels: set()
compatible wheels after second download pass: {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), six==1.16.0(wheel), boto3==1.35.2(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
Build missing wheels from sdists (C compiling True): set()
compatible after building wheels (no C compiling): {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), six==1.16.0(wheel), boto3==1.35.2(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
Build missing wheels from sdists (C compiling False): set()
compatible after building wheels (C compiling): {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), six==1.16.0(wheel), boto3==1.35.2(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
Final compatible: {jmespath==1.0.1(wheel), python-dateutil==2.9.0.post0(wheel), six==1.16.0(wheel), boto3==1.35.2(wheel), botocore==1.35.2(wheel), s3transfer==0.10.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel)}
Final incompatible: set()
Final missing wheels: set()
PythonPipBuilder:ResolveDependencies succeeded
 Running PythonPipBuilder:CopySource
Copying source file (/tmp/samcli/source/requirements.txt) to destination (/tmp/samcli/artifacts/requirements.txt)
Copying source file (/tmp/samcli/source/lambda_function.py) to destination (/tmp/samcli/artifacts/lambda_function.py)
PythonPipBuilder:CopySource succeeded
Full dependency closure: {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
initial compatible: {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
initial incompatible: set()
Downloading missing wheels: set()
compatible wheels after second download pass: {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
Build missing wheels from sdists (C compiling True): set()
compatible after building wheels (no C compiling): {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
Build missing wheels from sdists (C compiling False): set()
compatible after building wheels (C compiling): {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
Final compatible: {jmespath==1.0.1(wheel), boto3==1.35.2(wheel), python-dateutil==2.9.0.post0(wheel), s3transfer==0.10.2(wheel), botocore==1.35.2(wheel), aws-psycopg2==1.3.8(wheel), urllib3==1.26.19(wheel), six==1.16.0(wheel)}
Final incompatible: set()
Final missing wheels: set()
PythonPipBuilder:ResolveDependencies succeeded
 Running PythonPipBuilder:CopySource
Copying source file (/tmp/samcli/source/requirements.txt) to destination (/tmp/samcli/artifacts/requirements.txt)
Copying source file (/tmp/samcli/source/lambda_function.py) to destination (/tmp/samcli/artifacts/lambda_function.py)
2024-08-20 22:54:08,878 | Build inside container returned response {"jsonrpc": "2.0", "id": 1, "result": {"artifacts_dir": "/tmp/samcli/artifacts"}}
2024-08-20 22:54:08,878 | Build inside container was successful. Copying artifacts from container to host
PythonPipBuilder:CopySource succeeded
2024-08-20 22:54:08,881 | Build inside container returned response {"jsonrpc": "2.0", "id": 1, "result": {"artifacts_dir": "/tmp/samcli/artifacts"}}
2024-08-20 22:54:08,882 | Build inside container was successful. Copying artifacts from container to host
2024-08-20 22:54:16,214 | Copying from container: /tmp/samcli/artifacts/. -> /builds/nroc/aws-lambdas/Lambdas/.aws-sam/build/DBRotatePasswordDevelopment
2024-08-20 22:54:16,234 | Copying from container: /tmp/samcli/artifacts/. -> /builds/nroc/aws-lambdas/Lambdas/.aws-sam/build/DBRotatePasswordProduction
2024-08-20 22:54:20,172 | Build inside container succeeded
2024-08-20 22:54:20,188 | Build inside container succeeded
Terminated
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 1h0m0s seconds

Steps to reproduce:

  1. Change code and push
  2. Create merge request
  3. Pipeline runs
  4. Pipeline gets stuck on sam build -u

Observed result:

Pipeline gets stuck on sam build -u

Expected result:

Build should succeed and sam deploy should proceed

Additional environment details (Ex: Windows, Mac, Amazon Linux etc)

  1. OS: MacOS 13.1
  2. sam --version: 1.123.0
  3. AWS region: us-west-1
Output of sam --info:

{
  "version": "1.123.0",
  "system": {
    "python": "3.11.8",
    "os": "Linux-5.15.154+-x86_64-with"
  },
  "additional_dependencies": {
    "docker_engine": "23.0.6",
    "aws_cdk": "Not available",
    "terraform": "Not available"
  },
  "available_beta_feature_env_vars": [
    "SAM_CLI_BETA_FEATURES",
    "SAM_CLI_BETA_BUILD_PERFORMANCE",
    "SAM_CLI_BETA_TERRAFORM_SUPPORT",
    "SAM_CLI_BETA_RUST_CARGO_LAMBDA"
  ]
}

dkphm commented 3 weeks ago

Hi @BFlores16, thank you for reporting the issue.

It looks to me like Docker was running out of memory on the host that runs sam build -u. Could you please provide the infrastructure details of the host that runs the job (CPU, RAM, disk space, etc.)?

You can also try splitting the build into smaller batches of Lambdas and see whether it still fails.

BFlores16 commented 3 weeks ago

I'm not sure how I would split into smaller batches of Lambdas; could you suggest how?

I am using the GitLab shared runners, which should have the following configuration. Here's some info I output in my pipeline:

$ echo "===== Memory Info ====="
===== Memory Info =====
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       569Mi       5.8Gi       1.0Mi       1.7Gi       7.2Gi
Swap:          2.0Gi          0B       2.0Gi
$ echo "===== Disk Space Info ====="
===== Disk Space Info =====
$ df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  25.4G      7.7G     17.6G  31% /
tmpfs                    64.0M         0     64.0M   0% /dev
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/sda1                25.4G      7.7G     17.6G  31% /builds
/dev/sda1                25.4G      7.7G     17.6G  31% /certs/client
/dev/sda1                25.4G      7.7G     17.6G  31% /etc/resolv.conf
/dev/sda1                25.4G      7.7G     17.6G  31% /etc/hostname
/dev/sda1                25.4G      7.7G     17.6G  31% /etc/hosts
/dev/sda1                25.4G      7.7G     17.6G  31% /var/lib/docker
tmpfs                     3.9G         0      3.9G   0% /sys/devices/virtual/dmi/id
$ echo "===== Docker Info ====="
===== Docker Info =====
$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.22.0
    Path:     /usr/local/libexec/docker/cli-plugins/docker-compose
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 23.0.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.154+
 Operating System: Alpine Linux v3.18 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.768GiB
 Name: 08ef878f2db3
 ID: a2b3703f-1391-4b0a-8cc4-4d59185b43eb
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

dkphm commented 3 weeks ago

Hi @BFlores16, thank you for the info.

I assume the memory snapshot was captured when the pipeline was not running? If so, is there a way you can check the host's metrics to confirm there was no spike in memory or disk usage while the pipeline was running?

Also, it looks like you are running sam build on Linux, so a containerized build may not be necessary in this case. Another option I would suggest is running sam build without the -u option to see whether it works.

In terms of splitting the Lambdas into smaller batches, you can try having multiple templates, each deploying a smaller set of functions. Note that this will result in multiple CloudFormation stacks.
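
A minimal sketch of that approach, assuming the functions are split across two hypothetical templates (template-core.yaml and template-misc.yaml are illustrative names), each deployed as its own stack:

cd Lambdas
# First subset of functions; sam build writes to .aws-sam/build, which sam deploy picks up
sam build -u --template template-core.yaml
sam deploy --stack-name Lambdas-Core --region $AWS_REGION --no-execute-changeset --no-fail-on-empty-changeset
# Second subset, deployed as a separate stack
sam build -u --template template-misc.yaml
sam deploy --stack-name Lambdas-Misc --region $AWS_REGION --no-execute-changeset --no-fail-on-empty-changeset

sam build also accepts a single resource's logical ID (for example sam build -u MyFunction, where MyFunction is illustrative), which is another way to build functions in smaller groups without restructuring the template.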

BFlores16 commented 3 weeks ago

The memory was captured during the pipeline run but prior to the sam build command. I don't think there is a way for me to evaluate the memory usage during the sam build command.

sam build without the -u flag does not work for me because many of my Lambdas are built in containers due to their requirements.txt files. I may be ignorant of the best way to package my dependencies and deploy them in the pipeline, so if you have any suggestions I would welcome them. Here's an example error without the -u flag:

Build Failed
Error: PythonPipBuilder:Validation - Binary validation failed for python, searched for python in following locations : ['/usr/bin/python', '/usr/bin/python3'] which did not satisfy constraints for runtime: python3.9. Do you have python for runtime: python3.9 on your PATH?
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: exit code 1

dkphm commented 3 weeks ago

Hi @BFlores16, thank you for the prompt response. I would like to dive deeper into the logs to see what could have gone wrong; could you please send us the full output of sam build -u --debug from the pipeline?

Thanks!

BFlores16 commented 3 weeks ago

Here is the log from a previous run with the --debug flag on

sam-build-log.txt

dkphm commented 3 weeks ago

Thanks @BFlores16 for the logs.

It looks like I was able to reproduce the issue; it is likely caused by one of the containers not returning control to the main thread after finishing its build. We are working on a fix and hopefully it will be available in the next release.

In the meantime, if you install Python (the same version as defined in template.yaml) on the host and run sam build without -u, I believe that will unblock the pipeline.
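
For context, the runtime that sam build validates against comes from each function's Runtime property in template.yaml; a hypothetical function definition (names are illustrative) looks like this:

  MyRotateFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.lambda_handler
      Runtime: python3.9        # non-container builds need a matching python3.9 on the host PATH
      CodeUri: my_rotate_function/

With Runtime: python3.9, a build without -u looks for a python3.9 interpreter on the host, which is what the earlier "Binary validation failed" error was complaining about.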

BFlores16 commented 2 weeks ago

I ended up changing all my Python Lambdas to a consistent version and installing Python and Node.js in my container. I also removed the -u flag. These were the added benefits:

I may check one day to see whether the team actually fixes the reported bug, but I doubt I would switch back to using the flag, as this setup is quite convenient.

The only benefit I can see to using the flag is that you wouldn't need to install your runtime versions explicitly or maintain your pipeline as much. That is probably what contributes to the increased run time, though.

Here is how I modified my pipeline file:

variables:
  SAM_TEMPLATE: Lambdas/template.yaml
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"

services:
  - docker:23.0.6-dind

image: docker:23.0.6

before_script:
  - apk add --update python3 py3-pip python3-dev build-base libffi-dev util-linux procps
  - apk add nodejs npm
  - if ! python3 --version | grep -q "3.11"; then apk add --repository=http://dl-cdn.alpinelinux.org/alpine/edge/community python3=3.11*; fi
  - ln -sf /usr/bin/python3 /usr/local/bin/python
  - ln -sf /usr/bin/node /usr/local/bin/node
  - pip install --upgrade pip
  - pip install awscli aws-sam-cli

stages:
  - preview
  - deploy 

preview:
  stage: preview
  timeout: 20m
  script:
    - chmod 755 aws-variables.sh
    - ./aws-variables.sh
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - cd Lambdas
    - sam build
    - sam deploy --region $AWS_REGION --no-execute-changeset --no-fail-on-empty-changeset
    - cd ..
    - changeset_id=$(aws cloudformation describe-change-set --stack-name Lambdas --change-set-name $(aws cloudformation list-change-sets --stack-name Lambdas --query "sort_by(Summaries, &CreationTime)[-1].ChangeSetName" --output text) --query "ChangeSetId" --output text)
    - echo $changeset_id > changeset.txt
  artifacts:
    paths:
      - changeset.txt

deploy-prod:
  stage: deploy
  script:
    - chmod 755 aws-variables.sh
    - ./aws-variables.sh
    - changeset_id=$(cat changeset.txt)
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws cloudformation execute-change-set --change-set-name $changeset_id
  only:
    - main
    - develop
  when: manual
  environment:
    name: production

I would still be interested to see this bug fixed, as the -u flag makes it easier to package dependencies without having to install and maintain specific runtime versions.

dkphm commented 2 weeks ago

Hi @BFlores16, thank you for the feedback.

Yes, building without -u is much faster and easier, but it can be troublesome to maintain the environments and dependencies on all the hosts.

The issue has been added to our backlog and we will try to find a solution.