gchux / cloud-run-tcpdump

Cloud Run packet capturing sidecar
Apache License 2.0

Cloud Run tcpdump sidecar

This repository contains the source code to build a container image, bundling tcpdump and pcap-cli, to perform packet captures in Cloud Run multi-container deployments.

Captured packets are optionally translated to JSON and written into Cloud Logging


Motivation

During development, it is often useful to capture packets in order to troubleshoot specific or otherwise hard-to-pin-down network issues.

This container image is to be used as a sidecar of the Cloud Run main –ingress– container in order to perform a packet capture using tcpdump within the same network namespace.

The sidecar approach decouples packet capturing from the main –ingress– container, so the main container does not require any modifications; additionally, the sidecar has its own resource allocation, so tcpdump does not compete with the main app for resources.

NOTE: the main –ingress– container is the one to which all ingress traffic (HTTP requests) is delivered; for Cloud Run services, this is typically your APP container.

Features

Building blocks

How it works

The sidecar uses:

Prebuilt image flavors

The pcap sidecar provides prebuilt images compatible with both Cloud Run execution environments (gen1 and gen2).

[!IMPORTANT]

  • The gen1 images are compatible with BOTH the gen1 and gen2 Cloud Run execution environments.
  • The gen2 images are compatible with ONLY the gen2 Cloud Run execution environment.

This is because gen1 does not support the newest version of libpcap, whereas gen2 does.

How to deploy to Cloud Run

  1. Define environment variables to be used during Cloud Run service deployment:

    export SERVICE_NAME='...'           # Cloud Run or App Engine Flex service name
    export SERVICE_REGION='...'         # GCP Region: https://cloud.google.com/about/locations
    export SERVICE_ACCOUNT='...'        # Cloud Run service's identity
    export INGRESS_CONTAINER_NAME='...' # the name of the ingress container, e.g. `app`
    export INGRESS_IMAGE_URI='...'
    export INGRESS_PORT='...'
    export TCPDUMP_SIDECAR_NAME='...'   # the name of the pcap sidecar, e.g. `pcap-sidecar`
    # public image compatible with both gen1 & gen2; alternatively, build your own
    export TCPDUMP_IMAGE_URI='us-central1-docker.pkg.dev/pcap-sidecar/pcap-sidecar/pcap-sidecar:latest'
    export PCAP_IFACE='eth'             # prefix of the interface on which packets should be captured
    export PCAP_GCS_BUCKET='...'        # the name of the Cloud Storage Bucket to mount
    export PCAP_FILTER='...'            # the BPF filter to use, e.g. `tcp port 443`
    export PCAP_JSON_LOG=true           # set to `true` for writing structured logs into Cloud Logging
  2. Deploy the Cloud Run service including the tcpdump sidecar:

[!NOTE]
If you are adding the tcpdump sidecar to a preexisting single-container Cloud Run service, the gcloud command below will fail.

You will need to make these updates via the Cloud Console instead, or create a new Cloud Run service.

gcloud run deploy ${SERVICE_NAME} \
  --project=${PROJECT_ID} \
  --region=${SERVICE_REGION} \
  --service-account=${SERVICE_ACCOUNT} \
  --container=${INGRESS_CONTAINER_NAME} \
  --image=${INGRESS_IMAGE_URI} \
  --port=${INGRESS_PORT} \
  --container=${TCPDUMP_SIDECAR_NAME} \
  --image=${TCPDUMP_IMAGE_URI} \
  --cpu=1 --memory=1G \
  --set-env-vars="PCAP_IFACE=${PCAP_IFACE},PCAP_GCS_BUCKET=${PCAP_GCS_BUCKET},PCAP_FILTER=${PCAP_FILTER},PCAP_JSON_LOG=${PCAP_JSON_LOG}"

See the full list of available flags for gcloud run deploy at https://cloud.google.com/sdk/gcloud/reference/run/deploy
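
Once the deployment succeeds, one optional sanity check (not part of the repository's documented steps) is to describe the service and confirm that the new revision contains both containers:

    gcloud run services describe ${SERVICE_NAME} \
      --project=${PROJECT_ID} \
      --region=${SERVICE_REGION} \
      --format=yaml

Both ${INGRESS_CONTAINER_NAME} and ${TCPDUMP_SIDECAR_NAME} should appear under spec.template.spec.containers in the output.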

  3. All other containers need to depend on the tcpdump sidecar, but this configuration is not available via gcloud, since it also requires configuring a healthcheck for the sidecar container. To set it up, edit the Cloud Run service via the Cloud Console: make all other containers depend on the tcpdump sidecar, and add the following TCP startup probe healthcheck to the tcpdump sidecar (a YAML sketch of the resulting configuration is shown below):
startupProbe:
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 10
  tcpSocket:
    port: 12345

You can optionally choose a different port by setting PCAP_HC_PORT as an environment variable of the tcpdump sidecar.
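
If you prefer not to use the Cloud Console, the same result can also be achieved by exporting the service YAML (gcloud run services describe ${SERVICE_NAME} --region=${SERVICE_REGION} --format=export), editing it, and re-applying it with gcloud run services replace. The following is only a minimal sketch of the relevant pieces, assuming the example container names `app` and `pcap-sidecar` used earlier; it relies on Cloud Run's container-dependencies annotation for start-up ordering:

    spec:
      template:
        metadata:
          annotations:
            # the JSON map reads: container `app` depends on (starts after) container `pcap-sidecar`
            run.googleapis.com/container-dependencies: '{"app": ["pcap-sidecar"]}'
        spec:
          containers:
          - name: app
            image: ...            # INGRESS_IMAGE_URI
          - name: pcap-sidecar
            image: ...            # TCPDUMP_IMAGE_URI
            startupProbe:
              timeoutSeconds: 1
              periodSeconds: 10
              failureThreshold: 10
              tcpSocket:
                port: 12345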

Available configurations

The tcpdump sidecar accepts the following environment variables:

Advanced configurations

More advanced use cases may benefit from scheduling tcpdump executions. Use the following environment variables to configure scheduling:
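
As a rough sketch only: PCAP_USE_CRON and PCAP_TIMEZONE also appear in the App Engine Flexible sample later in this README, PCAP_TIMEOUT_SECS and PCAP_ROTATE_SECS are commented with the semantics their names suggest, and PCAP_CRON_EXP is a placeholder name used purely for illustration (check the table above for the actual variable). A scheduled capture might then be configured along these lines:

    PCAP_USE_CRON=true                 # enable scheduled (cron-driven) captures instead of capturing continuously
    PCAP_CRON_EXP='*/30 * * * *'       # placeholder name: cron expression controlling when a capture starts
    PCAP_TIMEZONE=America/Los_Angeles  # timezone in which the schedule is evaluated
    PCAP_TIMEOUT_SECS=60               # presumably how long each capture runs
    PCAP_ROTATE_SECS=30                # presumably how often PCAP files are rotated during a capture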

Considerations

Download and Merge all PCAP Files

  1. Use Cloud Logging to look for the entry starting with: [INFO] - PCAP files available at: gs://...

    It may be useful to use the following filter:

    resource.type = "cloud_run_revision"
    resource.labels.service_name = "<cloud-run-service-name>"
    resource.labels.location = "<cloud-run-service-region>"
    "<cloud-run-revision-name>"
    "PCAP files available at:"

    This entry contains the exact Cloud Storage path to be used to download all the PCAP files; a gcloud logging read sketch for locating this entry from the CLI is shown after this list.

    Copy the full path including the prefix gs://, and assign it to the environment variable GCS_PCAP_PATH.

  2. Download all PCAP files using:

    mkdir pcap_files
    cd pcap_files
    gcloud storage cp ${GCS_PCAP_PATH}/*.gz . # use `${GCS_PCAP_PATH}/*.pcap` if `PCAP_COMPRESS` was set to `false`
  3. If PCAP_COMPRESS was set to true, uncompress all the PCAP files: gunzip ./*.gz

  4. Merge all PCAP files into a single file:

    • for .pcap files: mergecap -w full.pcap -F pcap ./*_part_*.pcap

    • for .json files: cat *_part_*.json | jq -crMs 'sort_by(.pcap.date)' > pcap.json

    See mergecap docs: https://www.wireshark.org/docs/man-pages/mergecap.html

    See jq docs: https://jqlang.github.io/jq/manual/. JSON pcaps are particularly useful when Wireshark is not available.
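
If you prefer the gcloud CLI over the Logs Explorer for step 1, the same entry can be located with gcloud logging read; this is only a sketch, and the --format projection may need adjusting depending on whether the entry is a text or structured payload:

    gcloud logging read '
      resource.type = "cloud_run_revision"
      resource.labels.service_name = "<cloud-run-service-name>"
      resource.labels.location = "<cloud-run-service-region>"
      "PCAP files available at:"' \
      --project=${PROJECT_ID} \
      --limit=10 \
      --format='value(textPayload)'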


How to build the sidecar yourself

  1. Define the PROJECT_ID environment variable; e.g. export PROJECT_ID='...'.

  2. Clone this repository:

    git clone --depth=1 --branch=main --single-branch https://github.com/gchux/cloud-run-tcpdump.git

[!TIP] If you prefer to let Cloud Build perform all the tasks, go directly to the Using Cloud Build section below.

  3. Move into the repository's local directory: cd cloud-run-tcpdump.

Continue with one of the following alternatives:

Using a local environment or Cloud Shell

  1. Build and push the tcpdump sidecar container image:

    export TCPDUMP_IMAGE_URI='...'   # this is usually Artifact Registry e.g. '${_REPO_LOCATION}-docker.pkg.dev/${PROJECT_ID}/${_REPO_NAME}/${_IMAGE_NAME}'
    export RUNTIME_ENVIRONMENT='...' # either 'cloud_run_gen1' or 'cloud_run_gen2'
    ./docker_build ${RUNTIME_ENVIRONMENT} ${TCPDUMP_IMAGE_URI}
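
If your local Docker daemon is not yet authorized against Artifact Registry, or if the image still needs to be pushed after building, a typical follow-up looks like the sketch below (replace us-central1 with your repository's location; whether ./docker_build already pushes the image depends on the script, so the push may be redundant):

    # allow the local Docker client to authenticate against Artifact Registry
    gcloud auth configure-docker us-central1-docker.pkg.dev
    # push the freshly built image to the repository referenced by TCPDUMP_IMAGE_URI
    docker push ${TCPDUMP_IMAGE_URI}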

Using Cloud Build

This approach assumes that Artifact Registry is available in PROJECT_ID.

  1. Define the following environment variables:

    export REPO_LOCATION='...' # Artifact Registry Docker repository location e.g. us-central1
    export REPO_NAME='...'     # Artifact Registry Docker repository name
    export IMAGE_NAME='...'    # container image name; e.g. `pcap-sidecar`
  2. Build and push the tcpdump sidecar container image using Cloud Build:

    gcloud builds submit \
     --project=${PROJECT_ID} \
     --config=$(pwd)/cloudbuild.yaml \
     --substitutions="_REPO_LOCATION=${REPO_LOCATION},_REPO_NAME=${REPO_NAME},_IMAGE_NAME=${IMAGE_NAME}" $(pwd)

See the full list of available flags for gcloud builds submit: https://cloud.google.com/sdk/gcloud/reference/builds/submit
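
After the build completes, you can optionally confirm that the image and its tags landed in the repository:

    gcloud artifacts docker images list \
      ${REPO_LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME} \
      --include-tags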

Using with App Engine Flexible

  1. Enable debug mode on an App Engine Flexible instance: https://cloud.google.com/appengine/docs/flexible/debugging-an-instance#enabling_and_disabling_debug_mode

  2. Connect to the instance using SSH: https://cloud.google.com/appengine/docs/flexible/debugging-an-instance#connecting_to_the_instance

  3. Escalate privileges; execute: sudo su

  4. Create an env file named pcap.env; use the following sample to define the sidecar variables:

    # $ touch pcap.env
    PCAP_GAE=true
    PCAP_GCS_BUCKET=the-gcs-bucket    # the name of the Cloud Storage bucket used to store PCAP files
    GCS_MOUNT=/gae/pcap               # where to mount the Cloud Storage bucket within the container FS
    PCAP_IFACE=eth                    # network interface prefix
    PCAP_FILTER=tcp or udp            # BPF filter to scope packet capturing to specific network traffic
    PCAP_SNAPSHOT_LENGTH=0
    PCAP_USE_CRON=false               # do not schedule packet capturing
    PCAP_TIMEZONE=America/Los_Angeles
    PCAP_TIMEOUT_SECS=60
    PCAP_ROTATE_SECS=30
    PCAP_TCPDUMP=true
    PCAP_JSON=true
    PCAP_JSON_LOG=false               # NOT necessary, packet translations are streamed directly to Cloud Logging
    PCAP_ORDERED=false
  5. Create a directory to store the PCAP files in the host filesystem: mkdir gae

  6. Pull the sidecar container image: docker --config=/etc/docker pull ${TCPDUMP_IMAGE_URI}

  7. Run the sidecar to start capturing packets:

    docker run --rm --name=pcap -it \
      --cpus=1 --cpuset-cpus=1 \
      --privileged --network=host \
      --env-file=./pcap.env \
      -v ./gae:/gae -v /var/log:/var/log \
      -v /var/run/docker.sock:/docker.sock \
      ${TCPDUMP_IMAGE_URI} nsenter -t 1 -u -n -i /init \
      >/var/log/app_engine/app/STDOUT_pcap.log \
      2>/var/log/app_engine/app/STDERR_pcap.log

NOTE: for GAE Flex, it is strongly recommended NOT to use PCAP_FILTER=tcp or udp (or even tcp port 443), because packets are streamed into Cloud Logging using its gRPC API, which means that traffic is HTTP/2 over TCP. Capturing all TCP and UDP traffic would therefore also capture everything being exported to Cloud Logging, causing a write-amplification effect that starves memory, since all of your traffic eventually ends up buffered in the sidecar's memory.
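
When the capture is no longer needed, a reasonable teardown (a sketch, assuming the container name pcap from the run command above, and placeholder instance/service/version values) is to stop the sidecar container and then take the instance out of debug mode so App Engine can recycle it:

    # from another SSH session on the instance; --rm removes the container once it stops
    docker stop pcap
    # from your workstation, disable debug mode (see the App Engine Flexible docs linked above)
    gcloud app instances disable-debug <instance> --service=<service> --version=<version>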