aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release and operate production ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

Routing an ALB healthcheck to a sidecar? #5931

Closed CharlieDigital closed 2 months ago

CharlieDigital commented 2 months ago

I have a Neo4j Community container deployed to ECS as a Load Balanced Web Service, since Neo4j exposes a web front-end and this makes it easier to connect everything together.

An EFS volume is mounted to maintain the file system between deployments.

The ALB health check was initially pointed at Neo4j's port 7474, but there's a conundrum here: Neo4j Community does not support online backups. So if I run neo4j stop, the health check obviously fails as well, and the container cycles before I can take the backup and run neo4j start again.

My approach was to deploy a sidecar and route the ALB health check to it instead. The sidecar can either just respond with 200 (for testing) or apply some custom logic to monitor the health of the Neo4j container.

However, I'm not clear that this is possible.

  1. Is there another pattern that is more suitable to use here? The Neo4j instance works fine deployed like this, but I currently cannot stop the service for an offline backup.
  2. Is it possible to route the health check to the sidecar? What would that configuration look like? Is the sidecar already connected to the ALB as well?

The configuration seems correct:

[image]

The sidecar listens on port 7470 and exposes a /health endpoint that returns HTTP 200.
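For reference, a sidecar of this kind can be very small. The following is a minimal Python sketch, not the actual image referenced in the manifest: it assumes only what is stated above (port 7470, a /health path, an unconditional 200). The names HealthHandler and make_server are made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200; every other path gets a 404."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Health checks arrive every few seconds; keep the logs quiet.
        pass


def make_server(port=7470):
    """Build (but do not start) the health server, bound to all interfaces."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() as the container's command would start it. Any custom logic for checking Neo4j would go in do_GET, before the response is sent.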

But stopping the Neo4j service causes the container to be restarted.

Manifest below:

# Your service name will be used in naming your resources like log groups, ECS services, etc.
name: neo4j-db
type: Load Balanced Web Service

# We use a sidecar to respond to the healthcheck so we can stop the neo4j instance
sidecars:
  health:
    port: 7470
    image: ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/our-apps/health-sidecar

# Configuration for your containers and service.
image:
  location: docker.io/neo4j:5-community
  # Port exposed through your container to route traffic to it.
  port: 7474
  depends_on:
    health: start

cpu: 1024       # Number of CPU units for the task.
memory: 2048    # Amount of memory in MiB used by the task.
count: 1       # Number of tasks that should be running in your service.
exec: true     # Enable running commands in your container.
network:
  connect: true # Enable Service Connect for intra-environment traffic between services.

# See EFS: https://aws.github.io/copilot-cli/docs/developing/storage/#managed-efs
# This is the path inside the container
storage:
  volumes:
    neo4j_data_volume:
      efs:
        uid: 7474 # The UID of the neo4j user via id -u neo4j
        gid: 7474 # The GID of the neo4j user via id -g neo4j
      path: /data
      read_only: false

# This is a workaround; see:
# - https://github.com/aws/copilot-cli/issues/5907
# - https://github.com/aws/copilot-cli/issues/1292
secrets:
  NEO4J_PLUGINS: /copilot/${COPILOT_APPLICATION_NAME}/${COPILOT_ENVIRONMENT_NAME}/secrets/NEO4J_PLUGINS

variables:
  NEO4J_apoc_export_file_enabled: true
  NEO4J_apoc_import_file_enabled: true
  NEO4J_apoc_import_file_use__neo4j__config: true
  #NEO4J_PLUGINS: "['apoc', 'apoc-extended', 'graph-data-science']"
  NEO4J_dbms_security_procedures_unrestricted: apoc.*,gds.*,algo.*,spatial.*

# Cannot add a certificate to the NLB; must manually do it or use CF
nlb:
  port: 7687/tcp
  target_port: 7687
  stickiness: true

# Force recreate since Neo4j is holding a lock on the file system.
deployment:
  rolling: recreate

# Distribute traffic to your service.
http:
  # Import the existing ownit-shared-lb
  alb: arn:aws:elasticloadbalancing:us-east-1:ACCOUNT:loadbalancer/app/shared-beta-lb/RESOURCE_ID
  path: "/"
  deregistration_delay: 5s # Speeds up deploys
  redirect_to_https: true
  alias: "domain.example.com"
  hosted_zone: "ZONE_ID"
  healthcheck:
    path: "/health"
    port: 7470
    healthy_threshold: 2
    unhealthy_threshold: 3
    grace_period: 240s

Lou1415926 commented 2 months ago

> Is it possible to route the health check to the sidecar? What would that configuration look like? Is the sidecar already connected to the ALB as well?

Your configuration in the manifest seems fine to me at first glance. The health sidecar is listening on path '/health' on port 7470, and the health check traffic is routed to exactly that place. It should be the sidecar that is responding to the health check. Is that what you observed as well?

> But stopping the Neo4j service causes the container to be restarted.

Sounds like running neo4j stop kills the main container, and you want to be able to restart Neo4j without killing the container. Is that correct? In this case, would this help?

CharlieDigital commented 2 months ago

> Your configuration in the manifest seems fine to me at first glance. The health sidecar is listening on path '/health' on port 7470, and the health check traffic is routed to exactly that place. It should be the sidecar that is responding to the health check. Is that what you observed as well?

Appreciate the sanity check!

After following up with Neo4j, it seems this is an issue with their container configuration, and the solution is to customize the container so that the service can be stopped.
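
What that customization looks like is not spelled out in this thread. One possible shape, purely an assumption on my part and not something Neo4j or the Copilot team described, is a small supervisor that runs as the container's main process and starts or stops the database as a child, so that stopping the database no longer ends the container. A rough Python sketch, where supervise and the pause-file convention are hypothetical:

```python
import os
import subprocess
import time


def supervise(cmd, pause_file, poll_seconds=1.0, max_polls=None):
    """Run `cmd` as a child while this process stays alive.

    While `pause_file` exists the child is stopped and kept stopped; once the
    file is removed the child is started again. Returns how many times the
    child was started. `max_polls` bounds the loop (None means run forever).
    """
    child = None
    starts = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        running = child is not None and child.poll() is None
        paused = os.path.exists(pause_file)
        if paused and running:
            child.terminate()  # ask the child to shut down, then wait for it
            child.wait()
        elif not paused and not running:
            child = subprocess.Popen(cmd)
            starts += 1
        time.sleep(poll_seconds)
        polls += 1
    # Only reached on bounded runs: do not leave the child behind.
    if child is not None and child.poll() is None:
        child.terminate()
        child.wait()
    return starts
```

With something like supervise(["neo4j", "console"], "/data/.maintenance") as the entrypoint (the flag path is made up), creating the flag file via copilot svc exec would stop the database for an offline backup, and deleting it would bring the database back. The sketch leaves out signal forwarding and silently restarts a crashed child, so it would need hardening before real use.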