arkime / aws-aio

Apache License 2.0

Ensure Capture Nodes Autoscale #31

Open chelma opened 1 year ago

chelma commented 1 year ago

Description

The Capture Nodes use ECS-on-EC2 for their compute. However, it's unclear whether the current CDK configuration will actually enable scaling of the containers as expected when their CPU/Memory usage increases. This task is to ensure the ECS capture containers do scale up to the limit provided by their backing EC2 ASG.
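
For context on what "scaling of the containers" means at the Service level, here is a minimal, hypothetical CDK sketch (not our actual configuration; the helper name, thresholds, and capacities are placeholders) of the target-tracking setup that should raise the Service's desired Task count when CPU or memory usage climbs:

import * as ecs from 'aws-cdk-lib/aws-ecs';

// Hypothetical helper: attach CPU/memory target-tracking scaling to the capture Service
export function addCaptureServiceScaling(service: ecs.Ec2Service): void {
  // Let the Service run between 1 and 10 Tasks (placeholder limits)
  const scaling = service.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });

  // Each policy creates the CloudWatch alarms that adjust the desired Task count
  // when average utilization drifts away from the target value
  scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 60 });
  scaling.scaleOnMemoryUtilization('MemoryScaling', { targetUtilizationPercent: 60 });
}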

Acceptance Criteria

chelma commented 1 year ago

After thinking about this task a bit, it seems to call for a better way to generate test traffic than our existing demo generators allow. Specifically, the current demo generators hit third-party websites (Alexa top 100) that we don't own. The amount of traffic we're currently driving against them is negligible, but in order to stress-test our capture setups we'll want to drive substantial volumes of traffic through our mirroring mechanism. Therefore, the responsible (and practical) thing to do seems to be to create our own traffic sink(s) to receive our test traffic and update our traffic generation mechanism to drive more traffic per host.

Basically, I'd propose that we create a new pair of top-level CLI commands: create-stress-test-setup and destroy-stress-test-setup.

chelma commented 1 year ago

It looks like there are a few tools we could use as a traffic sink. HTTPBin even has an official Docker image we can simply reuse:

FROM kennethreitz/httpbin:latest

# Expose the port the app runs on
EXPOSE 80

Sample CDK Snippet to generate the sink VPC:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

export class TrafficSinkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a VPC
    const vpc = new ec2.Vpc(this, 'VPC', {
      maxAzs: 2,
    });

    // Create an ECS cluster
    const cluster = new ecs.Cluster(this, 'Cluster', {
      vpc: vpc,
    });

    // Create a Fargate service
    const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'FargateService',
      {
        cluster: cluster,
        taskImageOptions: {
          image: ecs.ContainerImage.fromAsset(__dirname, {
            file: 'Dockerfile',
          }),
          containerPort: 80,
        },
        publicLoadBalancer: true,
      }
    );

    // Output the DNS name of the ALB
    new cdk.CfnOutput(this, 'LoadBalancerDNS', {
      value: fargateService.loadBalancer.loadBalancerDnsName,
    });
  }
}
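
For completeness, a stack like this would be instantiated from a small CDK app entrypoint, roughly like the following (the module path and stack name here are hypothetical):

import * as cdk from 'aws-cdk-lib';
import { TrafficSinkStack } from './traffic-sink-stack'; // hypothetical module path

const app = new cdk.App();
new TrafficSinkStack(app, 'TrafficSinkStack');
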
awick commented 1 year ago

In general I'm concerned about capture auto scaling because we need the same traffic flows to go to the same capture instances. It looks like the GWLB handles scaling up by using sticky flows; it's the scaling down that I want to make sure we test well. The GWLB does appear to support a 350-second draining state where it will continue to send old flows to a draining target, but not new flows. We should probably add something about this to the acceptance criteria: that we deregister targets on scale-down, wait the 350s before terminating, etc.

This is all to say that it's more important that we get the create-cluster --expected-gbps feature (#34) implemented first and use that for initial scaling.

It also means that for testing we should vary not just the size of the flows but the number of flows as well.

We should also decide what unit of capture we want each capture instance to handle.
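
A rough sketch of what the deregistration/drain handling could look like in CDK is below. This is a hypothetical helper (not our actual code): the 350s window is taken from the discussion above rather than verified against the GWLB docs, and since there's no L2 construct for GWLB target groups yet it uses the L1 (Cfn) resource.

import * as cdk from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';

// Hypothetical helper: give the capture nodes' GWLB target group a drain window
// and hold terminating instances until that window has passed
export function addScaleInDraining(
  scope: Construct,
  vpc: ec2.IVpc,
  captureAsg: autoscaling.AutoScalingGroup,
): elbv2.CfnTargetGroup {
  // GENEVE target group for the GWLB; during the deregistration delay, old
  // flows keep going to a draining target while new flows do not
  const targetGroup = new elbv2.CfnTargetGroup(scope, 'GwlbTargetGroup', {
    protocol: 'GENEVE',
    port: 6081,
    vpcId: vpc.vpcId,
    targetType: 'instance',
    targetGroupAttributes: [
      // 350s figure from the comment above
      { key: 'deregistration_delay.timeout_seconds', value: '350' },
    ],
  });

  // Hold instances in Terminating:Wait for the drain window, then let them go
  captureAsg.addLifecycleHook('DrainBeforeTerminate', {
    lifecycleTransition: autoscaling.LifecycleTransition.INSTANCE_TERMINATING,
    heartbeatTimeout: cdk.Duration.seconds(350),
    defaultResult: autoscaling.DefaultResult.CONTINUE,
  });

  return targetGroup;
}

Actually registering/deregistering the ASG's instances with this target group on scale events isn't shown here; that's part of what the acceptance criteria should cover.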

chelma commented 8 months ago

I spent some time testing the scaling-up of our ECS-on-EC2 Capture Nodes as part of https://github.com/arkime/aws-aio/issues/147. It's currently not working the way I would expect, and I'm not sure how to get it working without a deeper dive.

Based on the ECS docs and blog posts [1] [2], what should happen is the following (a rough CDK sketch of this wiring follows the list):

  1. Demand on the ECS containers (CPU, memory, etc.) increases past the target set by the ECS Service's scaling policy.
  2. A CloudWatch Alarm created by the Service's scaling policy fires, kicking off a scaling action that raises the Service's desired Task count, so ECS tries to place another Task on the instances in the EC2 ASG.
  3. If there is not enough room on those instances to fit another Task, ECS still creates the Task but places it in the PROVISIONING state.
  4. Tasks in the PROVISIONING state increase the CloudWatch metric CapacityProviderReservation in the AWS/ECS/ManagedScaling namespace (one metric per Capacity Provider). When the metric goes over 100%, the managed-scaling policy attached to the associated ASG tells it to provision new instances, up to its own scaling limits.
  5. The ASG spins up new instances according to that policy.
  6. ECS places the PROVISIONING Tasks onto the new instances, continuing until the scaling limits are reached or all Tasks are placed.
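
For reference, here's a minimal, hypothetical CDK sketch of the capacity-provider side of that chain (steps 3-5). Construct names, the instance type, the placeholder container image, and the capacity limits are illustrative rather than our actual configuration, and the Service-level target tracking that drives steps 1-2 is assumed to be attached separately.

import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Construct } from 'constructs';

// Hypothetical sketch: an ASG-backed Capacity Provider with managed scaling,
// and a Service that launches via that provider
export function wireClusterAutoScaling(scope: Construct, vpc: ec2.IVpc): ecs.Ec2Service {
  const cluster = new ecs.Cluster(scope, 'CaptureCluster', { vpc });

  // EC2 ASG that provides the cluster's capacity
  const asg = new autoscaling.AutoScalingGroup(scope, 'CaptureAsg', {
    vpc,
    instanceType: new ec2.InstanceType('m5.xlarge'),
    machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
    minCapacity: 1,
    maxCapacity: 10,
  });

  // Managed scaling publishes CapacityProviderReservation and attaches the
  // target-tracking policy that scales the ASG when Tasks sit in PROVISIONING
  const capacityProvider = new ecs.AsgCapacityProvider(scope, 'CaptureCapacityProvider', {
    autoScalingGroup: asg,
    enableManagedScaling: true,
    targetCapacityPercent: 100,
  });
  cluster.addAsgCapacityProvider(capacityProvider);

  const taskDef = new ecs.Ec2TaskDefinition(scope, 'CaptureTaskDef');
  taskDef.addContainer('capture', {
    // Placeholder image standing in for the capture container
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/docker/library/busybox:latest'),
    memoryReservationMiB: 512,
  });

  // The Service launches via the capacity provider strategy (rather than a
  // plain EC2 launch type) so that unplaceable Tasks enter PROVISIONING and
  // count toward CapacityProviderReservation (see the docs at [1])
  return new ecs.Ec2Service(scope, 'CaptureService', {
    cluster,
    taskDefinition: taskDef,
    capacityProviderStrategies: [
      { capacityProvider: capacityProvider.capacityProviderName, weight: 1 },
    ],
  });
}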

With our current CDK code, everything appears to be set up and linked correctly, but when the ECS Service tries to spin up a new Task and finds there isn't room (step 3), no Tasks are created in the PROVISIONING state, so the metric the linked ASG watches in order to scale (step 4) never increases. Instead, we just get the standard (and expected) "unable to place a task because no container instance met all of its requirements" error message in the Service event history, which per the docs should be accompanied by Tasks sitting in the PROVISIONING state.

[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-auto-scaling.html
[2] https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/