Incorrect environment manifest documentation leads to unhelpful YAML "unmarshal" error

Offlein commented 1 year ago

Since I upgraded to 1.30.0 today, my deploys started giving me this totally-easy-to-comprehend error:

✘ unmarshal environment manifest for "prod": unmarshal environment manifest: yaml: unmarshal errors:
  line 21: cannot unmarshal !!str `sg-000{redacted}...` into manifest.securityGroupRule

I was able to figure out I should check my "prod" environment YAML file, which hasn't changed. The lines in question seem perfectly reasonable per the documentation at first glance, until you realize the docs flip back and forth between the expected parameters of:

network.vpc.security-group.xyz and network.vpc.security_group.xyz

I was using security_group with an underscore, but the correct answer was security-group with a dash.

bvtujo commented 1 year ago

Hi @Offlein, just to confirm, you had success using the key security-group? That's surprising to me; I worry that yaml is swallowing your SG config instead of parsing it.

Can you help me understand by showing me the manifest snippet which resulted in the error? That would help a lot with reproduction and troubleshooting.

bvtujo commented 1 year ago

Oh, I think I see what's happening here. It looks like you're trying to pass a security group ID to network.vpc.security_group in the environment manifest.

The environment manifest's security_group field is only specifiable as a map. It lets you customize the security group rules for the copilot-managed security group that we create with every environment.

type: Environment
name: prod
network:
  vpc:
    security_group:
      ingress:
        - ip_protocol: tcp
          ports: 80  
          cidr: 10.0.1.0/24
      egress:
        - ip_protocol: tcp
          ports: 80  
          cidr: 0.0.0.0/0

If you want to attach additional security groups to your services or jobs, you can specify those as SG IDs in the network.vpc.security_groups field, like so:

type: Backend Service
name: be

network:
  vpc:
    security_groups: [sg-000001]

This is actually pretty confusing, and I apologize. We should probably do a better job of explaining this in our environment docs.
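To make the distinction concrete, here is a minimal Python sketch of the shape check that the unmarshal error is effectively reporting. It is not Copilot's code: `check_rules` and `RULE_KEYS` are hypothetical names, and treating all three keys from the example above as required is an assumption made for illustration.

```python
# Hypothetical checker, not part of Copilot. It mirrors the shape shown in the
# environment manifest example: each ingress/egress entry is a map, not a string.
RULE_KEYS = {"ip_protocol", "ports", "cidr"}  # assumed required, for illustration


def check_rules(entries):
    """Return a list of error strings; an empty list means every entry looks like a rule."""
    errors = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            # A bare security group ID lands here, much like the "cannot unmarshal !!str" error.
            errors.append(
                f"entry {i}: expected a rule map, got {type(entry).__name__} {entry!r}"
            )
            continue
        missing = RULE_KEYS - entry.keys()
        if missing:
            errors.append(f"entry {i}: missing keys {sorted(missing)}")
    return errors


good = [{"ip_protocol": "tcp", "ports": 80, "cidr": "0.0.0.0/0"}]
bad = ["sg-abcdabcdabcd"]

print(check_rules(good))
print(check_rules(bad))
```

The second call reports one problem because `sg-abcdabcdabcd` is a string where a rule map is expected, which is the same mismatch the original error message points at.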

Offlein commented 1 year ago

@bvtujo Thanks for the thoughtful reply. I believe your understanding is correct. But I'm not clear why it ever worked in that case. (Except that maybe I went in via the Web UI and somehow overrode this?)

To be clear, there was a typo that it looks like you've fixed (#5275) about security-group instead of security_group.

And I believe, looking back, I can see how my initial report could be super confusing as well.

...In my environment manifest.yml, I was setting the network.vpc.security_group.egress value to an array of security group IDs:

network:
  vpc:
    id: vpc-xxxxx
    subnets:
      public:
        - id: subnet-aaaa
        - id: subnet-bbbb
      private:
        - id: subnet-cccc
        - id: subnet-dddd
    security-group:
      egress: [ sg-abcdabcdabcd ]

It's because the documentation claims it is an Array of Security Group Rules and it was not clear to me what a "Security Group Rule" was.

(I can infer it is something in the structure:

- ip_protocol: tcp
  ports: 80   
  cidr: 0.0.0.0/0

although I somehow didn't infer this at the time.)

I will set the VPC security group IDs per your recommendation and see how that goes.

As an aside, I feel like I am experiencing lots of issues during this process, and it's exacerbated by (1) the fact that each deploy takes ~4-5 minutes if I'm very lucky, but usually closer to 8-12 minutes, and (2) the fact that once it's up I have no means of really inspecting what's running. I almost feel like maybe it'd be worthwhile to build an SSH server into my container and see if that helps, at least while testing?

bvtujo commented 1 year ago

I see, that makes sense. Currently Copilot doesn't let you specify security group IDs in the vpc security group egress rules, but you could do so by using copilot env override and yamlpatch.

For example, with the following manifest:

type: Environment
name: prod

You could run copilot env override, go through the prompts, then use a patch like the following to set up your ingress/egress rules:

- op: add 
  path: /Resources/EnvironmentSecurityGroup/Properties/SecurityGroupIngress
  value: 
    - SourceSecurityGroupId: sg-12345
      IpProtocol: tcp
      FromPort: 80
      ToPort: 80
      CidrIp: 10.0.1.0/24

This would result in the right security group ID being applied to your rules, but has the downside of requiring you to write raw CFN.

You could instead do

type: Environment
name: prod
network:
  vpc:
    security_group:
      ingress:
        - ip_protocol: tcp
          ports: 80  
          cidr: 10.0.1.0/24

and the following, more specific patch:

- op: add 
  path: /Resources/EnvironmentSecurityGroup/Properties/SecurityGroupIngress/0/SourceSecurityGroupId
  value: sg-12345
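
As a rough illustration of how that path resolves, the Python sketch below applies a single `add` operation to a nested structure. It is loosely modeled on JSON Patch `add` semantics and is not Copilot's yamlpatch implementation; `apply_add` is a hypothetical helper, and the shape of the sample template is an assumption about what the generated ingress rule looks like.

```python
import copy


def apply_add(doc, path, value):
    """Apply one 'add' op to a deep copy of doc and return the copy."""
    doc = copy.deepcopy(doc)
    parts = [p for p in path.split("/") if p]
    node = doc
    # Walk to the parent of the target; list segments are numeric indexes.
    for part in parts[:-1]:
        node = node[int(part)] if isinstance(node, list) else node[part]
    last = parts[-1]
    if isinstance(node, list):
        if last == "-":
            node.append(value)  # "-" means "append to the end of the list"
        else:
            node.insert(int(last), value)
    else:
        node[last] = value
    return doc


# Assumed shape of the generated template, reduced to the part the patch touches.
template = {
    "Resources": {
        "EnvironmentSecurityGroup": {
            "Properties": {
                "SecurityGroupIngress": [
                    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80, "CidrIp": "10.0.1.0/24"}
                ]
            }
        }
    }
}

patched = apply_add(
    template,
    "/Resources/EnvironmentSecurityGroup/Properties/SecurityGroupIngress/0/SourceSecurityGroupId",
    "sg-12345",
)
rule = patched["Resources"]["EnvironmentSecurityGroup"]["Properties"]["SecurityGroupIngress"][0]
print(rule)
```

The `0` segment selects the first ingress rule, so the patch only adds one property to a rule the manifest already produced rather than replacing the whole list.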

To help ease your deployment woes, I have a couple of answers for you.

If your deployments are just spinning for a long time before failure, >v1.30.0 includes functionality to interrupt deployments and roll back with ctrl+C.
You can inspect the running code by running copilot svc exec, which uses the SSM agent to set up a secure shell session inside your running containers.

bvtujo commented 1 year ago

You can check out the raw CFN which will be deployed by running copilot env package --output-dir env-cfn, where you can see all these properties and resource names.

Offlein commented 1 year ago

Thank you so much Austin. I am very impressed by your responsiveness and receptivity to feedback, and it really means a lot when, eh, one might not have high expectations from a large organization like AWS.

So I was very excited about your copilot svc exec command. Is this new? In the past I had looked for this and found nothing except an AWS re:Post question stating there was no way to access a running App Runner instance. (And some Stack Overflow answers saying the same).

Anyway, I tried running it and got a notice saying it would install the Session Manager plugin if I wanted, and I said yes, but it failed because I don't use "yum". (I'm on EndeavourOS / Arch.) I grabbed it from the AUR though and that's all well and good. After running it, it asks which environment I want, then says:

✘ executing a command in a running container part of a service is not supported for services with type: 'Request-Driven Web Service'

So maybe the App Runner stuff above is not inaccurate? :)

Otherwise, there are some things based on what you said that are puzzling to me, but I want to be very cognizant in case this goes beyond the scope of this conversation, in which case please feel free to dismiss my questions. I do feel like they are likely very common points of confusion for other users, however, and as such it may be helpful to you if I voice them, so I will. :)

[1] We're running a [Laravel, PHP-based] backend application with this. It talks to an RDS database and [I'm currently adding] an Elasticache Redis instance. I initially was faffing around with the Security Groups so that I could ensure App Runner could communicate (outbound) to the RDS Database that is already opened to some EC2 instances.

-- I'll pause to say that I know my understanding of Security Groups is imperfect, but I also feel like I have at least a "working" mental metaphor for them. --

2 EC2 instances share the same Security Group -- let's call it "StageSG" -- and our Stage RDS instance has a Security group that allows access from StageSG on the DB port. It seems to work.

We have a different Production RDS instance and a different Production EC2 instance that has a different pair of connected security groups (say, "ProdSG").

I was trying to get our Stage AppRunner/copilot environment to run in the same EC2 VPC with that same "StageSG" Security Group so it automatically works without affecting the RDS Security Group. (And, of course, have the Prod AppRunner/copilot environment use the existing Production VPC and existing ProdSG Security Group.)

This feels like it would be a massively common use case, I assume? But maybe I'm wrong. This is why I erroneously believed I could specify the egress security groups per-environment manifest.yml.

I guess my confusion is how the expected use case could be that the VPC/Security Group is specified for the entire service. I would think that VPCs/Security Groups almost always are different per environment?

[2] I read through the "Backend Service" manifest docs per your earlier comment about the network.vpc.security_groups, and with attention to the environments key, it seemed like I COULD override that network.vpc.security_groups key per environment. I just tried this, and it did not seem to have any effect unfortunately. (I'm determining it didn't have an effect by viewing the "Networking" section of the App Runner UI. It has some security groups listed, but they aren't the ones I put into the Service's manifest.yml.)

This feels likely because I'm configured as a "Request-Driven Web Service"? The manifest documentation for that does not include the network.vpc.security_groups key at all. So of course it was even less likely to work.

[3] It's not entirely clear to me why we override things in the service manifest.yml's environments map versus, say, sticking them into environment files. (I'd previously been using that map only for variable overrides per environment.)

[4] I read through the differences between the different types of application service manifests when I first set this up a few months ago. I for some reason could not determine that one was obviously more-correct for us than another. Our EC2 instances are on a private network, accessible only to Internet Traffic through a Load Balancer or to developers through a bastion EC2 instance. So I might've thought I wanted a "backend service". But I definitely do want Internet-users somehow getting to it, so I thought maybe "Request-Driven Web Service" or "Load Balanced Web Service". Our app will primarily experience traffic during business hours in US timezones, so "Request-Driven" seemed more appropriate. But I'm not sure if that was a big mistake and if I can even change it at this point.

Thanks for all you do.

bvtujo commented 1 year ago

@Offlein thanks so much for the kind words. Your use case makes a lot of sense and is quite common; we definitely want to make sure it's easy to connect your services to existing security groups. I agree that you might actually want a LBWS, but there are workarounds to make this easier on you so you don't have to migrate.

We actually support connecting App Runner to VPC resources. This feature shipped after App Runner launched and involves a resource called an AWS::AppRunner::VpcConnector. This resource allows App Runner to talk to services in a VPC, with or without specific security groups.

When you specify private placement for a RDWS:

network:
  vpc:
    placement: private

Copilot will create a VPC Connector for you, and allow it to talk to the EnvironmentSecurityGroup that Copilot creates and which all services in an env use to communicate.

To connect App Runner to another security group, after setting placement to private and deploying the service, you'll probably have to add some custom ingress and egress rules. You can model these in CFN with the AWS::EC2::SecurityGroupIngress and AWS::EC2::SecurityGroupEgress constructs. Copilot lets you deploy additional CFN resources via the addons functionality.

For example, to configure addons for your App Runner service, you'd create the following files:

./copilot/yourservice/
└── addons/
    ├── template.yml
    └── addons.parameters.yml

# template.yml
Parameters:
  App:
    Type: String
  Env:
    Type: String
  Name:
    Type: String
  ServiceSecurityGroup:
    Type: String
  RDSSecurityGroup:
    Type: String
Resources:
  ServiceSecurityGroupEgressToRDSSecurityGroup:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1
      DestinationSecurityGroupId: !Ref RDSSecurityGroup
  RDSSecurityGroupIngressFromServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref RDSSecurityGroup
      SourceSecurityGroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1

# addons.parameters.yml
Parameters:
  RDSSecurityGroup: ${REPLACE_ME_SG_ID}
  ServiceSecurityGroup: !Ref ServiceSecurityGroup # This references the security group from the parent workload template.

I hope this helps for your use case.
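One way to picture what the two resources in template.yml accomplish is the simplified Python model below. It is only a sketch: `can_connect` is a hypothetical helper, it considers only rules that reference another security group, and it ignores ports, protocols, and CIDR rules (the template uses IpProtocol -1, meaning all protocols). The group IDs are placeholders.

```python
def can_connect(src_sg, dst_sg, egress_rules, ingress_rules):
    """Simplified model: traffic flows when the source group has an egress rule
    naming the destination group and the destination group has an ingress rule
    naming the source group. Each rule is a (group_id, peer_group_id) pair."""
    return (src_sg, dst_sg) in egress_rules and (dst_sg, src_sg) in ingress_rules


SERVICE_SG = "sg-service"  # placeholder for ServiceSecurityGroup
RDS_SG = "sg-rds"          # placeholder for RDSSecurityGroup

# Before the addon is deployed: no rules tie the two groups together.
print(can_connect(SERVICE_SG, RDS_SG, set(), set()))

# After the addon: one egress rule on the service group, one ingress rule on the RDS group.
egress = {(SERVICE_SG, RDS_SG)}   # ServiceSecurityGroupEgressToRDSSecurityGroup
ingress = {(RDS_SG, SERVICE_SG)}  # RDSSecurityGroupIngressFromServiceSecurityGroup
print(can_connect(SERVICE_SG, RDS_SG, egress, ingress))
```

This matches the "StageSG" arrangement described earlier in the thread, where the database's group allows access from the group attached to the instances.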

bvtujo commented 1 year ago

For the different values per environment problem, you can use Mappings in your addons template, like so:

#template.yml
Transform: 'AWS::LanguageExtensions'
Parameters:
  App:
    Type: String
  Env:
    Type: String
  Name:
    Type: String
  ServiceSecurityGroup:
    Type: String
Mappings:
  RDSSecurityGroupIdMap:
    test:
      "Id": sg-1234
    prod:
      "Id": sg-5678
Conditions:
  # The DefaultValue option of FindInMap relies on the AWS::LanguageExtensions transform.
  RecognizedEnvironment: !Not [ !Equals [ noEnvironment, !FindInMap [ RDSSecurityGroupIdMap, !Ref Env, Id, { DefaultValue: noEnvironment } ] ] ]
  UnrecognizedEnvironment: !Not [ !Condition RecognizedEnvironment ]

Resources:
  NewSecurityGroup:
    Condition: UnrecognizedEnvironment
    Type: AWS::EC2::SecurityGroup
#...
  NewSGIngressFromServiceSG:
    Condition: UnrecognizedEnvironment
    Type: AWS::EC2::SecurityGroupIngress

  RDSSecurityGroupIngressFromServiceSG:
    Condition: RecognizedEnvironment
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !FindInMap
        - RDSSecurityGroupIdMap
        - !Ref Env
        - Id
      SourceSecurityGroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1

# addons.parameters.yml
Parameters:
  ServiceSecurityGroup: !Ref ServiceSecurityGroup

edited to include the new default value feature for FindInMap and conditional logic to create a new SG and ingress if the env isn't recognized
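
The per-environment lookup and the conditional resources can be sketched in Python as follows. This only mimics the intent of the template above; `find_in_map` and `plan` are hypothetical helpers, and the IDs are the placeholder values from the Mappings section.

```python
RDS_SG_MAP = {"test": {"Id": "sg-1234"}, "prod": {"Id": "sg-5678"}}
DEFAULT = "noEnvironment"  # sentinel returned when the environment is not in the map


def find_in_map(mapping, top_key, second_key, default=None):
    """Look up mapping[top_key][second_key], falling back to default if given."""
    try:
        return mapping[top_key][second_key]
    except KeyError:
        if default is None:
            raise
        return default


def plan(env):
    """Return the logical resource names that would be created for env."""
    sg_id = find_in_map(RDS_SG_MAP, env, "Id", default=DEFAULT)
    recognized = sg_id != DEFAULT
    if recognized:
        # Known environment: only add an ingress rule to the existing group.
        return ["RDSSecurityGroupIngressFromServiceSG"]
    # Unknown environment: create a new group plus its ingress rule.
    return ["NewSecurityGroup", "NewSGIngressFromServiceSG"]


for env in ("test", "prod", "dev"):
    print(env, plan(env))
```

With this layout, adding a new environment to the map is the only change needed to reuse an existing security group there; any environment left out falls through to the new-group branch.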

Offlein commented 1 year ago

@bvtujo Just wanted to say thanks for all your help here. I had a bit of trouble understanding the whole interplay of Copilot, App Runner [sometimes] and ECS [sometimes], and CloudFormation, but after doing a lot of reading and fiddling, I think I'm in a better spot, thanks largely to your support.

This issue can be closed!

aws / copilot-cli

Incorrect environment manifest documentation leads to unhelpful YAML "unmarshal" error #5266