AndrewGuenther / fck-nat

Feasible cost konfigurable NAT: An AWS NAT Instance AMI
https://fck-nat.dev
MIT License
1.33k stars, 53 forks

Problem starting fck-nat #84

Closed. arthurl31 closed this issue 5 months ago

arthurl31 commented 5 months ago

Troubleshooting Lambda Internet Access Using a NAT Instance

I'm having trouble getting my Lambda function to access the internet using a NAT Instance (fck-nat). To diagnose the issue, I decided to SSH into my instance.

NAT Instance Status

On my instance, the status of fck-nat is as follows:

[ec2-user@ip-10-0-1-9 ~]$ sudo systemctl status fck-nat.service
● fck-nat.service - Configure this machine to act as a NAT instance. fck-nat.
   Loaded: loaded (/etc/systemd/system/fck-nat.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2024-06-10 12:15:27 UTC; 6min ago
 Main PID: 902 (fck-nat.sh)
   CGroup: /system.slice/fck-nat.service
           ├─ 902 /bin/sh /opt/fck-nat/fck-nat.sh
           └─2520 sleep 1

Jun 10 12:21:30 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Device "eth1" does not exist.
Jun 10 12:21:30 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Waiting for ENI to come up...
Jun 10 12:21:31 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Device "eth1" does not exist.
Jun 10 12:21:31 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Waiting for ENI to come up...
Jun 10 12:21:32 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Device "eth1" does not exist.
Jun 10 12:21:32 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Waiting for ENI to come up...
Jun 10 12:21:33 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Device "eth1" does not exist.
Jun 10 12:21:33 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Waiting for ENI to come up...
Jun 10 12:21:34 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Device "eth1" does not exist.
Jun 10 12:21:34 ip-10-0-1-9.ec2.internal fck-nat.sh[902]: Waiting for ENI to come up...

My fck-nat.conf is:

[ec2-user@ip-10-0-1-9 ~]$ cat /etc/fck-nat.conf
eni_id=eni-044887af1d80c4ca1

My Network Interface

image

My Route Table

image

Running curl or ping from the instance itself (connected via SSH) works, but requests made from my Lambda function time out.

Also, my Lambda is in a private subnet that uses the route table in the image above.

Network Config Used by the Lambda Function


  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref MyVPC
      CidrBlock: 10.0.4.0/24
      AvailabilityZone: 'us-east-1a'

  PrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref MyVPC

  PrivateRoute:
    Type: AWS::EC2::Route
    DependsOn: PrivateSubnet
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      InstanceId: !Ref FckNatInstance

  SubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PrivateSubnet
      RouteTableId: !Ref PrivateRouteTable
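
The logs above show fck-nat waiting for a second interface (eth1) for the configured eni_id, while the route above targets the instance. For comparison, here is a minimal sketch of the same route targeting the ENI from the template (FckNatInterface) through the NetworkInterfaceId property of AWS::EC2::Route. It assumes the ENI is the interface that should receive the private subnet's traffic; it is not the confirmed fix for this issue.

```yaml
  # Sketch only: same default route, but pointed at the ENI instead of the instance.
  # Assumes FckNatInterface is the interface fck-nat attaches via eni_id.
  PrivateRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NetworkInterfaceId: !Ref FckNatInterface
```

A route tied to the ENI keeps pointing at the same target even if the instance behind it is replaced, which is one possible reason to prefer it here.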

My fck-nat instance configuration:


  NatSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security Group for NAT
      VpcId: !Ref MyVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22  # SSH port
          ToPort: 22
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 0
          ToPort: 65535
          SourceSecurityGroupId: !Ref LambdaSecurityGroup # Allow traffic from Lambda function
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          IpProtocol: "-1"  # Allow all outbound traffic

  # Network Interface
  FckNatInterface:
    Type: AWS::EC2::NetworkInterface
    Properties:
      SubnetId: !Ref PublicSubnet
      GroupSet:
        - !Ref NatSecurityGroup
      SourceDestCheck: false

  NatRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: attachNatEniPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:AttachNetworkInterface
                  - ec2:ModifyNetworkInterfaceAttribute
                Resource: "*"
        - PolicyName: associateNatAddressPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:AssociateAddress
                  - ec2:DisassociateAddress
                Resource: "*"
        - PolicyName: SSMandEC2MessagesPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ssm:DescribeAssociation
                  - ssm:GetDeployablePatchSnapshotForInstance
                  - ssm:GetDocument
                  - ssm:DescribeDocument
                  - ssm:GetManifest
                  - ssm:ListAssociations
                  - ssm:ListInstanceAssociations
                  - ssm:PutInventory
                  - ssm:PutComplianceItems
                  - ssm:PutConfigurePackageResult
                  - ssm:UpdateAssociationStatus
                  - ssm:UpdateInstanceAssociationStatus
                  - ssm:UpdateInstanceInformation
                Resource: "*"
              - Effect: Allow
                Action:
                  - ssmmessages:CreateControlChannel
                  - ssmmessages:CreateDataChannel
                  - ssmmessages:OpenControlChannel
                  - ssmmessages:OpenDataChannel
                Resource: "*"
              - Effect: Allow
                Action:
                  - ec2messages:AcknowledgeMessage
                  - ec2messages:DeleteMessage
                  - ec2messages:FailMessage
                  - ec2messages:GetEndpoint
                  - ec2messages:GetMessages
                  - ec2messages:SendReply
                Resource: "*"

  FckNatAsgInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - Ref: NatRole

  # NAT Instance (a plain EC2 instance, not a launch template)
  FckNatInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t4g.nano
      ImageId: ami-05b6d5a2e26f13c93
      SubnetId: !Ref PublicSubnet
      SecurityGroupIds:
        - !Ref NatSecurityGroup
      IamInstanceProfile: !Ref FckNatAsgInstanceProfile
      KeyName: key-dev
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          echo "eni_id=${FckNatInterface}"
          echo "eni_id=${FckNatInterface}" >> /etc/fck-nat.conf
          sudo systemctl restart fck-nat.service

Am I missing or misconfiguring something?

AndrewGuenther commented 5 months ago
        - IpProtocol: tcp
          FromPort: 0
          ToPort: 65535
          SourceSecurityGroupId: !Ref LambdaSecurityGroup # Allow traffic from Lambda function

Could you open up the security groups and see if that's the issue? Just allow ingress from all IPs in your VPC at the very least and see if that works?

Scanning your config, this should all be working and I'm not seeing any issues in the included logs. (Thanks for the detailed report btw!)
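
The suggestion to allow ingress from every IP in the VPC could look roughly like the fragment below. The CIDR 10.0.0.0/16 is an assumption based on the 10.0.x.x addresses in this report; the actual range of MyVPC is not shown in the issue.

```yaml
      SecurityGroupIngress:
        # Assumed VPC CIDR; replace 10.0.0.0/16 with the real range of MyVPC.
        - IpProtocol: "-1"       # all protocols, not only TCP
          CidrIp: 10.0.0.0/16
```

Unlike the original rule, this also admits non-TCP traffic such as ICMP and UDP from inside the VPC, which helps rule out the security group as the cause while testing.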

arthurl31 commented 5 months ago

I've updated my CloudFormation template for the NAT security group, but I'm still facing issues:

  NatSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security Group for NAT
      VpcId: !Ref MammycardVPC
      SecurityGroupIngress:
        - IpProtocol: "-1"
          CidrIp: 0.0.0.0/0  # Allow all ingress traffic from all IP addresses
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          IpProtocol: "-1"  # Allow all outbound traffic

When I connect to the NAT instance using SSH, the fck-nat.service fails to start. The configuration file /etc/fck-nat.conf contains a valid eni_id, as shown in previous screenshots.

However, attempting to start the service results in what seems to be an infinite loop. Below is a screenshot showing the never-ending loop from the last executed command (sudo systemctl start fck-nat.service):

image

Additional Information

PS: since the security group is now public, I could send you the SSH key if you need it, no problem at all.

arthurl31 commented 5 months ago

Update

I'm not entirely sure what caused the issue, but it seems there was a bug when the instance initially started.

I've updated the NAT Instance template, rerun it (recreating the instance), and now it's functioning properly.

My Lambda function can now connect to the internet without any issues.

I've reverted my security group settings to only allow access for my Lambda function and my IP via SSH, and it's still working fine.
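
The reverted rules are not included in the issue. A sketch of what such a restricted security group might look like follows; the SSH source address 203.0.113.10/32 is a placeholder from the documentation range, not the author's actual IP, and the Lambda rule is copied from the original template.

```yaml
  NatSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security Group for NAT
      VpcId: !Ref MyVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 203.0.113.10/32  # placeholder; use your own public IP
        - IpProtocol: tcp
          FromPort: 0
          ToPort: 65535
          SourceSecurityGroupId: !Ref LambdaSecurityGroup  # traffic from the Lambda function
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          IpProtocol: "-1"  # allow all outbound traffic
```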

This is the EC2::Instance definition for my NAT Instance, in case anyone needs it:

  FckNatInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t4g.nano
      ImageId: ami-05b6d5a2e26f13c93
      SubnetId: !Ref PublicSubnet
      SecurityGroupIds:
        - !Ref NatSecurityGroup
      IamInstanceProfile: !Ref FckNatAsgInstanceProfile
      KeyName: "key-dev"
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          echo "eni_id=${FckNatInterface}" >> /etc/fck-nat.conf
          exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
          echo "=== Instance Initialization Started ==="
          sudo systemctl enable fck-nat.service
          sudo systemctl restart fck-nat.service
          sudo systemctl status fck-nat.service > /var/log/fck-nat-service-status.log

Thank you very much for this fantastic project and for your support.

AndrewGuenther commented 5 months ago

Glad you got it working!

Without any specific error info I'm gonna close this one out, but I'll keep this in mind in case we get reports of similar behavior.