aws-samples / aws-kube-code-service

The Code Services Continuous Deployment reference architecture demonstrates how to achieve continuous deployment of an application to a Kubernetes cluster using AWS CodePipeline, AWS CodeCommit, AWS CodeBuild and AWS Lambda.
Apache License 2.0

Name or service not known #4

Open rubensdevito opened 6 years ago

rubensdevito commented 6 years ago

When Lambda tries to deploy the changes it fails. Here's the CloudWatch Logs dump:

START RequestId: f5ff58dd-fc68-11e7-8aaf-910e87942b5f Version: $LATEST

XXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/k8s-c-repos-1bdxoih448581 d8d49eb0 codesuite-demo

2018-01-18 16:02:22,662 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3a7f0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

[WARNING] 2018-01-18T16:02:22.662Z f5ff58dd-fc68-11e7-8aaf-910e87942b5f Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3a7f0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

2018-01-18 16:02:22,663 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3afd0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

[WARNING] 2018-01-18T16:02:22.663Z f5ff58dd-fc68-11e7-8aaf-910e87942b5f Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3afd0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

2018-01-18 16:02:22,665 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3a7b8>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

[WARNING] 2018-01-18T16:02:22.665Z f5ff58dd-fc68-11e7-8aaf-910e87942b5f Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3a7b8>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo

HTTPSConnectionPool(host='XXXXXXXXXXXXXXXXXX.us-west-2.elb.amazonaws.com', port=443): Max retries exceeded with url: /apis/extensions/v1beta1/namespaces/default/deployments/codesuite-demo (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7567c3a518>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Here's some information about my k8s cluster:

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T11:52:23Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"9befc2b8928a9426501d3bf62f72849d5cbcd5a3", GitTreeState:"clean", BuildDate:"2017-11-20T05:17:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

kubeProxyVersion: v1.8.4
kubeletVersion: v1.8.4
KOPS version: Version 1.8.0

ghost commented 6 years ago

@rubensdevito

I guess your K8S cluster is running inside a VPC and the K8S API server resolves to an external IP address (see Jeff Barr's article here). To check this, go to Route 53 and see whether your cluster's records look like this:

api.k8s.example.com    A    52.xxx.xxx.xxx
api.internal.k8s.example.com    A    172.20.xxx.xxxx
etcd-a.internal.k8s.example.com    A    172.20.xxx.xxxx
etcd-events-a.internal.k8s.example.com    A    172.20.xxx.xxxx
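The "[Errno -2] Name or service not known" in the log is a name-resolution failure, which Python raises as `socket.gaierror`. A minimal sketch for checking what a hostname resolves to from wherever the code runs (for example, temporarily inside the Lambda function); `resolve_host` is a hypothetical helper, not part of this repo:

```python
import socket


def resolve_host(hostname, port=443):
    """Return the sorted IP addresses that hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        # "[Errno -2] Name or service not known" surfaces here as socket.gaierror
        print(f"cannot resolve {hostname}: {err}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})
```

Calling it with both `api.k8s.example.com` and `api.internal.k8s.example.com` (the placeholder names from the records above) should show whether the function can resolve the public name, the internal name, or neither.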

To work around this issue, change the kubeconfig template (kube-manifests/config) to:

apiVersion: v1
clusters: 
- cluster:
    certificate-authority-data: $CA
    server: https://api.internal.$ENDPOINT
  name: $ENDPOINT
contexts:
- context:
    cluster: $ENDPOINT
    user: $ENDPOINT
  name: $ENDPOINT
current-context: $ENDPOINT
kind: Config
preferences: {}
users:
- name: $ENDPOINT
  user:
    client-certificate-data: $CLIENT_CERT
    client-key-data: $CLIENT_KEY
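The `$NAME` placeholders in this template happen to match the syntax of Python's `string.Template`. A minimal sketch of filling them in that way follows; this is an assumption for illustration, and the way kube-lambda.py actually renders the file may differ. `render_kubeconfig` is a hypothetical name:

```python
from string import Template

# Same template as above, with the server pointed at the internal API record
KUBECONFIG_TEMPLATE = Template("""\
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: $CA
    server: https://api.internal.$ENDPOINT
  name: $ENDPOINT
contexts:
- context:
    cluster: $ENDPOINT
    user: $ENDPOINT
  name: $ENDPOINT
current-context: $ENDPOINT
kind: Config
preferences: {}
users:
- name: $ENDPOINT
  user:
    client-certificate-data: $CLIENT_CERT
    client-key-data: $CLIENT_KEY
""")


def render_kubeconfig(endpoint, ca, client_cert, client_key):
    """Substitute all four placeholders; raises KeyError if one is missing."""
    return KUBECONFIG_TEMPLATE.substitute(
        ENDPOINT=endpoint, CA=ca, CLIENT_CERT=client_cert, CLIENT_KEY=client_key
    )
```

For example, `render_kubeconfig("k8s.example.com", ca, cert, key)` yields a config whose server is `https://api.internal.k8s.example.com`.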

If you work with the original K8S-generated kubeconfig with RBAC (as I do), the "password" and "username" fields are needed after the "client-key-data" field. Of course, Lambda and K8S need to be on the same subnet, and Lambda and the master(s) need to be in the same security group.

BTW, although I got it deployed successfully (the image ID really did change on the K8S side), it still times out (it never gets as far as posting success to CodePipeline).

ghost commented 6 years ago

I finally got it working. This thread helped (look at the post by BEm on Jun 29, 2017 3:09 AM):

https://forums.aws.amazon.com/thread.jspa?threadID=231990

dustyketchum commented 6 years ago

Can someone post complete instructions that work to fix this? I've added the lambda function to the VPC, assigned subnets to the lambda function, assigned security groups to the lambda function, and added the AWSLambdaVPCAccessExecutionRole policy to the roles created for the lambda function. Nothing helps, error message doesn't change, deployments fail.

omarlari commented 6 years ago

Hi - the problem is that this Lambda and cfn are structured so that you need a public DNS record; they assume the cluster is accessible via a Route 53 record. Here are some of my thoughts on updates/changes/options:

  1. Change from Python client to Go client (just need to do this, Go is preferred language in the k8s community)

  2. Make config more adaptable to accommodate various auth methods

    • Have user upload their config file to encrypted/iam controlled s3 bucket (not a fan of this option, but it is super easy to implement)
    • Have user create RBAC credentials in k8s prior to deploying, then have cfn parameters to ingest those secrets into SSM/Parameter store

  3. Solve gossip/private endpoint problem

    • For gossip this is pretty simple, just need to add a few parameters to properly build the kubernetes config file
    • For private, will need to have the Lambda function inside the VPC with either IGW/NATGW access to retrieve assets from s3. We can implement conditions in the cfn template to deploy Lambda with vpc endpoint or not, depending on the type of cluster (gossip/public vs private).

Thoughts? Also, would love some help from anyone!

ghost commented 6 years ago

@dustyketchum

I think you're missing:

  1. The subnet assigned must be a PRIVATE subnet within the same VPC as k8s, even if your k8s is located in a public subnet.
  2. The private subnet must have NAT (NAT gateway or NAT instance) and proper routing.
  3. The Lambda function must be assigned to the private subnet.

Below is the actual architecture diagram, although we use Github not CodeCommit.

[image: workflowdetail]

@omarlari

Actually, I really don't know whether EKS will change everything and whether CodeDeploy will consequently get options to deploy to EKS. In that case, contributors might think, "Why should I work on something that will soon be updated?"

dustyketchum commented 6 years ago

@minghsieh-prenetics thanks, our subnets already had internet access, this wasn't our issue. My earlier message assumed that AWS networking was set up 'properly' w/ NATs, internet access available in private VPCs, etc. though I didn't explicitly state all that.

I believe the first problem is the instructions assume you have created a publicly available kubernetes cluster or you're using ec2 classic without a vpc (or perhaps both) - in either case, that assumption should be explicitly documented. This cloudformation template won't work as is for anyone with a cluster in a private network in a vpc.

This was my first exposure to lambda which made troubleshooting more challenging. I believe the changes I needed to make were, in order:

  1. I had the wrong api endpoint, I had failed to remove the leading 'api' from the fqdn when I passed it to cloudformation. The readme says "enter only the subdomain and omit 'api'" but the parameter description in the cloudformation template is missing this information. Cloudwatch logs helped here, I was able to see the wrong endpoint in the logs.
  2. Assign the AWSLambdaVPCAccessExecutionRole IAM policy to the new '...codepipelinelambdarole...' role created by the cloudformation template.
  3. Update the new '...Pipeline-xxxx-LambdaKubernetesDeployme..' lambda function created by the cloudformation template to add it to the correct vpc, add it to the correct subnets (the same subnets as your kube api endpoint), and add it to the correct security group(s) (SGs capable of using SSL to communicate with the kube api endpoint). If you try to assign the lambda function to the VPC (this step) before adding the AWSLambdaVPCAccessExecutionRole policy to your IAM role (prior step above), you get a nice helpful error message telling you exactly what you need to do, but what isn't necessarily obvious unless you've worked with lambda before is that the lambda function does NOT get added to the vpc when you see that error...
  4. Changing the DeploymentName parameter in the cloudformation template does not seem to work, leave the default. Cloudwatch logs again helped here.
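Item 1 is easy to get wrong by hand. A minimal sketch of a guard that strips the leading 'api' label (and any scheme) before the value is passed on; `normalize_endpoint` is a hypothetical helper and is not part of the cloudformation template:

```python
def normalize_endpoint(fqdn):
    """Return the bare subdomain: no scheme, no trailing slash, no leading 'api.'."""
    host = fqdn.strip()
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    host = host.rstrip("/")
    # The readme asks for the subdomain only, with 'api' omitted
    if host.startswith("api."):
        host = host[len("api."):]
    return host
```

With this, `api.k8s.example.com` and `https://api.k8s.example.com/` both become `k8s.example.com`, and an already-bare `k8s.example.com` passes through unchanged.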

The cloudformation template could be updated to handle items 2 and 3 without too much trouble (ask for the vpc, subnet(s), and security group as cloudformation parameters).
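Until the template does that, items 2 and 3 can be scripted in the same order. A rough sketch, assuming boto3 clients are passed in (`boto3.client("iam")` and `boto3.client("lambda")`); the function name and arguments are hypothetical, and the role name, function name, subnets and security groups are whatever your stack created:

```python
# AWS managed policy that allows a Lambda function to create network interfaces in a VPC
VPC_ACCESS_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
)


def move_lambda_into_vpc(iam_client, lambda_client, role_name, function_name,
                         subnet_ids, security_group_ids):
    """Attach the VPC access policy first (item 2), then set the VPC config (item 3)."""
    if not subnet_ids or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")

    # Item 2: without this policy the VPC assignment below is rejected
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=VPC_ACCESS_POLICY_ARN)

    # Item 3: same subnets as the kube api endpoint, SGs that can reach it over SSL
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        VpcConfig={
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    )
```

IAM changes can take a moment to propagate, so if the second call fails right after the first, retrying after a short wait may be needed.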

Thanks, Dusty

StevenACoffman commented 6 years ago

@minghsieh-prenetics Do you have a reference implementation for your lambda deploy into kubernetes?

ghost commented 6 years ago

@StevenACoffman Yes, I do. But it's not much different from this:

https://github.com/aws-samples/aws-kube-codesuite/blob/master/src/kube-lambda.py

Actually the essential part of this repo is just this kube-lambda.py file. Don't let other dependent files confuse you.

StevenACoffman commented 6 years ago

Ah thanks @minghsieh-prenetics ! I see that this lambda is also necessary: https://github.com/aws-samples/aws-kube-codesuite/blob/master/templates/ssm-inject.yaml

However, without the other cloudformation machinery, I'm not clear on how to get the eks client cert and client cert key. I can get the other two bits to set up parameter store trivially:

  ENDPOINT=$(aws eks describe-cluster --region us-east-1 --name $CLUSTERNAME --query cluster.endpoint)
  CA=$(aws eks describe-cluster --region us-east-1 --name $CLUSTERNAME --query cluster.certificateAuthority.data)

I launched my eks via the web console, so I am not sure how to get these other two pieces. Any help would be greatly appreciated!

ghost commented 6 years ago

@StevenACoffman

  1. My stack was built for KOPS. At that time, EKS was not available.
  2. How about just creating a service account in EKS / K8S and granting it only what it's supposed to do? In fact this is the better practice, since we don't mess with the root credentials: if somebody steals the credentials from CodePipeline, you have a way to counter it.

StevenACoffman commented 6 years ago

Ah, so manually use EKS-authenticated kubectl to create a Kubernetes service account, retrieve that cert and key, and save those to Parameter Store like this example. Not as one-click or fully automated, but since it's only done once, it could be OK. Thanks!
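The "save those to parameter store" step could look roughly like this. It is a sketch only: the SSM client is passed in (`boto3.client("ssm")`), and the `/kube-deploy` prefix and key names are hypothetical, so they would need to match whatever names the Lambda actually reads:

```python
def store_kube_credentials(ssm_client, prefix, values):
    """Write each credential as an encrypted SecureString parameter under prefix."""
    written = []
    for key, value in sorted(values.items()):
        name = f"{prefix}/{key}"
        ssm_client.put_parameter(
            Name=name,
            Value=value,
            Type="SecureString",  # encrypted at rest
            Overwrite=True,       # allow re-running after rotating credentials
        )
        written.append(name)
    return written
```

For example, `store_kube_credentials(ssm, "/kube-deploy", {"CA": ca, "ClientCert": cert, "ClientKey": key})` writes three parameters and returns their names.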

BranLiang commented 5 years ago

@StevenACoffman You saved my day. Thanks

StevenACoffman commented 5 years ago

Great! Check out these for more details on the two viable approaches: