AWS Controllers for Kubernetes (ACK)

mhausenblas commented 5 years ago

Many of you are aware of the AWS Service Operator (ASO) project we launched a year ago. We reviewed the setup with existing stakeholders and decided to turn it into a first-tier OSS project with concrete commitments from the service team side, based on the following tenets:

ASO is a community-driven project, based on a governance model defining roles and responsibilities.
ASO is optimized for production usage with a full test coverage including performance and scalability test suites.
ASO strives to be the only codebase exposing AWS services via a Kubernetes operator.

Going forward, we will archive the existing awslabs/aws-service-operator repo and create a new one under the aws organization, with the goal to re-use as much of the existing code base as possible. In this context, we will introduce the high-level changes, including re-platforming on Kubebuilder, which should help lower the barrier to contribute, and clarifying the access control setup (see also #23).

At the current point in time, we'd like to gather feedback from you concerning:

What are the most important AWS services for you (S3, DynamoDB, etc.) to be prioritised?
What are your use cases for ASO?
Would you be interested in contributing, and if so, in which capacity (anything from reporting on usage to docs to code contributions)?

I'm super excited about what has been achieved so far thanks to @christopherhein's excellent bootstrapping work and looking forward to take the next steps, together with you.

===

UPDATE@2020-07-17: we renamed the project to AWS Controllers for Kubernetes (ACK).

UPDATE@2020-08-19: ACK is now available as a developer preview

christopherhein commented 5 years ago

Excited to see this coming out!

@mhausenblas is the idea to hand build each resource going forward using kubebuilder or is there plans make it more maintainable long term with code generation? This was one of the pitfalls that I found with the current version it didn't scale by building more and more by hand.

For use cases make sure to look through the existing issues, you'll see a lot of folks have written out what they see as important services.

Do you have any concerns with it being the only code base and ending up in a monorepo style like Kubernetes is where we're starting to break out resources into discrete components to help maintain them better given ownership?

What does the commitment look like from the service team side? Will they be developing controllers? Will other teams be contributing?

👍

mhausenblas commented 5 years ago

Thanks @christopherhein and happy to see your continued interest and support for ASO!

is the idea to hand build each resource going forward using kubebuilder or is there plans make it more maintainable long term with code generation?

The goal is to automate as much as possible, enabling individual service teams to own their parts.

For use cases make sure to look through the existing issues, you'll see a lot of folks have written out what they see as important services.

Yup, we did, now asking for more/new insights based on experiences so far ;)

Do you have any concerns with it being the only code base and ending up in a monorepo style like Kubernetes is where we're starting to break out resources into discrete components to help maintain them better given ownership?

I don't have a strong opinion on this at the current point in time.

What does the commitment look like from the service team side? Will they be developing controllers? Will other teams be contributing?

Details to follow, but yes, see above, the goal is to enable other service teams.

Thanks, again, and hope we can benefit from your expertise, going forward.

colinhoglund commented 5 years ago

ASO strives to be the only codebase exposing AWS services via a Kubernetes operator.

@mhausenblas Awesome news! Do you happen to know what impact, if any, this will have on the https://github.com/awslabs/aws-servicebroker project?

dcherman commented 5 years ago

As a general comment on all resources, it would be great if we had some way to define IAM role(s) with differing permissions, or maybe some kind of iamRoleRef that would allow you to add policies/permissions to an existing role that was created by an IAMRole CRD. The latter is likely much more flexible.

The use case for that would be that as it stands with the operator now, there is no way to create IAM roles or add policies to existing ones, so you would either need to modify a role manually in order to use kube2iam/kiam, or you would end up needing to give the node where the containers are running way too many privs.

mhausenblas commented 5 years ago

@colinhoglund I have unfortunately no information available on the Service Broker project.

mhausenblas commented 5 years ago

@dcherman agreed and it was no coincidence I referenced #23 in this context ;)

Once we have the new repo, keep an eye out for the respective issue, but in a nutshell, yes that's the idea.

christopherhein commented 5 years ago

@jaymccon

Spareo commented 5 years ago

I think IAM roles would be one pf the most useful features. However, I feel like an operator that makes IAM. roles can be as dangerous as it is convenient so a lot of thought would have to go into how to lock that operator down in case an attacker was able to get onto that pod.

ECR Repos with the ability to create and attach policies to them would be very useful as well.

Private link services and endpoints for exposing services from inside our EKS clusters to on premises would be very useful (I'm building one right now for my company).

One challenge that we have run into with AWS and creating resources with operators is how to expose the ARNs or names of the resources we create back to other devs who have access to specific namespaces in the cluster but no access to AWS console.

ahilsend commented 5 years ago

Being able to manage S3 buckets, and RDS including IAM roles for accessing those would be great.

I imagine dealing with all IAM could be difficult, though exposing only the ones needed for the specific resources should be more feasible.

gjmveloso commented 5 years ago

Excited seeing this becoming public and would love to contribute with docs, code, especially test coverage/automation, and use cases.

Services to be prioritized from my perspective/experience:

EKS/ECS/Fargate themselves
AWS Batch
EC2 Fleet (including Spot)
S3
DynamoDB
RDS, including Aurora
ElastiCache
Step Functions
Lambda
Everything else 😇

LeoAdamek commented 5 years ago

I agree with others that IAM roles/policies is probably the most important thing to get right, though perhaps this needs to be coordinated with kube2iam.

Ideally a CRD would exist to create IAM roles with all policies with a mechanism to reference other AWS resources created by ASO CRDs, and kube2iam could then introduce a new annotation or format to reference an IAM role by its CRD name rather than the ARN of the created role.

e.g.

apiVersion: service-operator.aws/v1alpha1
kind: IAMRole
metadata:
  name: ServiceRole
spec:
  # AssumeRolePolicyDocument would be auto-generated along with role name and path.
  policies:
    - name: Access CRDS3Bucker
      effect: Allow
      action: ['s3:PutObject','s3:GetObject','s3:DeleteObject']
      resource:
        - crd: {s3: NameOfAnS3BucketCRD }
---
apiVersion: v1
kind: Pod
metadata:
  name: Some Pod
  annotations:
     kube2iam/role-crd: ServiceRole
spec: # Pod Spec

As time permits would be interested to contribute code to make this happen.

webframp commented 5 years ago

We have been using the existing ASO for a few months now. After testing out the service broker approach more than a year ago.

The addition of kube2iam in our cluster has not been a problem for us, though tighter integration with IAM via something like #23 would be great if it removes the need for kube2iam entirely.

We run a custom built ASO image in our clusters that extends the available services and we appreciated how relatively easy it was to do this once we understood how to extend the existing models. The model <> cfn template mapping currently used makes it nice took a little work to learn how to extend. Ease of extensibility is important to us for whatever future design is used.

We have added our own models to support the following:

RDS (MySQL, Postgres, Aurora)
Cloudfront
DocumentDB
Elasticache
Route53 healthchecks
SNS Subscriptions (overrides the broken built in model)

In terms of ordering for officially supported resources, RDS, DocumentDB and Lambda would be high on the list I'd think.

One challenge we found is secret handling. We would very much like to see some integration with EKS such that secrets could easily be ref'd somehow, even possibly backed by SSM parameters.

jwenz723 commented 5 years ago

Would love to see the following services supported:

DynamoDB
Kinesis
SecretsManager
SNS
SQS

mtparet commented 5 years ago

We are deploying at least one new application in our kubernetes production by week, if we could directly create the AWS resources needed within our helm chart it would be a real game changer.

The main resources needed are:

SNS
SQS
S3
IAM
Cloufront
Route53

sftim commented 5 years ago

@Spareo wrote:

an operator that makes IAM. roles can be as dangerous as it is convenient so a lot of thought would have to go into how to lock that operator down in case an attacker was able to get onto that pod.

If the AWS service operator had a built-in understanding of IAM permissions boundaries, it could create and manage roles with a suitable boundary applied, whilst running as an IAM role that only allows creating roles that have a whitelisted boundary in place.

That takes some work to get right. It's a nice approach that mitigates some of the risks from the operator's Pod(s) or Secret(s) getting compromised.

gflarity commented 5 years ago

RDS, ElasticSearch, DynamoDB

trajakovic commented 5 years ago

We'd like to see support for:

IAM, Route 53, Cognito, S3, EFS, RDS, Neptune

flexera-cnixon commented 5 years ago

DynamoDB, Kinesis, IAM, Elasticcache (Redis specifically), s3 and slightly less useful RDS for us

moustafab commented 5 years ago

RDS, S3, Elasticache, and MSK

zihaoyu commented 5 years ago

API Gateway

rahuldivgan commented 5 years ago

RDS, Elasticache, S3, Lambda, Step Functions

daviddyball commented 5 years ago

@mhausenblas

My colleagues and I spent the last year writing/running our own aws-controller project built around Kubebuilder v1. We have support for:

IAM Roles (IAMRole)
S3 Buckets (S3Bucket)
DynamoDB Tables (DynamoDbTable)
DynamoDB Global Tables (DynamoDbGlobalTable)
SNS Topics (SNSTopic)
SNS Topic Subscriptions (SNSTopicSubscription)
SQS Queues (SQSQueue)
ElastiCache Instances (ElastiCacheInstance)
RDS Instances (RDSInstance)
KMS Keys ('KMSKey`)

All of these have Create support and some, to a degree, have Update support. We deliberately held off on adding Delete logic because, at the time of writing, finalizers weren't supported by Kubebuilder. It's on our list of TODO items :stuck_out_tongue:

Our AWS Technical Account Manager had advised us that ASO was a work-in-progress at the time, but our timeline meant that we couldn't wait for ASO to come out, so we forged ahead with our own controller. We eventually reviewed ASO and found that we preferred the Kubebuilder based aws-controller project though. I had eventually wanted to open-source it, but it would require a hefty re-write and sign-off from the higher-ups to release into the wild.

My observations with our Kubebuilder-based aws-controller so far are:

Come up with a concrete contract for transforming from CRDs to AWS API calls early on. We wrote our S3Bucket controller first and DynamoDbTable controller last. The DynamoDbTable controller has a much better way of handling CRUD operations as a result of the knowledge gained.

We ended up with helper functions on our CRDs like S3BucketSpec.getCreateBucketInput() and S3BucketSpec.getDeleteBucketInput() that returned &s3.CreateBucketInput{} or &s3.DeleteBucketInput{} objects that can be passed directly to the API from Reconcile().

We had limited success with Update logic. That usually becomes very CRD-specific functionality in the Reconcile() function. Where possible we would try to make helper functions on the CRD structs again (e.g. S3BucketSpec.getPutBucketPolicyInput() or S3BucketSpec.getPutCorsInput())
Use Open Policy Agent and write policies to block people creating IAMRole with god-mode access and make sure it's in your CI/CD pipeline to validate this stuff.

I argue strongly that it's not the controllers responsibility to manage RBAC or who can do what with CRDs, that should be OPA and/or cluster RBAC. Definitely don't try to tackle this in the controller, it's not worth the hassle.
Definitely implement finalizers (we haven't yet and as such we can't Delete anything)
Use Kubebuilder v2 over v1. There is absolutely no need to use v1 now that v2 is released. It greatly simplifies the codebase.
The biggest headache for writing a project to wrap so many APIs is tiny differences in each APIs implementation. We can't really ask AWS to re-write all their APIs to follow standards, so we just have to live with it :man_shrugging.

One of my projects over the next few months is to update our code to Kubebuilder v2. If there's enough traction on this rewrite/restructure, perhaps I can devote my time to this project rather than maintaining our own internal controller that benefits nobody else?

mhausenblas commented 5 years ago

Thanks a lot for sharing your feedback, advices, and experiences @daviddyball, very much appreciated!

If there's enough traction on this rewrite/restructure, perhaps I can devote my time to this project rather than maintaining our own internal controller that benefits nobody else?

That would be fantastic.

sebgoa commented 5 years ago

that's great news, this cannot come soon enough.

My main requirement is that this controller be able to run outside AWS infrastructure. Meaning not tied to EKS or EC2 instances. I had to do a super minor patch to ASO to get it to run in GKE:

https://github.com/awslabs/aws-service-operator/commit/9e775d1c767192d37e81a5f53ef9485769a13f43

Then IAM, S3, SQS, Kinesis and Lambda.

ankon commented 5 years ago

We have been using our own implementation of something similar: https://github.com/Collaborne/kubernetes-aws-resource-service

Based on our usage:

IAM is absolutely critical (we're using kube2iam and are considering to move to the AWS-provided IAM integration)
S3 buckets including the ability to configure policies, versioning, etc.

whereisaaron commented 5 years ago

Rather than writing operators for each AWS service API would it be better to transform k8s CRD to CloudFormation micro-stacks and leverage existing CloudFormation capability to handle updates intelligently?

AWS would need to be willing to lift the lid on CF stacks, to allow 1000’s of micro stacks, rather than the current AWS assumption that people will use only a few monolithic stacks.

mhausenblas commented 5 years ago

@whereisaaron thanks for your feedback! CF should be treated as an implementation detail and we want to abstract it. Once the new repo is available under aws GitHub org, which should be very soon, we can continue the discussion there. We, that is @jaypipes and myself, will create dedicated issues for Kubebuilder, IAM handling, etc. in the new repo and that's the best place to have this convo.

daviddyball commented 5 years ago

I'd personally vote against using CF to implement any of the controller logic. To me using the APIs directly for controller implementation is a much cleaner approach. Like you say @mhausenblas, I'm sure this can be discussed in the new repo once it's available.

Any way to get a notification when the repo becomes available?

mhausenblas commented 5 years ago

@daviddyball thanks for your feedback as well and indeed we'll make sure to announce it here on this issue when the repo is in place, so if you're subscribed here that should be sufficient.

whereisaaron commented 5 years ago

Thanks @mhausenblas! My CF comment was in relation to reviewing the implementations @ankon and @seboga offered. They look useful and real examples of an AWS Operator like that we are discussing here. But looking at them it occurred to me that handling creates, updates, rollbacks, and clean deletions with dependencies, across dozens or hundreds of APIs, is non-trivial.

CF and Terraform achieve this, tracking API changes for the APIs discussed here, with varying amount of lag. If AWS wants to repeat this effort for the cleaner approach - as @daviddyball suggested - then great! But already AWS CF can't keep up with AWS API changes, so it worries me that a separate AWS Operator implementation might suffer similar delays. Is AWS willing to commit those resources to maintain the proposed AWS Operator, when already CF appears very under-resourced for tracking those same API's?

When you do get to the implementation decisions stage, the ability to keep that implementation up to date with the AWS APIs should also be considered as part of those implementation details. No point deciding that X would be the perfect implementation choice, if that choice can't be maintained.

daviddyball commented 5 years ago

@whereisaaron my understanding of the operator pattern is that it's level-based, so there's no state tracking. Every time Reconcile() is called it's the controllers job to go and query the current "state of the land" and then decide whether it needs to make any further changes to match the incoming K8s object. Relying on something state-based like CF or Terraform goes against this as it becomes mandatory to track the state.

With regard to the feasibility of maintaining API compatibility, it'd be open-source, so anyone can commit changes and fixes if API's change over time. Other projects like Boto seem to manage, so I can't see it being an issue here (unless there is no community uptake)

whereisaaron commented 5 years ago

Good points @daviddyball 👍

cpoole commented 5 years ago

How do people feel about a native integration with terraform for handling the CRUD operations? I agree with @whereisaaron that maintaining independent lifecycle code is a massive undertaking.

@daviddyball Two huge reasons terraform keeps track of state are so that runs dont blow through rate limits and so that runs can make attributes about an object exportable to other runs. If I create an s3 bucket I need that ARN to be exported to my IAM role or to my config service. If this project becomes exclusively a stateless operator this would require every reference to refresh the state... Very quickly this would run into api limits.

Since terraform 0.12 and above are fully json compatible it would be rather simple to put the objects into the CRD and have the operator hook into the already maintained terraform provider (eg. the 2500 lines of code it takes to manage an s3 bucket https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_s3_bucket.go)

The rancher team has done something quite similar with their experimental controller https://github.com/rancher/terraform-controller

once https://github.com/hashicorp/terraform/pull/19525 is merged state can be fully managed inside of kubernetes as well.

iAnomaly commented 5 years ago

@cpoole I dislike it greatly.

@whereisaaron and yourself make great points about the cost-benefit trade-offs of leveraging state:

To optimize for rate limiting (assume you mean at the AWS API layer?)
Resource discovery via that state

...but using Terraform as the implementation feels misappropriated for use by project in my opinion. Rancher's Terraform Controller already looks like the solution for that specific implementation approach and I would hate to see work repetition and overlap here on this project.

Generally as a user of CloudFormation (CFn) for years with more recently familiarity with the Kubernetes operator and controller patterns I do mostly agree with @daviddyball's points.

I think (or at least optimistically believe) that the potential contributor scale benefits of open-sourcing a project like this provide the key differentiator to the closed-source CFn team with respect to closing coverage gaps seen today between CFn and underlying AWS API updates.

I also think you could solve some of the rate limiting and resource discovery concerns with a lighter quorum or runtime cache pattern that you see prevalent in many other parts of the Kubernetes architecture and operators/controllers that is not as heavy or critical as Terraform's durable state. The excellent AWS ALB Ingress Controller project for example builds a cache state model at startup and maintains that for the purposes of minimizing API operations on each reconciliation loop when unnecessary. It rebuilds/recovers this cache if it dies/restarts which greatly reduces the operational overhead of having to store and protect state somewhere else be it StatefulSets or an external object store/database. This is one of the greatest advantages CFn provides over Terraform today in my opinion.

I'm no expert, but maybe there are already some excellent RESTful API CRUD controller Go libraries that could be core to this project and DRY out the repetition that would otherwise be required for every AWS service API that is implemented against?

Awesome discussion so far overall!

sftim commented 5 years ago

The open source code in, eg, https://github.com/terraform-providers/terraform-provider-aws is definitely available for study or reuse. I think that's the kind of sharing I'd like to see.

If anyone wants to use https://github.com/rancher/terraform-controller then I am happy for them. That's different from how I'd imagine an AWS-specific service Operator. Similarly if there's an Azure or whatever Operator.

daviddyball commented 5 years ago

@iAnomaly good points regarding caching at the controller level to reduce API calls. I've seen it done with the kiam controller for IAM-specific resources. Maybe that's something we can look into to reduce API cost.

We're already running our controller in production across 6 clusters in 4 regions and we're not hitting API rate limits (or if we are they are easily absorbed by the exponential back-off of the Reconcile() function.

I do agree that API limits are worth taking into account when designing this, I just don't agree that using CloudFormation or Terraform is the right way to fix it. To each their own though.

mhausenblas commented 5 years ago

FYI: the new repo is now available via aws/aws-service-operator-k8s and I'd like to invite everyone to have look at the design issues and contribute there, going forward.

mtparet commented 4 years ago

mhausenblas on Aug 30, 2019

10 months AWS said they were starting work on the next AWS Operator but there are still no versions available.

Is there someone still working on it ? Even, did someone already work on it and began to code it ?

ddseapy commented 4 years ago

see: https://github.com/jaypipes/aws-service-operator-k8s/commits/scaffolding

mtparet commented 4 years ago

thank you @ddseapy for pointing this, so there is hope!

jaypipes commented 4 years ago

There is hope indeed! Please see here: https://github.com/aws/aws-service-operator-k8s/tree/mvp

We're working on it, targeting some initial services in an MVP release at end of this quarter.

tabern commented 4 years ago

Happy to announce that AWS Controllers for Kubernetes is in Developer Preview with support for S3, Amazon SNS, Amazon SQS, DynamoDB, Amazon ECR, and AWS API Gateway.

You can learn more about the project on our project site.

Links

Next steps

We are closing this issue. Please comment or contribute directly on the [ACK GitHub project](). You can also see our service controller support roadmap on the project.

christopherhein commented 4 years ago

Congrats folks! 🎉 Awesome achievement!

/cc @tabern @jaypipes @mhausenblas

aws / containers-roadmap

AWS Controllers for Kubernetes (ACK) #456

Links

Next steps