argoflow / argoflow-aws

Argoflow-AWS has been superseded by deployKF
GNU Affero General Public License v3.0

Argoflow-AWS


⚠️ Argoflow-AWS has been superseded by deployKF ⚠️

deployKF makes it easy to build reliable ML Platforms on Kubernetes and supports more than just Kubeflow!

deployKF supports all Kubernetes distributions and has native integrations with AWS.



Original README

This project offers a Kubeflow distribution that has the following characteristics:

AWS Integrations

This distribution assumes that you will be making use of the following AWS services:

In the future we may develop overlays that would make some of these services optional, but for the current release if you wish to take them out this needs to be done after forking the repo.

AWS IAM Roles for Service Accounts

Below you will find all of the IAM Policies that need to be attached to the IRSA roles. Before looking at the policies though, please take note of the fact that IRSA works via setting up a Trust relationship to a specific ServiceAccount in a specific Namespace. If you find that an IAM role is not being correctly assumed, it probably means that you are attaching it to a ServiceAccount that hasn't explicitly been authorized to do so.

Trust Relationships

Let's take the external-dns service as an example. The ServiceAccount for this application is defined here, is named external-dns and is rolled out in the kube-system Namespace. To allow this ServiceAccount to assume an IAM Role, we have to set a trust relationship that looks as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890:sub": "system:serviceaccount:kube-system:external-dns"
        }
      }
    }
  ]
}

For every IRSA role you set up, you will need a trust relationship like the one above. Substitute your actual OIDC provider URL, and replace "kube-system" and "external-dns" in system:serviceaccount:kube-system:external-dns with the appropriate Namespace and ServiceAccount names respectively.
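
As a minimal sketch, the substitution described above can be written as a small Python helper that builds the trust relationship for one Namespace and one ServiceAccount. The function name irsa_trust_policy and its parameters are illustrative assumptions and are not part of this repository. The account ID and OIDC provider are the example values used above.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Build an IRSA trust relationship for one ServiceAccount.

    oidc_provider is the provider URL without the https:// prefix,
    for example "oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Only this exact ServiceAccount in this exact Namespace may assume the role.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}"
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="123456789012",
        oidc_provider="oidc.eks.eu-central-1.amazonaws.com/id/SOMEUNIQUEID1234567890",
        namespace="kube-system",
        service_account="external-dns",
    )
    print(json.dumps(policy, indent=2))
```

Calling the helper once per role with a different Namespace and ServiceAccount yields one trust relationship document per IRSA role.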

Policies

Further down in this guide we explain how to initialise this repository. For now, just take note that we use placeholder values such as <<__role_arn.external_dns__>> that will be replaced by the actual ARNs of the roles you wish to use. Below is a listing of all of the IRSA roles in use in this repository, along with links to JSON files with example policies. If you search the whole "distribution" folder, you will find exactly where these placeholders are used.


aws-load-balancer-controller

Needs policies that allow it to provision an NLB in specific subnets.


cluster-autoscaler

Needs policies that allow it to automatically scale EC2 instances up/down.


external-dns

Needs policies that allow it to automatically create record sets in Route53.


certificate-manager

Needs policies that allow it to automatically create entries in Route53 in order to allow for DNS-01 solving.


external-secrets

The external-secrets application is a middleman that will create ExternalSecret custom resources in specific namespaces. It can be configured in two ways.

Option 1: Allow the external-secrets application broad authority to read and write AWS secrets

Option 2: Allow the external-secrets application to assume roles that have more narrowly defined permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/my-cluster_kube-system_external-secrets"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In addition, we need to grant each role limited access to secrets. We have chosen an approach of limiting access to secrets by namespace, but it is possible to make this more granular if desired.

ExternalSecret for the argocd namespace

ExternalSecret for the kubeflow namespace

ExternalSecret for the mlflow namespace

ExternalSecret for the auth namespace

ExternalSecret for the istio-system namespace

ExternalSecret for the monitoring namespace

Backend types

There are two supported AWS backend types:

Unfortunately, at the moment it is not possible to use IRSA in conjunction with Kubeflow Pipelines, which currently uses both the Minio Go and JavaScript clients. Both clients need additional work to enable IRSA. Please see this tracking issue: https://github.com/kubeflow/pipelines/issues/3405

For now, we use an IAM User to write Pipeline artifacts to S3. The user's credentials are fetched from AWS Secrets Manager using an ExternalSecret. The relevant details for the IAM User are as follows:


Deployment

This repository contains Kustomize manifests that point to the upstream manifest of each Kubeflow component and provides an easy way for people to change their deployment according to their needs. ArgoCD application manifests for each component are used to deploy Kubeflow. The intended usage is for people to fork this repository, make their desired kustomizations, run a script to change the ArgoCD application specs to point to their fork of this repository, and finally apply a master ArgoCD application that will deploy all other applications.

Prerequisites

Mandatory:

Optional (if using setup_credentials.sh to generate initial credentials as sealed secrets):

The setup.conf file and setup_repo.sh script

This repository uses a very simple initialisation script, ./setup_repo.sh, that takes a config file such as the example one, ./examples/setup.conf, and iterates over all lines therein. A single line would, for example, look as follows:

<<__role_arn.cluster_autoscaler__>>=arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-cluster-autoscaler

The init script will look for all occurrences in the ./distribution folder of the placeholder <<__role_arn.cluster_autoscaler__>> and will replace it with the value arn:aws:iam::123456789012:role/my-cluster_kube-system_aws-cluster-autoscaler. Please note that comments (//, #), quotation marks (", ') and unnecessary line breaks should be avoided in the config file.

You may add any additional placeholder/value pairs you want. The naming convention <<__...__>> has no functional purpose other than to aid readability and minimise the risk of a "find-and-replace" being performed on a value that was not meant as a placeholder.
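
The find-and-replace behaviour described above can be sketched in Python as follows. This is only an illustration of the technique, not the actual ./setup_repo.sh script. The function names parse_config, apply_placeholders and render_tree are hypothetical, and the sketch assumes each config line is split on its first "=" character.

```python
from pathlib import Path


def parse_config(text: str) -> dict[str, str]:
    """Parse 'placeholder=value' lines, splitting on the first '=' only."""
    pairs: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        placeholder, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"expected placeholder=value, got: {line!r}")
        pairs[placeholder] = value
    return pairs


def apply_placeholders(content: str, pairs: dict[str, str]) -> str:
    """Replace every occurrence of each placeholder with its value."""
    for placeholder, value in pairs.items():
        content = content.replace(placeholder, value)
    return content


def render_tree(root: Path, pairs: dict[str, str]) -> int:
    """Rewrite all text files under root in place; return how many files changed."""
    changed = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            original = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # leave binary files untouched
        updated = apply_placeholders(original, pairs)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Under these assumptions, render_tree(Path("distribution"), parse_config(Path("examples/setup.conf").read_text())) would apply every pair from the example config to the distribution folder. Because the replacement is a plain string match, a distinctive convention such as <<__...__>> keeps ordinary values from being replaced by accident.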

The "setup_credentials.sh" script

Finally, if you wish, you can use the "setup_credentials.sh" script to generate SealedSecrets that will be used for access to "admin" applications, such as the ArgoCD dashboard (in the future), Dex, Keycloak, the Kubeflow admin user etc. This script will generate various random credentials and create a "sealed" representation that is safe to declare in your Git repository.

Run the following commands to install the kubeseal CLI on Linux:

wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.16.0/kubeseal-linux-amd64 -O kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal

On macOS you can use Homebrew to install the kubeseal CLI:

brew install kubeseal

Next, ensure passlib is installed:

pip install passlib

Deploy the Sealed Secrets controller to the cluster:

kubectl apply -f distribution/argocd-applications/sealed-secrets.yaml

Finally, the script can be run with:

./setup_credentials.sh --email test@test.com --username youruser --firstname Yourname --lastname Yoursurname --password yourpassword

You may leave out any of the input parameters. In that case, a default value (or a generated value in the case of passwords) will be used. Alternatively, environment variables can be used instead of input parameters.
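
The fallback behaviour described above can be sketched in Python as follows. This is not the logic of setup_credentials.sh itself. The precedence shown (input parameter, then environment variable, then default, then generated value) is an assumption, and the names resolve, generate_password and the environment variable in the usage note are hypothetical.

```python
import os
import secrets
import string


def generate_password(length: int = 32) -> str:
    """Generate a random alphanumeric password from a cryptographically secure source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def resolve(cli_value, env_name, default=None, environ=None):
    """Pick a value: input parameter, else environment variable, else default, else generated.

    The order of precedence is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    if default is not None:
        return default
    return generate_password()
```

For example, resolve(None, "ADMIN_PASSWORD") returns the value of the hypothetical ADMIN_PASSWORD environment variable if it is set, and a freshly generated password otherwise.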

Deployment steps

To initialise your repository, do the following:

Start up external-secrets:

kustomize build distribution/external-secrets/ | kubectl apply -f -

Start up argocd:

Finally, roll out Kubeflow with:

kubectl apply -f distribution/kubeflow.yaml

If you wish, you may also set up ArgoCD to manage itself, as follows:

Customizing the Jupyter Web App

To customize the list of images presented in the Jupyter Web App and other related settings, such as allowing custom images, edit the spawner_ui_config.yaml file.

Bonus: Extending the Volumes Web App with a File Browser

A large problem for many people is how to easily upload data to, or download data from, the PVCs mounted as workspace volumes for their Notebook Servers. To make this easier, a simple PVCViewer Controller was created (a slightly modified version of the tensorboard-controller). This feature was not ready in time for 1.3, so I am documenting it here only as an experimental feature, as I believe many people would like to have this functionality. The images are pulled from my personal Docker Hub profile, but I can provide instructions for people who would like to build the images themselves. It is also important to note that the PVC Viewer will work with ReadWriteOnce PVCs, even when they are mounted to an active Notebook Server.

Here is an example of the PVC Viewer in action:

PVCViewer in action

To use the PVCViewer Controller, it must be deployed along with an updated version of the Volumes Web App. To do so, deploy experimental-pvcviewer-controller.yaml and experimental-volumes-web-app.yaml instead of the regular Volumes Web App. If you are deploying Kubeflow with the kubeflow.yaml file, you can edit the root kustomization.yaml and comment out the regular Volumes Web App and uncomment the PVCViewer Controller and Experimental Volumes Web App.

Updating the deployment

By default, all the ArgoCD application specs included here are set up to automatically sync with the specified repoURL. If you would like to change something about your deployment, simply make the change, commit it and push it to your fork of this repo. ArgoCD will automatically detect the changes and update the necessary resources in your cluster.

Accessing the ArgoCD UI

By default the ArgoCD UI is rolled out behind a ClusterIP. This can be accessed for development purposes with port forwarding, for example:

kubectl port-forward svc/argocd-server -n argocd 8888:80

The UI will now be accessible at localhost:8888 and can be accessed with the initial admin password. The password is stored in a secret and can be read as follows:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
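
The base64 -d step is needed because Kubernetes stores the values under a Secret's data field base64-encoded. As a minimal sketch, the same decoding can be done in Python on the JSON form of the Secret (for example the output of kubectl -n argocd get secret argocd-initial-admin-secret -o json). The function name decode_secret_field is a hypothetical helper, not part of this repository.

```python
import base64
import json


def decode_secret_field(secret_json: str, key: str) -> str:
    """Return the decoded value of one key from a Kubernetes Secret given as JSON text."""
    secret = json.loads(secret_json)
    encoded = secret["data"][key]  # values under .data are base64-encoded
    return base64.b64decode(encoded).decode("utf-8")
```

Passing the JSON text and the key "password" returns the same string as the kubectl and base64 pipeline above.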

If you wish to update the password, this can be done with the argocd CLI, using the following commands:

argocd login localhost:8888
argocd account update-password

Contributing

Before contributing, please install pre-commit and initialise .pre-commit-config.yaml by running the following from the repo's root directory:

pre-commit install

Please feel free to add features by forking this repo, developing and testing your feature, and merging back to master via a Pull Request. We are currently still a small community, but feel free to also report bugs or open feature requests on the issue board!