cortexlabs / cortex

Production infrastructure for machine learning at scale
https://cortexlabs.com/
Apache License 2.0

Support aws_session_token for CLI auth #1134

Closed deliahu closed 3 years ago

deliahu commented 4 years ago

Description

In order to authenticate with the Cortex operator, the Cortex CLI should be able to use aws_session_token (currently only static credentials are supported).

Also, consider enabling auth via IAM role (e.g. inherited from Lambda, EC2)

sp-davidpichler commented 4 years ago

It seems this might be problematic as the credentials are hard coded into ~/.cortex/cli.yaml. Would it be better to support a reference to an aws profile in ~/.aws/credentials?

sp-davidpichler commented 4 years ago

Another thing: when you do cortex deploy, does the deployed docker image use your currently configured credentials or the credentials you set up the cluster with? If it's your current credentials this probably can't be supported.

deliahu commented 4 years ago

@sp-davidpichler the deployed docker image uses the credentials that you set up the cluster with (or cortex_aws_access_key_id if you specified that during cluster creation), so it should be possible to support aws_session_token for the CLI commands.

You are correct that it might be nice to check for credentials in ~/.aws and/or in your environment variables to support this, since updating ~/.cortex/cli.yaml (or running cortex env configure) each time the session token changes could be cumbersome. Does the session token in ~/.aws/credentials automatically get regenerated over time, and if so, how does that process work?
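To illustrate the idea of reading credentials from ~/.aws/credentials instead of copying them into ~/.cortex/cli.yaml, here is a minimal Python sketch. It is not Cortex's implementation: the function name `load_profile` and the sample values are hypothetical. Only the key names (`aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`) are the standard ones used in the AWS shared credentials file. Re-reading the file on each command would pick up a refreshed session token without reconfiguring the CLI.

```python
import configparser
from typing import Dict, Optional


def load_profile(credentials_text: str, profile: str = "default") -> Dict[str, Optional[str]]:
    """Return the credentials stored under `profile` in an AWS-style INI file."""
    # Interpolation is disabled so that '%' characters in secrets are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    section = parser[profile]
    return {
        "access_key_id": section.get("aws_access_key_id"),
        "secret_access_key": section.get("aws_secret_access_key"),
        # None for static credentials, set for temporary ones.
        "session_token": section.get("aws_session_token"),
    }


# Hypothetical file contents: one static profile and one temporary profile.
SAMPLE = """
[default]
aws_access_key_id = AKIAEXAMPLE
aws_secret_access_key = static-secret

[dev]
aws_access_key_id = ASIAEXAMPLE
aws_secret_access_key = temporary-secret
aws_session_token = example-session-token
"""

if __name__ == "__main__":
    print(load_profile(SAMPLE, "dev")["session_token"])
```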

sp-davidpichler commented 4 years ago

We are using https://github.com/sportradar/aws-azure-login to generate a token with Azure AD. You run aws-azure-login --profile <profile_name>, which generates a token that is valid for 10 hours. Having to run cortex configure as well would not be ideal.

deliahu commented 4 years ago

I see, that makes sense, thanks for sharing that

jackmpcollins commented 3 years ago

We would like to use cortex but my company uses AWS IAM roles for employees so this issue is blocking. I think the best solution would be to clearly distinguish CLI AWS credentials from cluster AWS credentials, particularly the way in which these are obtained.

Proposal:

In future:

I am happy to take a shot at this over the next week or so if this solution seems reasonable.

Related issue: https://github.com/cortexlabs/cortex/issues/741

deliahu commented 3 years ago

@jackmpcollins Yes you make a lot of great points; we definitely want to improve this experience, and make it possible for you to use Cortex!

In fact, we were discussing this morning how we could improve this, and we came up with a proposal that I'd like to run by you to see if it would work for you. We think it would make sense to go one step further and decouple the CLI auth from AWS credentials entirely, since they really aren't related. This approach has a few advantages: it makes clear that the CLI auth is not related to the cluster credentials, and it would work the same regardless of cloud provider (we support AWS/GCP now and would like to add Azure soon).

It could look something like this: cortex cluster up creates a cluster password, and automatically configures your CLI to use it. cortex cluster info shows the password in case you need to configure the CLI on a new machine or you lost your password (of course, you can only run cortex cluster info if you have the AWS credentials). cortex env configure will prompt for your cluster password instead of AWS credentials. We could also add a new command like cortex cluster reset-password or cortex cluster set-password PASSWORD which would update the password.

Even with this change, there would still be two relevant sets of AWS credentials: the credentials used to spin up the cluster (which require significant access), and the credentials that are persisted in the cluster (which require much less access). But it will be easier for users to understand this compared to the current approach, which involves three sets of credentials.

Does this make sense to you?

jackmpcollins commented 3 years ago

We would prefer to use AWS credentials with the CLI. This makes it easy to limit permissions per user/role and grant/revoke access using our existing methods (terraform). We may want some users/services to have permission only to list Cortex APIs, some to additionally be able to deploy/delete, and some to have full ability to run cluster up/down. Using IAM policies for the CLI users allows this granularity.

The most straightforward mental model for us would be if the Cortex CLI functioned similarly to kubectl. We use saml2aws to enforce MFA and retrieve temporary AWS credentials. These are stored in ~/.aws/credentials. When setting up a new kubectl context we can optionally associate it with an AWS profile from ~/.aws/credentials. This shows as an argument to the aws command in the ~/.kube/config file:

users:
- name: arn:aws:eks:us-west-2:000000000000:cluster/cortex-dev
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      args:
      - --region
      - us-west-2
      - eks
      - get-token
      - --cluster-name
      - cortex-dev
      - --profile
      - dev
      command: aws
      env: null

Our usage of Kubectl then looks like:

Alternatively if the AWS profile is not added to the kubeconfig, we can run saml2aws exec -a dev -- kubectl get pods which executes the command with credentials as environment variables. Because kubectl uses the AWS credentials chain these environment variables take precedence over the "default" AWS profile.

The kubectl commands return a permission error when the AWS user/role does not have the required permissions.

We would expect the cortex usage to look similar:

The cortex commands return a permission error if the AWS credentials provided do not have the permissions required for the command. So rather than three sets of credentials (Admin, cluster, and CLI-user), it is just two (CLI-user, cluster).

sp-davidpichler commented 3 years ago

I think it's also a good idea to keep cluster authentication tied to AWS IAM authentication. Our IAM credentials are directly linked to our company SSO. I think the best option would be for the cortex cli to respect more of the AWS auth related parameters like AWS_PROFILE and AWS_SESSION_TOKEN. Or make a move to only support AWS profiles (minus the cluster credentials).

I do think however the move to three sets of logical credentials makes sense:

  1. One set of credentials to provision the cluster
  2. One set of credentials to run the cluster with (which might also be handled by node instance roles which are created by eksctl)
  3. One set of credentials to get, list, and deploy APIs.

A workflow could look something like

jackmpcollins commented 3 years ago

I don't see the need to distinguish between credentials for provisioning the cluster and credentials for using the CLI. I think it's most simple from the user's point of view to have just two sets of credentials:

  1. Credentials for the user currently using the CLI
  2. Credentials the cluster uses to manage itself

Credentials for the current user should be obtained following the AWS order of precedence so that Cortex CLI works the same way as kubectl and AWS CLI. To run cortex cluster up the current user must be authenticated as an IAM user/role with AdministratorAccess. If they want to use a less privileged AWS profile after spinning up the cluster they can configure a new cortex env with this profile for themselves. Non-admin users will be able to use (a subset of) cortex commands against the cluster created by an admin user.
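A minimal Python sketch of such an order of precedence is shown here. It is deliberately simplified: the real AWS chain has more sources (for example instance roles), and the function name `resolve_credentials`, the `source` field, and the dictionary layout are hypothetical. The environment variable names are the standard AWS ones. The sketch assumes an explicitly named profile wins, then exported environment credentials, then `AWS_PROFILE`, then the "default" profile.

```python
from typing import Dict, Mapping, Optional

Credentials = Dict[str, Optional[str]]


def resolve_credentials(
    env: Mapping[str, str],
    profiles: Mapping[str, Credentials],
    explicit_profile: Optional[str] = None,
) -> Credentials:
    """Pick credentials using a simplified AWS-style order of precedence."""
    # 1. A profile named explicitly (e.g. via a --profile flag) wins outright.
    if explicit_profile is not None:
        return dict(profiles[explicit_profile], source=f"profile:{explicit_profile}")

    # 2. Credentials exported as environment variables come next.
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return {
            "access_key_id": env["AWS_ACCESS_KEY_ID"],
            "secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
            "session_token": env.get("AWS_SESSION_TOKEN"),
            "source": "environment",
        }

    # 3. Otherwise fall back to AWS_PROFILE, then to the "default" profile.
    name = env.get("AWS_PROFILE", "default")
    return dict(profiles[name], source=f"profile:{name}")
```

With this shape, a command wrapped by a tool that exports temporary credentials would use them automatically, while a cortex env tied to a named profile would keep using that profile.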

Cluster credentials should be manually specified when running cortex cluster up or automatically created as node instance roles or similar as you suggest @sp-davidpichler 👍 . I don't agree that cluster credentials should be stored in ~/.aws/credentials because these aren't for human use and are only provided once - when spinning up the cluster. I think ideally the cluster credentials/user/role would be created and managed by cortex rather than provided by the user. This way a user only has to think about their own AWS credentials.

deliahu commented 3 years ago

@jackmpcollins @sp-davidpichler you both make great points. We're still formulating our thoughts around this, so your feedback is timely and much appreciated.

I think it's most simple from the user's point of view to have just two sets of credentials

I think this is a good generalization of the 3 sets of credentials that @sp-davidpichler and I were mentioning, which makes sense when pulling the "current user" credentials from the AWS order of precedence (which we could let you override by passing in the profile name). In practice, there will likely be the three sets of credentials in terms of actual permissions, but #1 and #3 would behave the same way in terms of how those credentials are loaded, so there would be only two mechanisms for loading credentials.

I think ideally the cluster credentials/user/role would be created and managed by cortex rather than provided by the user.

This is an interesting idea. I could still see a use case for users specifying which credentials to use in the cluster if they want to reduce access, since if Cortex created an IAM user for the cluster, it would probably have to take more permissions than would be necessary (for example s3:* instead of just the bucket that contains the models). So maybe cortex would create/manage it automatically, but there would be a way to override the permissions granted.

We would prefer to use AWS credentials with the CLI. This makes it easy to limit permissions per user/role and grant/revoke access using our existing methods (terraform). We may want some users/services to have permission only to list Cortex APIs, some to additionally be able to deploy/delete, and some to have full ability to run cluster up/down. Using IAM policies for the CLI users allows this granularity.

Access control is a great idea, and I've created https://github.com/cortexlabs/cortex/issues/1748 to track it. Currently this isn't supported; if you have valid AWS credentials, you have access to all API-related CLI commands (e.g. get, deploy, delete, but not the cluster commands). My question is: does it make more sense to implement Cortex's RBAC via IAM roles/permissions, or within Cortex itself? If it's fully managed by Cortex, there would be commands like cortex cluster create-user, cortex cluster create-role, cortex cluster assign-role, etc. Do you know if it's possible to do this fully natively in IAM, for example by making custom actions/resources that can be assigned to IAM users/roles, like cortex:deploy, cortex:get, etc? I just did a quick test, and although it gave me a warning, I was able to create an IAM policy with made-up actions, but I wasn't able to use made-up resources. If it's not possible to fully rely on IAM, then we might still need to create cortex commands to grant access to users/roles, like cortex cluster configure-access-policy which would take in the IAM user/role and which Cortex permissions to grant it. If that is the only way to use IAM, it seems like having all of the RBAC live inside of Cortex could be more straightforward. I'd love to hear your thoughts on this.

AWS profiles

Regarding from where to pull the AWS credentials: I agree that using AWS profile names, or using the default chain if the profile name is not specified, makes sense. Let's discuss the details after deciding how to best handle CLI-to-cluster auth, since I think the details will be affected by that decision.

I'd also be happy to jump on a call if at some point you feel it'd be easier to discuss this live; feel free to email me at david@cortex.dev if you want to find a time.

sp-davidpichler commented 3 years ago

I think cortex keeping a list of IAM users/roles with access to different functions would be the best case for us. This is pretty much how access to EKS clusters deployed with eksctl already works. The user/role that deployed the cluster is initially the only one with access via kubectl, but you can add users/roles with a command like

eksctl create iamidentitymapping -c ${cluster_name} -r ${region} -p ${profile_name} \
    --arn "arn:aws:iam::${account_id}:role/${role_name}" \
    --username ${account_name} \
    --group ${group_name}

So there could be a predefined set of cortex groups like {admin, developer, read_only} that users/roles could be added to for simple use cases.

deliahu commented 3 years ago

I see, yeah that makes sense.

What would you say is the advantage of the IAM-based approach when compared to cortex accounts? Is it mostly the convenience of not having to worry about creating cortex users / saving cortex passwords (versus creating/reusing IAM users, creating identity mappings, and using IAM-based credentials for login)? Or is there another advantage? And how strong of a preference do you have, i.e. would only supporting cortex-based accounts and not supporting IAM-based auth be a deal breaker, or would you be able to use cortex-based accounts if necessary?

If I understand the proposals correctly, at a high level, they would be something like this:

IAM-based

# first, create an IAM user that you want to grant access to, or identify an existing IAM user

cortex cluster set-role --arn <arn> --role [admin|developer|read_only]
cortex cluster revoke-access --arn <arn>

With this approach, CLI commands would use the default AWS credentials chain, or the user could pass in the name of an AWS profile. Optionally, cortex env configure could ask you which AWS profile you want to associate with that cluster, and remember your choice in the future (and not use the default chain).
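As a rough illustration of the bookkeeping behind the IAM-based proposal, here is a minimal Python sketch of a mapping from IAM ARNs to Cortex roles. Everything in it is an assumption made for illustration: the class `IdentityMap`, its method names (which mirror the proposed set-role and revoke-access commands), and especially the commands granted to each role, since the thread names the three roles but does not define their permissions.

```python
from typing import Dict, FrozenSet

# Hypothetical role-to-command mapping; the thread only names the roles.
ROLE_COMMANDS: Dict[str, FrozenSet[str]] = {
    "read_only": frozenset({"get"}),
    "developer": frozenset({"get", "deploy", "delete"}),
    "admin": frozenset({"get", "deploy", "delete", "cluster up", "cluster down"}),
}


class IdentityMap:
    """In-memory mapping from IAM user/role ARNs to Cortex roles."""

    def __init__(self) -> None:
        self._roles: Dict[str, str] = {}

    def set_role(self, arn: str, role: str) -> None:
        """Grant `role` to `arn`, replacing any role it had before."""
        if role not in ROLE_COMMANDS:
            raise ValueError(f"unknown role: {role}")
        self._roles[arn] = role

    def revoke_access(self, arn: str) -> None:
        """Remove `arn` from the mapping; a no-op if it was never added."""
        self._roles.pop(arn, None)

    def is_allowed(self, arn: str, command: str) -> bool:
        """Return True only if `arn` has a role that permits `command`."""
        role = self._roles.get(arn)
        return role is not None and command in ROLE_COMMANDS[role]
```

An operator built this way would deny by default: an ARN with valid AWS credentials but no mapping entry would not be allowed to run any command.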

Cortex-based

# this command would print or download a password
cortex cluster create-account --username <username> --role [admin|developer|read_only]

cortex cluster delete-account --username <username>

With this approach, cortex env configure would prompt for the cortex cluster credentials, and CLI commands would use those credentials. There would be no IAM involvement.
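For comparison, the Cortex-based alternative could be sketched as follows. This is a hypothetical illustration in Python, not a proposed design: the class `AccountStore`, the generated password length, and the choice of PBKDF2 with this iteration count are all assumptions. It shows an account store that hands out a generated password once and keeps only a salted hash.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional, Tuple

ROLES = {"admin", "developer", "read_only"}
_ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, _ITERATIONS)


class AccountStore:
    """Keeps (role, salt, password hash) per username; never the password itself."""

    def __init__(self) -> None:
        self._accounts: Dict[str, Tuple[str, bytes, bytes]] = {}

    def create_account(self, username: str, role: str) -> str:
        """Create the account and return its generated password exactly once."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        password = secrets.token_urlsafe(24)
        salt = secrets.token_bytes(16)
        self._accounts[username] = (role, salt, _hash(password, salt))
        return password

    def delete_account(self, username: str) -> None:
        self._accounts.pop(username, None)

    def authenticate(self, username: str, password: str) -> Optional[str]:
        """Return the account's role if the password matches, else None."""
        record = self._accounts.get(username)
        if record is None:
            return None
        role, salt, expected = record
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(_hash(password, salt), expected):
            return role
        return None
```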

jackmpcollins commented 3 years ago

What would you say is the advantage of the IAM-based approach when compared to cortex accounts?

We have a strong preference for having this be IAM-based. We would be granting access to the cortex CLI to IAM roles rather than individual users, with users temporarily assuming these roles using our SSO.

Currently this isn't supported; if you have valid AWS credentials, you have access to all API-related CLI commands (e.g. get, deploy, delete, but not the cluster commands).

Ah I misunderstood this. I thought the user's cortex commands could be restricted by their IAM policies or kubernetes RBAC, but if I'm understanding correctly now the cortex operator just checks that the user's AWS creds are valid for the same AWS account and then its own permissions are used to actually execute the command.

Do you know if it's possible to do this fully natively in IAM, for example by making custom actions/resources that can be assigned to IAM users/roles, like cortex:deploy, cortex:get, etc?

I'm not familiar enough with IAM to answer this but another question in the same vein: would it be possible (and reasonable) for the cortex cluster set-role --arn <arn> --role [admin|developer|read_only] command to just be a wrapper over eksctl create iamidentitymapping or kubernetes RBAC? We would like to have permissions declared and managed in terraform code and it is possible to do this for IAM roles/policies and kubernetes RBAC. Our rough plan is to use cortex cluster up to create the cluster, then use terraform to manage access and permissions.

deliahu commented 3 years ago

if I'm understanding correctly now the cortex operator just checks that the user's AWS creds are valid for the same AWS account and then its own permissions are used to actually execute the command.

Yes, that is correct
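The account check described above can be sketched in a few lines of Python. The thread does not say how the operator obtains the caller's identity, so this sketch assumes it already has the caller's ARN (for instance, the kind of ARN that AWS STS reports for a set of credentials). The function names `account_id_from_arn` and `same_account` are hypothetical.

```python
def account_id_from_arn(arn: str) -> str:
    """Extract the account ID from arn:partition:service:region:account-id:resource."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[4]


def same_account(caller_arn: str, cluster_account_id: str) -> bool:
    """True if the caller belongs to the AWS account that owns the cluster."""
    return account_id_from_arn(caller_arn) == cluster_account_id


if __name__ == "__main__":
    caller = "arn:aws:sts::000000000000:assumed-role/dev/alice"
    print(same_account(caller, "000000000000"))
```

Note that a check like this only answers "is this caller in the right account?"; it says nothing about what the caller may do, which is why the per-role access control discussed in this thread would need a separate mechanism.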

Regarding having cortex cluster set-role be a wrapper around eksctl create iamidentitymapping: That is an interesting idea, however it could limit our ability to support fine-grained access control in the future. For example, if cortex managed the RBAC rather than relying on Kubernetes, we could more easily support granting access to specific APIs (e.g. specifying that a user can update one API but not others). This might be possible to do natively with Kubernetes RBAC via Custom Resources, but it would require us to re-architect some of our backend (right now we use native Kubernetes resources like services, deployments, etc). Would it be possible to automate the call to cortex cluster set-role in your terraform, or are you asking this question because it is only possible (or is easier) for you to automate eksctl create iamidentitymapping plus kubernetes RBAC?

jackmpcollins commented 3 years ago

I don't foresee any problems with us using a cortex cluster set-role command. I was mostly trying to get an understanding of how RBAC might be implemented and how consistent/integrated this could be with our existing permissions. Thanks for the explanation.

jackmpcollins commented 3 years ago

@deliahu would you be able to give an estimate for when the original issue here of supporting aws_session_token for CLI auth might be resolved please?

deliahu commented 3 years ago

@jackmpcollins We're planning to have a team discussion about it tomorrow, and I will keep you posted on our prioritization for it. By when would you need it, and do you have an interim workaround, or is it a full blocker?

jackmpcollins commented 3 years ago

Great, thanks. We would like to be able to use AWS IAM roles / session_token for CLI auth by the end of January. We can use user accounts for most of the setup/development but will require roles to work before we allow others to use cortex.

deliahu commented 3 years ago

@jackmpcollins Today we decided to spend some time researching/designing this in the next week or two, so I'd expect that we could start working on it in time to be released in v0.28 (we release every two weeks on Tuesday, so that would be Feb 2nd). Would that timing work for you?

jackmpcollins commented 3 years ago

That would be great! Thank you!

vishalbollu commented 3 years ago

This conversation has a lot of useful information that extends beyond the scope of the initial ticket. The ideas discussed in this ticket regarding authorization have been summarized and moved to https://github.com/cortexlabs/cortex/issues/1748. Feel free to add any additional context there.