Run computations in user accounts

mrocklin commented 4 years ago

Currently the deployment of beta.coiled.io launches resources in Coiled Inc.'s AWS account. When companies come to us asking for more security we say "Sure, we'll deploy the Coiled infrastructure in your account. Here is a tarball, a terraform script, and @necaris".

However, we could also launch resources in the user's account if they gave us sufficient permissions to do so. In principle the user could construct an IAM role that let Coiled create ECS tasks, log groups, and so on, and we would use that role on the user's behalf whenever they wanted to launch resources. We could do the same with Azure/ACI whenever that comes online.
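
To make the IAM role idea concrete, here is a rough sketch of the two policy documents such a role would need: a trust policy saying who may assume it, and a permissions policy saying what it may do. The account ID, external ID, and action list are placeholders and assumptions, not Coiled's actual requirements; a real deployment would likely need more (for example `iam:PassRole` for task roles) and tighter `Resource` scoping.

```python
import json

# Placeholder values for illustration only; none of these come from the thread.
COILED_ACCOUNT_ID = "111111111111"
EXTERNAL_ID = "example-external-id"


def trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Trust policy that lets one external AWS account assume the role.

    The ExternalId condition is the usual guard against another tenant
    tricking the service into using this role.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def permissions_policy() -> dict:
    """Illustrative and deliberately incomplete set of runtime permissions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RegisterTaskDefinition",
                    "ecs:RunTask",
                    "ecs:StopTask",
                    "ecs:DescribeTasks",
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",  # assumption: would be narrowed in practice
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy(COILED_ACCOUNT_ID, EXTERNAL_ID), indent=2))
    print(json.dumps(permissions_policy(), indent=2))
```

The user would create the role with these two documents and hand back only the role ARN, so no long-lived keys need to change hands.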

However, there are some open questions here and it would be good to get some feedback from advanced users:

Setup and Runtime Permissions: When setting up with groups previously we've needed to highlight specifically which permissions we need when setting up Coiled, and then which permissions we need for daily operation. I'm curious to know what these are, how they might change with an approach where we keep command-and-control in our account, and how onerous the daily permissions are for users.
Security and telemetry: Coiled currently keeps track of sensitive information like TLS credentials, GitHub credentials, or the IAM role itself in its database. These entries are encrypted, but owners of our AWS account (folks like myself and Rami) could probably misuse this access. On a smaller note, Dask clusters launched with Coiled send back performance telemetry. This data is far less sensitive, but it is still a constant flow of information out of a company.
Fine-grained role tracking: some groups have expressed a strong interest in tracking activities on a per-user basis using their existing cloud roles. Using a single IAM/AzureAD role to deploy resources in the user's account may not suffice. We may want to generate or use a role per user.
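
One possible way to get per-user tracking, sketched below, is to pick a role per Coiled user where one is registered and to put the username into the STS `RoleSessionName`, which shows up in the assumed-role ARN that AWS records for each call. The role ARNs, the `coiled-` prefix, and the mapping itself are hypothetical; the function only builds the arguments one would pass to `sts.assume_role`.

```python
import re
from typing import Dict, Optional

# Hypothetical per-user roles; anyone not listed falls back to a shared role.
SHARED_ROLE = "arn:aws:iam::123456789012:role/coiled-shared"
USER_ROLES: Dict[str, str] = {
    "alice": "arn:aws:iam::123456789012:role/coiled-alice",
}


def session_name_for(username: str) -> str:
    """Build a RoleSessionName (2-64 chars drawn from [A-Za-z0-9_+=,.@-])."""
    cleaned = re.sub(r"[^\w+=,.@-]", "-", username, flags=re.ASCII)
    return f"coiled-{cleaned}"[:64]


def assume_role_params(username: str, external_id: Optional[str] = None) -> dict:
    """Keyword arguments for an STS AssumeRole call made on this user's behalf."""
    params = {
        "RoleArn": USER_ROLES.get(username, SHARED_ROLE),
        "RoleSessionName": session_name_for(username),
        "DurationSeconds": 3600,
    }
    if external_id is not None:
        params["ExternalId"] = external_id
    return params
```

Even with a single shared role, the session name alone gives the account owner a per-user trail; separate roles add per-user permissions on top.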

mrocklin commented 4 years ago

I was chatting with folks from TileDB today. They said that their current approach is to ask users for AWS access keys and credentials and store them. Docs here

This seemed informal to me, but apparently their users are happy with the experience, and it's also fairly accessible for folks who are less technical.

@marcosmoyano it might be worth thinking about how we might ask for AWS credentials, store them securely, and then use them when creating aiobotocore sessions. This seems similar to the multi-region work.
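
A minimal sketch of the "use them when creating aiobotocore sessions" half, assuming the credentials have already been decrypted from wherever they are stored (encryption at rest is not shown). The `StoredAwsCredentials` class and `list_cluster_arns` helper are made-up names; `get_session` and `create_client` are the regular aiobotocore calls.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class StoredAwsCredentials:
    """User-supplied credentials; secrets are kept out of repr() and logs."""

    access_key_id: str
    secret_access_key: str = field(repr=False)
    session_token: Optional[str] = field(default=None, repr=False)
    region_name: str = "us-east-1"

    def client_kwargs(self) -> dict:
        """Keyword arguments accepted by create_client()."""
        kwargs = {
            "region_name": self.region_name,
            "aws_access_key_id": self.access_key_id,
            "aws_secret_access_key": self.secret_access_key,
        }
        if self.session_token:
            kwargs["aws_session_token"] = self.session_token
        return kwargs


async def list_cluster_arns(creds: StoredAwsCredentials) -> List[str]:
    """List ECS clusters in the user's account using their credentials."""
    # Imported here so the dataclass above is usable without aiobotocore.
    from aiobotocore.session import get_session

    session = get_session()
    async with session.create_client("ecs", **creds.client_kwargs()) as ecs:
        response = await ecs.list_clusters()
        return response["clusterArns"]
```

Passing the region through the same object is what makes this look like the multi-region work: both come down to building the client from per-account settings instead of process-wide defaults.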

necaris commented 4 years ago

One other thing to note:

For our telemetry and access control (i.e. proxying) to work as it does now, we'll need access into the (presumably private) networks where the clusters are running. This might be tricky if they're in another account.

mrocklin commented 4 years ago

For telemetry I think that the scheduler reaches out to Coiled, so if the scheduler has outbound network access I would guess that this is ok.

For proxying though yeah, that makes sense. I guess this creates the question, are companies comfortable having publicly accessible network addresses, given that they're secured through TLS.

marcosmoyano commented 4 years ago

> it might be worth thinking about how we might ask for AWS credentials, store them securely, and then use them when creating aiobotocore sessions. This seems similar to the multi-region work.

On the surface, this seems pretty straightforward. I do share Rami's concern about proxying.

necaris commented 4 years ago

> For telemetry I think that the scheduler reaches out to Coiled, so if the scheduler has outbound network access I would guess that this is ok.

Yep, basically fine, although if the scheduler <-> Coiled communication is happening over the public Internet rather than within our VPC we might want to do a little more with custom TLS certs to ensure that communication is secure.
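
As a rough illustration of "a little more with custom TLS certs", the sketch below builds a client-side TLS context for the scheduler's outbound connection using Python's standard `ssl` module. It assumes a private CA and an optional client certificate for mutual TLS; the file paths and the function name are hypothetical.

```python
import ssl
from typing import Optional


def scheduler_tls_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """TLS context for a scheduler dialing out over the public Internet.

    ca_file:   CA bundle used to verify the server (system CAs if omitted).
    cert_file: client certificate, presented if the server asks for one.
    key_file:  private key matching cert_file.
    """
    # SERVER_AUTH means "I am a client verifying a server": certificate
    # verification and hostname checking are both enabled by default.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

With a per-cluster client certificate, the receiving side can tell which cluster is reporting without relying on network location.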

> For proxying though yeah, that makes sense. I guess this creates the question, are companies comfortable having publicly accessible network addresses, given that they're secured through TLS.

Not sure what you mean about TLS? If we're not able to proxy, they'd no longer be served on our domain, so we'd need alternative arrangements for TLS. The other part is that Coiled's access control (e.g. notebooks / dashboards only visible by creator) wouldn't work without our proxying.

We could do things with DNS and / or install a proxy microservice into the user's account to handle the proxying / auth for us -- certainly not insurmountable -- just saying it's another aspect that needs thought.
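
For reference, the access-control rule that any replacement proxy would have to keep enforcing is small. The sketch below is a toy version of "only visible by creator", with made-up types and status handling; it is not how Coiled's proxy is implemented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Cluster:
    name: str
    creator: str        # Coiled username that launched the cluster
    dashboard_url: str  # private address inside the user's network


def may_view_dashboard(user: str, cluster: Cluster) -> bool:
    """Creator-only rule, as described for notebooks and dashboards."""
    return user == cluster.creator


def route(user: str, cluster: Cluster) -> Tuple[int, str]:
    """Return (HTTP status, upstream URL) for an authenticated request."""
    if not may_view_dashboard(user, cluster):
        return 403, ""
    return 200, cluster.dashboard_url
```

Wherever the proxy ends up running, it needs both the authenticated identity and the cluster's creator, so a proxy in the user's account would still have to call back to Coiled or receive signed claims.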

mrocklin commented 4 years ago

> If we're not able to proxy, they'd no longer be served on our domain, so we'd need alternative arrangements for TLS

I'm suggesting that we continue to proxy, but that we open up the network on the scheduler machine. Hopefully this is ok because communications are secured through TLS.

> install a proxy microservice into the user's account to handle the proxying / auth for us -- certainly not insurmountable -- just saying it's another aspect that needs thought.

Yeah, I need to learn more here to understand the options.

scott-coiled commented 4 years ago

I'm sorry if this is a naive question, but why wouldn't we support SAML and/or OAuth so our customers could allow any user that should have access to do so - and it would be up to them to set it up? They could specify who the Coiled "admin" is on their end that can set all this up and manage the telemetry, and then the regular "users" who can run jobs but not do anything else?

necaris commented 4 years ago

@scott-coiled forgive me if I'm misunderstanding you, but I think you're thinking of things the wrong way round. For us to act on behalf of a customer to run compute in their account it's not on us to grant permissions, but to support whichever method(s) the cloud platforms use?

scott-coiled commented 4 years ago

Yeah, I was originally thinking I was, but then I decided I wasn't. If we supported SAML/OAuth, then the customer could add Coiled as an application they access with their IdP and set the rules as to who can access it and who has what rights in that application. If the customer is using AWS and leverages Cognito, for example, then they would set up access to Coiled via Cognito. Customers might also be using third-party IdPs like Azure, ADFS, Ping, Okta, etc.

Does this make sense, or do I really have this backwards?
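
To make the admin/user split concrete, here is a small sketch of how group claims from an IdP could map onto Coiled-side roles after a SAML or OAuth sign-in. The group names, role names, and permission names are all invented for illustration.

```python
from typing import Iterable, Optional

# Hypothetical groups the customer defines in its IdP, and what they map to.
GROUP_TO_ROLE = {"coiled-admins": "admin", "coiled-users": "user"}
ROLE_RANK = {"user": 1, "admin": 2}
PERMISSIONS = {
    "admin": {"run_jobs", "manage_telemetry", "configure_cloud"},
    "user": {"run_jobs"},
}


def coiled_role(idp_groups: Iterable[str]) -> Optional[str]:
    """Pick the highest-ranked role granted by the user's IdP groups."""
    roles = [GROUP_TO_ROLE[g] for g in idp_groups if g in GROUP_TO_ROLE]
    if not roles:
        return None  # not assigned to Coiled in the IdP: no access
    return max(roles, key=lambda role: ROLE_RANK[role])


def can(role: Optional[str], action: str) -> bool:
    """Check whether a role allows an action."""
    return role is not None and action in PERMISSIONS.get(role, set())
```

The customer manages membership entirely in the IdP; Coiled only needs to agree on the claim names.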

necaris commented 4 years ago

@scott-coiled I still think you're approaching this wrong. Correct me if I'm wrong, but what you're suggesting is:

Customer currently uses (e.g.) Active Directory to track their users, assign them permissions, etc
Customer adds Coiled to Active Directory and assigns chosen users permissions to access Coiled
Coiled supports SAML / OAuth so that the permissions given using Active Directory are respected, and users can sign in to Coiled via Active Directory

Is that accurate?

scott-coiled commented 4 years ago

@necaris I'm not thinking AD - it doesn't support federation. I'm thinking ADFS, or Azure, or an IdP like Okta or Ping. Those systems usually "ingest" user data from AD or LDAP, and then the rules about what users can access are managed via the IdP.

necaris commented 4 years ago

@scott-coiled Sure, AD was just an example, but I'm glad to know I'd understood you correctly. Unfortunately that's a slightly orthogonal question to the one we're asking here.

Currently, assuming Coiled has a customer FooCorp with an employee Alice:

Alice signs in to Coiled with GitHub or Google auth and has an account under her name, and is also a member of an account for FooCorp
When she spins up a cluster under FooCorp's account, Coiled makes a request to AWS with Coiled's own AWS credentials, and makes an entry in Coiled's database associating the cluster to FooCorp
AWS runs the cluster within Coiled's AWS account and VPCs, and bills Coiled for the nodes to run the cluster. If we weren't in a free beta right now Coiled would turn around and charge the cluster's owner (i.e. FooCorp).
Resources and access limitations already existing within FooCorp's AWS account (e.g. S3 buckets that only certain users can see) need to be somehow granted to the cluster

My understanding of this issue is that we want:

When Alice spins up a cluster under FooCorp's account, Coiled makes a request to AWS under FooCorp's credentials
AWS runs the cluster within FooCorp's AWS account and VPCs, and bills FooCorp for the nodes.
Since the cluster is running within FooCorp's account, it can access the appropriate private resources from within that account

In this formulation, it seems to me you're asking about how Alice signs in to Coiled, and I'm concerned about how we get / manage / correctly deploy FooCorp's credentials, and how we can do so not just for AWS, but for other cloud providers (most notably Azure) as well.
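
The difference between the two flows above comes down to which credentials are chosen when a cluster is launched. A minimal sketch, assuming each Coiled account may have registered its own cloud credentials and everything else falls back to Coiled's; all names here are hypothetical, and `reference` stands for a pointer into a secret store, not the secret itself.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CloudCredentials:
    provider: str   # e.g. "aws" or "azure"
    owner: str      # whose cloud account runs, and is billed for, the cluster
    reference: str  # pointer to the stored secret or role, never the secret


# Current behaviour: everything runs in Coiled's own AWS account.
COILED_DEFAULT = CloudCredentials("aws", "coiled", "coiled-default")


def credentials_for(
    account: str, registered: Dict[str, CloudCredentials]
) -> CloudCredentials:
    """Use the account's own credentials if registered, else Coiled's."""
    return registered.get(account, COILED_DEFAULT)


if __name__ == "__main__":
    registered = {
        "foocorp": CloudCredentials("aws", "foocorp", "secrets/foocorp-role"),
    }
    # Alice launching under FooCorp runs in FooCorp's account.
    print(credentials_for("foocorp", registered))
    # Alice launching under her personal account still runs in Coiled's.
    print(credentials_for("alice", registered))
```

Keeping `provider` on the record is what leaves room for Azure later: the launch code can dispatch on it without changing how accounts are looked up.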

scott-coiled commented 4 years ago

@necaris - ok, I think we are actually thinking about the same problem. But just to be sure, let me add a few thoughts:

I see two different types of corporate use cases. In use Case #1, FooCorp is fine running everything in Coiled's infrastructure. In that case, we have total control over user Authentication, so we can support GitHub, Google, etc. All is good, nothing to see here.

In use Case #2 (which I think will be much more typical), FooCorp wants to manage access to Coiled using their own credentials and existing AuthN framework. In this scenario, it would be reasonable to ask FooCorp to add a new user (Coiled), that has certain permission/access. Short of that, we'd be required to use all the existing user credentials for the Data Scientists and Engineers at FooCorp (Alice++)

My thinking here is that we should support the "native" AuthN for the cloud platforms that we will support, as this may be in use in some cases. But larger companies will have an even more sophisticated AuthN strategy, and this is where I was saying supporting SAML/OAuth would be valuable. Basically, I envision us having documentation something like this - https://docs.jamf.com/jamf-connect/1.19.2/administrator-guide/Integrating_with_an_Identity_Provider.html.

Does this make sense?

shughes-uk commented 1 year ago

We now launch in user accounts!

coiled / feedback

Run computations in user accounts #67