DACCS-Climate / DACCS-executive-committee

Activities of the Data Analytics for Canadian Climate Services (DACCS) executive committee

:raising_hand: allow nodes to authorize each other's users within the network #8

Open mishaschwartz opened 1 year ago

mishaschwartz commented 1 year ago

Topic category

Select which category your topic relates to:

Topic summary

Problem

Users will typically have an account on one node in the network. If they want to access resources from another node in the network (that aren't publicly available) they currently need to create a second account on the other node and log in there as well.

This creates an additional burden on the user:

This creates an additional burden on node administrators:

Lastly, it leads to duplication of user accounts across the network, which can cause minor issues down the line (e.g. if we decide to have a mechanism to email all network users about technical issues, some users will receive duplicate emails).

We should implement a system where nodes can be "remote authenticators" for each other within the network and node administrators can authorize access to resources for users registered elsewhere on the network.

Additional information

Proposed Solution

Allow nodes to authenticate users for each other using tokens, and provide authorization options that can apply to members of a specific node or to the network in general.

Proposed cross-network authn/z process:

Proposed new authorization groups (in magpie):

Process of authorizing a user on another node:

Required changes:

Most of the above changes should probably be handled by magpie/twitcher but we should discuss the best options.
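To make the token idea more concrete, here is a minimal sketch of what a cross-node call could look like from a user's perspective. The endpoint path and token payload are placeholders invented for illustration; nothing here is an existing Magpie/Twitcher API.

```python
import requests

# Hypothetical flow: a user authenticated on their home node (Node A) obtains a
# short-lived token there, then presents it to another node (Node B), which
# validates the token against Node A before applying its own local authorization.
NODE_A = "https://node-a.example.org"   # home node, placeholder URL
NODE_B = "https://node-b.example.org"   # remote node, placeholder URL

def request_remote_resource(session_cookies: dict, resource_path: str) -> requests.Response:
    # 1. Ask the home node to issue a network token for the logged-in user.
    #    "/magpie/network/token" is a placeholder route, not an existing one.
    token_resp = requests.post(f"{NODE_A}/magpie/network/token",
                               cookies=session_cookies, timeout=10)
    token_resp.raise_for_status()
    token = token_resp.json()["token"]

    # 2. Call the remote node with the token; it would verify the token with
    #    Node A, map it to a local (pseudo) user, and resolve permissions locally.
    return requests.get(f"{NODE_B}{resource_path}",
                        headers={"Authorization": f"Bearer {token}"},
                        timeout=30)
```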

This topic may be of special interest to @fmigneault @huard. Please feel free to tag others who may want to discuss this as well

fmigneault commented 1 year ago

It would be possible to have a dedicated federation node with Magpie and Twitcher running with the various services to support. Then, a node that needs to validate a user's access to its services and their resources simply needs to use the nginx auth_request directive to send the relevant request details toward that federated node using the /twitcher/ows/verify endpoint. If an OK is received, nginx will automatically resume the local node's request. Otherwise, it blocks it with 403 Forbidden.

This does not need any modification to the current Magpie/Twitcher code. However, this also implies that all services shared across all the nodes must be defined on that federated node. Users should also be created on that federated node. Given that, any authentication endpoint on the local node should be redirected to the federated node as well.
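As a rough sketch of what the auth_request delegation amounts to (the federated-node URL and the exact headers/cookies that must be forwarded are assumptions), the subrequest boils down to something like:

```python
import requests

FEDERATED_NODE = "https://federation.example.org"  # placeholder URL

def access_allowed(cookies: dict, headers: dict) -> bool:
    """Forward the caller's credentials to the federated node and allow the
    request only if the verification endpoint answers with a 2xx status."""
    resp = requests.get(
        f"{FEDERATED_NODE}/twitcher/ows/verify",
        cookies=cookies,
        headers={k: v for k, v in headers.items() if k.lower() == "authorization"},
        allow_redirects=False,
        timeout=10,
    )
    return resp.ok  # on failure, nginx would answer the client with 403 Forbidden
```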

For the user token validation portion, if this is still needed, Twitcher also has a /twitcher/verify endpoint (added by MagpieAdapter) that allows submitting the user Cookie to verify whether they can be authenticated. https://github.com/Ouranosinc/Magpie/blob/master/magpie/adapter/__init__.py#L131-L140

mishaschwartz commented 1 year ago

Thanks @fmigneault, the info about the twitcher endpoint is very useful. The idea about using nginx to delegate the authentication to other nodes is a good one too and worth considering.

It sounds like you're describing a centralized node that's used by all other nodes for authentication. I'd really like to avoid that if possible since it creates a single point of failure for the network.

I'd like to keep the current status quo of having each node fully responsible for the data/services it provides, including authentication of its users and authorization of its services.

fmigneault commented 1 year ago

@mishaschwartz I agree regarding the single point of failure.

Another option that could be possible is synchronizing user permissions between nodes. @ChaamC has an issue about a permission synchronization functionality using Cowbird (mainly intended to re-trigger/fix invalid permissions between services of a given node), but there could be some sort of outgoing request to another node to create the relevant users and permissions on other instances. The only issue in this case is that each node that wants to sync with another must have some admin-level user on the other node to push new permissions.
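For illustration, a push of this kind could look roughly like the following; the routes and payloads follow Magpie's REST API as I understand it, but treat them as assumptions rather than a tested recipe.

```python
import requests

# Rough sketch of the "push permissions to the other node" idea: the local node
# holds admin credentials for the remote Magpie instance and mirrors a user plus
# a permission there.
REMOTE_MAGPIE = "https://node-b.example.org/magpie"  # placeholder URL

def mirror_permission(admin_user, admin_password, user_name, email, resource_id, permission):
    session = requests.Session()

    # Log in as the admin-level user that the remote node granted to us.
    session.post(f"{REMOTE_MAGPIE}/signin",
                 json={"user_name": admin_user, "password": admin_password},
                 timeout=10).raise_for_status()

    # Create the mirrored user if it does not exist yet (a conflict response
    # simply means it already exists).
    session.post(f"{REMOTE_MAGPIE}/users",
                 json={"user_name": user_name, "email": email,
                       "password": "<placeholder>", "group_name": "users"},
                 timeout=10)

    # Grant the permission on the remote resource for that user.
    session.post(f"{REMOTE_MAGPIE}/users/{user_name}/resources/{resource_id}/permissions",
                 json={"permission_name": permission},
                 timeout=10).raise_for_status()
```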

Another option would be to use [Twitcher's OAuth2 Tokens](https://twitcher.readthedocs.io/en/latest/api.html#module-twitcher.oauth2), but those are purely for Twitcher-controlled access, without any additional capabilities from Magpie related to users.

Yet another approach is to use the Authorization header in the request, which will be handled by Magpie/Twitcher here: https://github.com/Ouranosinc/Magpie/blob/master/magpie/adapter/magpieowssecurity.py#L272-L292 How to make nodes automatically add those headers when targeting another node remains to be defined, but if they are added, the request should work transparently, as if the user had logged in manually on that node beforehand.
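On the open question of how a node could automatically add those headers when targeting another node, one hypothetical approach is to wrap outgoing requests in a session that injects an Authorization header for known network peers; the token-minting callable is a placeholder:

```python
import requests

# Placeholder list of peer nodes in the network.
NETWORK_PEERS = {"https://node-b.example.org", "https://node-c.example.org"}

class NetworkSession(requests.Session):
    def __init__(self, get_token_for_user):
        super().__init__()
        self._get_token = get_token_for_user  # placeholder callable, to be defined

    def request(self, method, url, *args, user_name=None, **kwargs):
        # Inject an Authorization header only for requests targeting peer nodes.
        if user_name and any(url.startswith(peer) for peer in NETWORK_PEERS):
            headers = kwargs.setdefault("headers", {})
            headers.setdefault("Authorization", f"Bearer {self._get_token(user_name)}")
        return super().request(method, url, *args, **kwargs)

# Usage sketch:
# session = NetworkSession(get_token_for_user=lambda user: "<token-from-magpie>")
# session.get("https://node-b.example.org/some/protected/service", user_name="tlogan2000")
```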

huard commented 1 year ago

The only question I have is regarding permission groups. Do those need to be communicated across nodes?

mishaschwartz commented 1 year ago

@huard No they wouldn't need to be communicated.

The idea is that nodes need to be able to communicate to authenticate users, but then all authorization is internal to a specific node. For example, a user might be authenticated by Node A, but be in different permission groups in Node A and Node B since they have different authorization profiles in each.

I don't think that we would want a change in some permission group in Node A to affect Node B since that would take some of the authorization control away from the node administrator of Node B. Does that make sense?

mishaschwartz commented 1 year ago

I'm thinking that the best solution here might be to implement a token system in Magpie that is similar to what we have in twitcher but is aware of users. Like @fmigneault describes here:

Another option would be to use [Twitcher's OAuth2 Tokens](https://twitcher.readthedocs.io/en/latest/api.html#module-twitcher.oauth2), but those are purely for Twitcher-controlled access, without any additional capabilities from Magpie related to users.

We could use a similar mechanism as described here to implement this, but allow passing token values in a header as well as Authorization keys:

Yet another approach is to use the Authorization header in the request, which will be handled by Magpie/Twitcher here: https://github.com/Ouranosinc/Magpie/blob/master/magpie/adapter/magpieowssecurity.py#L272-L292

I think that would allow us to make all changes within Magpie, and the changes would only be additive, so they shouldn't break any backwards compatibility. We could even make the token-passing mechanism optional for Magpie (required for a Marble implementation though) and off by default, so that current users of Magpie in other applications would see no difference.
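To illustrate the "optional and off by default" part, here is a sketch of how this could be wired into Magpie's Pyramid configuration; the setting name and the included module are invented for this example, not actual Magpie configuration keys.

```python
from pyramid.config import Configurator
from pyramid.settings import asbool

def includeme(config: Configurator) -> None:
    settings = config.get_settings()
    # Disabled unless explicitly turned on, so existing Magpie deployments
    # outside Marble would see no behaviour change.
    if asbool(settings.get("magpie.network_token_enabled", "false")):
        config.include("magpie.api.network_tokens")  # hypothetical module adding the token views
```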

tlogan2000 commented 1 year ago

Two quick questions:

  1. In general I like the idea of having common users. If I understand correctly, if I sign up for an account on the Ouranos node I can also log in on the UofT node. Users would probably get better performance if they were able to switch to the JupyterLab instance of the node where the majority of the data of interest is located. In this case, though, I think it would be interesting to investigate some way of syncing user data between the Jupyter accounts. Use case: users will often upload small amounts of other data (GeoJSON files for subsetting, other non-climatic data, etc.) that they need/use while processing climate data. It would be nice if this could easily be transferred back and forth. I'm not sure of the mechanism to make this work, but I think it would be cool.

  2. For protected data access: what happens if the two nodes each accidentally have a shared, overly generic user group, e.g. 'protected' or something similar? Would the UofT node's 'protected' group be able to accidentally access the Ouranos node's "protected" folders and vice versa?

mishaschwartz commented 1 year ago

@tlogan2000

In general I like the idea of having common users. If I understand correctly, if I sign up for an account on the Ouranos node I can also log in on the UofT node.

Yes, mostly... the idea is more that if you have an account on Ouranos, you don't need an account on UofT as well to access some resources on UofT. The node admin at UofT can decide something like: "tlogan2000 from the Ouranos node can have access to resources A, B, and C at UofT". However, to do anything browser-based (that would require setting a cookie), you would need an account at both UofT and Ouranos.

The use-case that I'm imagining here is more "I have logged in to jupyterlab at Ouranos and I'd like to run a weaver workflow on the UofT node".

In this case, though, I think it would be interesting to investigate some way of syncing user data between the Jupyter accounts

This is an interesting idea. I don't love the idea of automatically syncing data between user workspaces across the network because that would involve a lot of data transfer between the nodes that may not be necessary. I wouldn't be opposed to a feature that would allow a user to selectively transfer data between user workspaces in a more transparent way. Though, I think we'd want to implement that feature as a second step: first give users access to multiple nodes, then allow data syncing between nodes. I'd be interested to hear other people's opinions on this as well.

For protected data access: what happens if the two nodes each accidentally have a shared, overly generic user group, e.g. 'protected' or something similar? Would the UofT node's 'protected' group be able to accidentally access the Ouranos node's "protected" folders and vice versa?

No, each node would not be aware of each other's groups. So even if both nodes have a "protected" group, neither node knows anything about the other one. The only information that nodes will share is that a specific user is authenticated on another node, not anything about that user's permissions or group memberships on the other node.

tlvu commented 1 year ago

In this case, though, I think it would be interesting to investigate some way of syncing user data between the Jupyter accounts

This is an interesting idea. I don't love the idea of automatically syncing data between user workspaces across the network because that would involve a lot of data transfer between the nodes that may not be necessary. I wouldn't be opposed to a feature that would allow a user to selectively transfer data between user workspaces in a more transparent way. Though, I think we'd want to implement that feature as a second step: first give users access to multiple nodes, then allow data syncing between nodes. I'd be interested to hear other people's opinions on this as well.

Maybe instead of syncing data between Jupyter instances, which could potentially be quite big and therefore a waste of disk space and bandwidth if there are many instances in our Marble federation, how about sharing those data the way Ouranos currently does with the public/mypublic folder in Jupyter? Currently the public/mypublic folder only works in Jupyter. Sharing across instances means it has to be exposed to the internet with proper access control so that only the same user can access it. Maybe Magpie + Cowbird could do it. I have no idea how much code change is required for this; it's just my brain dump.

For the user sharing between different Magpie instances currently being discussed here, does that require a code change or just additional config?

mishaschwartz commented 1 year ago

For the user sharing between different Magpie instances currently being discussed here, does that require a code change or just additional config?

It would require a code change

tlogan2000 commented 1 year ago

selectively transfer data

@mishaschwartz Sorry for being unclear, but in my mind this would not be an 'auto-sync' but indeed something that the user would do selectively. Users on the Ouranos node can and do generate a reasonable amount of output (.nc files, etc.), and I agree it is likely not a great option to automatically transfer everything that resides in the user workspace.

fmigneault commented 1 year ago

The node admin at UofT can decide something like: "tlogan2000 from the Ouranos node can have access to resources A, B, and C at UofT". However, to do anything browser-based (that would require setting a cookie), you would need an account at both UofT and Ouranos.

Behind the scenes, regardless of the approach, a pseudo "tlogan2000-ouranos" (could be any name, a UUID, etc.) would have to be created as a Magpie "user" to resolve permission access. Any request (browser-based or not) requires some kind of "user" to resolve permissions against. Even when validating a group permission, it is the user's membership in the given groups that is resolved for access rights. The algorithm cannot grant access if it has nothing to resolve against (i.e., the "user"). The Cookie, Basic Auth, Bearer token, etc. are only the methods to indicate who that "user" is, but they are essentially equivalent once the identity (authentication) has been resolved. The authorization part requires a "user". Note that this "user" could simply be a concept, a bot, or whatever else, not necessarily an actual person with a profile. Therefore, there would always be at least one "user" account for each node.
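As a trivial illustration of the pseudo-user idea (the naming scheme is arbitrary, as noted above):

```python
def network_user_name(remote_user: str, node_name: str) -> str:
    # Deterministic local Magpie account name that carries the remote user's
    # permissions on this node; it could just as well be a UUID.
    return f"{remote_user}-{node_name}".lower()

assert network_user_name("tlogan2000", "ouranos") == "tlogan2000-ouranos"
```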

Sharing across instances means it has to be exposed to the internet with proper access control so that only the same user can access it. Maybe Magpie + Cowbird could do it. I have no idea how much code change is required for this; it's just my brain dump.

For the user sharing between different Magpie instances currently being discussed here, does that require a code change or just additional config?

Reading from a protected HTTP endpoint per user is something being worked on by @ChaamC. It would be supported by Cowbird granted that https://github.com/bird-house/birdhouse-deploy/tree/master/birdhouse/optional-components/secure-data-proxy is properly configured (relates to https://github.com/bird-house/birdhouse-deploy/pull/360).

However, writing to that location (i.e., pushing data to the other node) is not supported, since WPS outputs are not intended for this purpose. This affects some of the design choices defined in the current implementation, which impacts how to manage HTTP vs. filesystem file/directory access, which is not trivial with the multi-service permission synchronization that Cowbird must accomplish.

mishaschwartz commented 1 year ago

@fmigneault

Therefore, there would always be at least one "user" account for each node.

Yes, good point. The plan is to create one "anonymous" user for the network, and one for each node.

mishaschwartz commented 1 year ago

I'm going to start working on a change to Magpie in order to implement the ideas discussed here

fmigneault commented 1 year ago

Let me know if you encounter an issue regarding Magpie/Twitcher.