IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
882 stars 494 forks source link

Handle unregistered users in BearerTokenAuthMechanism and implement user registration mechanism #10972

Open GPortas opened 3 weeks ago

GPortas commented 3 weeks ago

What this PR does / why we need it:

This PR includes a mechanism for registering new users in Dataverse through a Bearer Token.

BearerTokenAuthMechanism

The Bearer Token authentication functionality (available through feature flag) has been extended to properly handle cases where BearerTokenAuthMechanism successfully validates the token but cannot identify any Dataverse user because there is no account associated with the token. See the current development version of BearerTokenAuthMechanism. (https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/api/auth/BearerTokenAuthMechanism.java#L58).

The BearerTokenAuthMechanism has been extended to return a meaningful error in the scenario described above. Specifically, the calling user would receive a 403 error, indicating that the token is validated but the user is not registered in Dataverse.

Additionally, the entire BearerTokenAuthMechanism class has been refactored, moving authentication and token verification logic to AuthService, since this logic is also utilized by the new user registration functionality included in this PR, and it also results in a simpler version of BearerTokenAuthMechanism, more testable and with more limited responsability.

User registration endpoint

The new endpoint takes the bearer token and a UserDTO object with the required properties to create a user account in Dataverse. It only works if the bearer token feature flag is enabled. A RegisterOidcUserCommand handles the validation of the bearer token and user data, and the account creation process. It can return various errors, including a new InvalidFieldsCommandException, which lists any validation errors in the UserDTO. This exception is now handled by AbstractApiBean like other command exceptions and can be used for future purposes.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Visual inspection of the API integration tests. Overall, this PR is extensively tested with unit and integration tests, but it is in the ITs where we can see how the endpoint calls work and the different responses that can be obtained.

We can also run this branch locally and reproduce the same calls made in the ITs using curl.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Yes, attached.

Additional documentation:

None

github-actions[bot] commented 3 weeks ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 3 weeks ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

coveralls commented 1 week ago

Coverage Status

coverage: 21.941% (+0.09%) from 21.856% when pulling dce7edf437b8dc3414e0e10890ccfba835652b55 on 10959-bearer-token-auth-ext into 61b8046f59565a93b588816c015ebb880f1b91b5 on develop.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

johannes-darms commented 1 week ago

@GPortas Thanks for the PR!

There is an UserInfo Endpoint to obtain information (so called claims) about an OIDC user and in many cases properties are already embeeded into the passed Token and just need to be extracted. Further, it is possible to request consent before transmission of personal data to a thrid party (https://openid.net/specs/openid-connect-core-1_0.html#Consent). Hence, the OIDC specificaiton provides all the needed parts to create user accounts.

I'd rather have an implementation that's inline with the specification than a custom JSON that's sent to an API, because someone needs to get the information for the AuthorisationServer, transform it into the custom JSON and initiate the registration request. This extra code is unnecessary IMHO. There is already a working [WIP] (https://github.com/IQSS/dataverse/pull/10278) for this.

GPortas commented 1 week ago

@GPortas Thanks for the PR!

There is an UserInfo Endpoint to obtain information (so called claims) about an OIDC user and in many cases properties are already embeeded into the passed Token and just need to be extracted. Further, it is possible to request consent before transmission of personal data to a thrid party (https://openid.net/specs/openid-connect-core-1_0.html#Consent). Hence, the OIDC specificaiton provides all the needed parts to create user accounts.

I'd rather have an implementation that's inline with the specification than a custom JSON that's sent to an API, because someone needs to get the information for the AuthorisationServer, transform it into the custom JSON and initiate the registration request. This extra code is unnecessary IMHO. There is already a working [WIP] (#10278) for this.

Hi @johannes-darms,

Thank you for your message. Since certain claims in the OIDC UserInfo endpoint—such as email—are optional and can vary by identity provider, I’ve decided not to rely on this endpoint for retrieving user data. Consistent access to the user’s email is necessary for registration in Dataverse, so I’ve implemented a custom endpoint to ensure we receive all required information reliably during user registration.

In many cases, such as when using Keycloak, user information can be obtained directly from the JWT access token on the client side (e.g., email, first name, last name), as the necessary claims are included—eliminating the need for additional calls. If the token does not include these claims, we can then request the information from the identity provider’s UserInfo endpoint (if supported), or manually complete any missing data.

johannes-darms commented 1 week ago

The set of standard claimslist email so its IHMO save to assume this property is provided. Do you have an example where this property is not provided?

I completely understand your point and the reasoning behind it. However, with this approach we lose a bit of the interoperability that the use of standards and their protocols gives us. As an alternative, we could tell people that their authentication server must support these claims in order to be used with Dataverse?

GPortas commented 1 week ago

@johannes-darms

While the email claim is part of the OIDC standard, not all identity providers include it by default. Some IdPs (e.g., Keycloak, Auth0) allow administrators to configure which claims are returned so it is up to the IdP’s configuration, user consent, and requested scopes whether or not the email claim is made available.

Forcing authentication servers to support certain claims could generate pushback, as decentralized authentication would then depend on the specific needs of one application (Dataverse).

GPortas commented 1 week ago

Moving this to ready for review. Let's open the review and discussion to the team.

qqmyers commented 1 week ago

FWIW: Both Google and ORCID allow people to create accounts where there is no email available (Google doesn't require it, ORCID allows you to not share it publicly/with apps). Google further allows accounts w/o a last name. At QDR, we use the UserInfo endpoint, but we still need a form to get email/last name if they don't exist before we can create an account. (QDR code isn't directly relevant as we do single sign on with Drupal and the form is on that side rather than Dataverse.) It looked like it may be possible to have Keycloak ask the user for missing information, which could have avoided custom code on our end, but we already had the Drupal form and have some other ~demographic questions we ask, so it was easier to do it there.

johannes-darms commented 1 week ago

As I said, I understand your reasons and they make sense. However, it is a very pessimistic and certainly necessary fallback implementation to assume that the authentication server will not fulfil the requirements and the user will have to provide most, if not all, of the information. But in many cases the authentication server provides all the information, or at least can be customised to provide the information needed (In keycloack I can configure the claims per client), and in these cases a dedicated json is not required at all and account generation could work without user interaction or forms, and this advantage is unfortunately not fully exploited with this implementation.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 week ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 3 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 3 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 3 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 3 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 2 days ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 day ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

github-actions[bot] commented 1 day ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

GPortas commented 1 day ago

Thank you for your detailed review (This is a big PR!). We’ve implemented several improvements. Please take a look at the latest changes and let me know if you have any additional suggestions. @qqmyers

github-actions[bot] commented 1 day ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:10959-bearer-token-auth-ext
ghcr.io/gdcc/configbaker:10959-bearer-token-auth-ext

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

johannes-darms commented 8 hours ago

Looks good. As discussed in other comments, I think/agree getting more field types from the claims makes sense and will probably be needed at some point, but this code solves the immediate issue and gives code that can be built on.

I totally agree. This PR improves OIDC compatibility quite a bit and builds a solid foundation for further improvements (getting more information from an authorisation server, possibly via configurable claims/scopes per provider).

The only conceptual problem I see with the approach is that the user is not asked for consent before the information flow. This is part of the specification . The required processes and attributes are part of the standard and could be "just" used instead of reinventing the wheel with custom json properties. Terms and conditions, policies and other information that the authorisation server can display can be provided during client registration (https://openid.net/specs/openid-connect-registration-1_0.html#toc) (see Keyclack OID Client -> Advanced>Fine grained OIDC configuration).

IHMO: As stated above, if the required properties (scopes/claims) are not provided by the OIDC server (can be confiured per client) or released by the user for use by a dataverse, the OIDC server configuration is simply not compatible or the user is not willing to participate. Implementing UI/APIs around it to obtain information via another channel does work, but requires additional code, circumvents the standard and somehow questions the use of a protocol with the sole purpose of obtaining identity information.

qqmyers commented 2 hours ago

I think we do still ask the user for consent, it is just that the full terms are not presented there. In some sense, I see this discussion about whether it is better to have Keycloak ask for missing information, or whether the Dataverse client can do it. I'm not sure why one is better than the other (though both are possible). In some sense the argument that you should use Keycloak for this can just be applied further - if you need Keycloak to add info not supplied by, for example, ORCID, you should just require ORCID to add the fields you need or not use it. My guess is that the use case where a Dataverse instance would like to use an OIDC service that is not under their control enough for it to be configured to supply all the fields we'd like is real and having the API/SPA able to handle that case w/o requiring Keycloak seems reasonable. That said, we should definitely support a fully configured Keycloak in the middle too.

I think that means, minimally to make the terms acceptance in json be configurable by flag now, and, in the future, add support for the optional affiliation/role fields to come from claims?