user_id format and uniqueness guarantees

soxofaan commented 3 years ago

https://github.com/Open-EO/openeo-api/blob/f303d65a3291d4cd74dacc0e796803bb5d6fa03b/openapi.yaml#L5288-L5295

generated internal identifier from the provider

In case of OpenID Connect in general and EGI Check-in in particular (as used for openEO Platform), the only guaranteed claim that is usable as unique identifier is the sub claim, which is not guaranteed to follow the regex ^[\w\-\.~]+$. For example, EGI Check-in returns an email-looking sub, e.g. 9433eda330....@egi.eu.

The only official specification I found so far for the sub claim is in https://openid.net/specs/openid-connect-core-1_0.html#UserInfoResponse:

It MUST NOT exceed 255 ASCII characters in length.

Which is pretty liberal, and could be even read as "any byte string of at most 255 bytes will do". No real-world OIDC provider will probably use weird ASCII control characters, but they still might use normal characters that cause trouble when used in certain parts of the API (URL paths, process namespaces, ...) because there are used as separators or something like that (/, :, @ ...)

Another related, but smaller, more theoretical problem is that openEO back-ends are allowed to support multiple OIDC providers, which could break the uniqueness of the user ids across multiple providers (and/or basic authentication).

m-mohr commented 3 years ago

When writing this I expected that the back-ends have separate user storage databases or similar where they create new users once a new user registers (logins) via an external IP. In this case, a valid ID can be generated. Is that not true here? You need to store user-related data (name? contact details? billing? plan? settings? ...) anyway or do you directly store all user-related (meta-)data with EGI?

soxofaan commented 3 years ago

The API spec currently states vaguely

This is usually a randomly generated internal identifier from the provider

So the problem is not in the API spec itself, and it's up to the back-end implementer to workaround the problems mentioned above, but it's still a bit misleading and hiding some implementation details that are worth mentioning.

m-mohr commented 3 years ago

This is usually a randomly generated internal identifier from the provider

This is actually not an ideal remark in the spec. Those user IDs were meant to be used in the future for sharing etc and thus something "readable" would be great, like @soxofaan or @m-mohr here on GitHub. Maybe we should change the wording with regards to that. I doubt we should use the OIDC subs for such things.

soxofaan commented 3 years ago

separate user storage databases or similar where they create new users once a new user registers (logins) via an external IP. In this case, a valid ID can be generated. Is that not true here?

Probably but not necessarily (e.g. the current version of openeo-aggregator backend is stateless and just uses/forwards user identification tokens provided by OIDC). Also, the whole point of OIDC is to offload all the challenges of user management to an OIDC provider, and if that OIDC provider provides a reliable unique user identifier why couldn't a backend use that directly?

m-mohr commented 3 years ago

Probably but not necessarily (e.g. the current version of openeo-aggregator backend is stateless and just uses/forwards user identification tokens provided by OIDC). Also, the whole point of OIDC is to offload all the challenges of user management to an OIDC provider ...

Yes, but the aggregator just proxies to other back-ends, which itself have to handle users somehow. I mean you need to check whether someone has paid their bills or store billing info etc. That's usually something you don't do via an IP (EGI might be an exception here, but think about login via Google, FB, GH etc).

if that OIDC provider provides a reliable unique user identifier why couldn't a backend use that directly?

Sure, if that's a good identifier that complies to the rules you can use it. But what happens if I delete my account at EGI and want to use another auth provider instead? That happens from time to time if you login via multiple IP providers. At least there are setups like that.

soxofaan commented 3 years ago

This is usually a randomly generated internal identifier from the provider This is actually not an ideal remark in the spec.

Actually: the "provider" in that statement: does that refer to OIDC provider or to the openEO back-end as provider?

Those user IDs were meant to be used in the future for sharing etc and thus something "readable" would be great, ... I doubt we should use the OIDC subs for such things.

note that the description currently also states this as well:

not meant for displaying purposes.

m-mohr commented 3 years ago

Actually: the "provider" in that statement: does that refer to OIDC provider or to the openEO back-end as provider?

back-end

note that the description currently also states this as well:

not meant for displaying purposes.

Yeah, that was not written very well. What I meant to say here that for the response { user_id 'jon_doe123', name: 'John Doe'} the UIs should display 'John Doe" and not the id.

soxofaan commented 3 years ago

To give a bit of context why I'm bringing this up: I am working on UDP sharing (just simple non-fine-grained public sharing) so I am looking for some kind of user handle that is guaranteed to be unique and does not mess with the syntax of URLs, namespaces, ...

m-mohr commented 3 years ago

So it was meant to be (and should be clarified to be?) something like:

A unique user identifier at the back-end provider. It is meant to be used as an identifier in URIs (e.g. for sharing purposes). To display a name in a UI where the actual name of the user should be shown, prefer to show the name instead of the id.

m-mohr commented 3 years ago

For your use-case, the user id was meant to be used, indeed. That's why the regex is in place.

soxofaan commented 3 years ago

Actually: the "provider" in that statement: does that refer to OIDC provider or to the openEO back-end as provider? back-end

ok good to know, I was reading it as OIDC provider

m-mohr commented 3 years ago

Okay, I'll prepare a PR that will clarify this. Probably an improved version of what I've quickly drafted an posted above: https://github.com/Open-EO/openeo-api/issues/404#issuecomment-870659662

soxofaan commented 3 years ago

Thanks for the feedback and clarification.

Overall I think there is some kind of missing link in the API spec or its documentation:

if I understand correctly: a back-end implementation SHOULD have some kind of user registration mechanism:
- in addition to standard OIDC authentication/authorization, where at least some kind of "user id" is constructed and possibly other properties (billing, privacy agreements, ...) are set up
- (or to support basic authentication)
how and when should this happen (before first usage of a backend, automatically on first access, ...)?

Of course that whole additional registration flow should not be part of the openEO API, but apparently some aspects should be specified or documented in some way, e.g.

the fact that there SHOULD be some kind of additional registration (e.g. not only for technical but also for legal reasons)
user_id regex
the behavior of API when user is authenticated with OIDC but not yet "registered" with back-end? Should there be HTTP forwarding mechanisms or dedicated error codes so that clients can streamline the user experience?

m-mohr commented 3 years ago

if I understand correctly: a back-end implementation SHOULD have some kind of user registration mechanism:

Depends on the back-end structure. If there's an existing user pool, it could be that no additional things are required (example: GEE).

By the way, have you read https://open-eo.github.io/openeo-api/draft/index.html#tag/Account-Management ? It also mentions OpenID Connect User Registration (although that's more useful if back-end = IP).

how and when should this happen (before first usage of a backend, automatically on first access, ...)?

Depends. back-ends are free to choose here. What probably doesn't work very well in openEO in contrast to other OIDC services is the "on-demand" registration once a new OIDC sub is approaching the back-end. Simply not possible because we have multiple clients and the clients don't know how the registration for the back-end should work or look like. So a user must be either automatically registered or register before the back-end is actually used. The API does not mandate any behavior here.

the fact that there SHOULD be some kind of additional registration (e.g. not only for technical but also for legal reasons)

see the link above

user_id regex

What's the to-do here?

the behavior of API when user is authenticated with OIDC but not yet "registered" with back-end? Should there be HTTP forwarding mechanisms or dedicated error codes so that clients can streamline the user experience?

see above

m-mohr commented 3 years ago

The commit above adds a clarification for user_id.

Additionally, new Relation types were added:

GET /: create-form to link to the registration page
GET /: recovery-form to link to the credentials recovery page.
GET /me: New Relation types alternate and related for user-specific external pages. #404

I hope this covers (most/all) aspects of this issue. Otherwise, please let me know.

Open-EO / openeo-api

user_id format and uniqueness guarantees #404