Allowed characters: User IDs and namespaces

m-mohr commented 1 year ago

Currently, the following is allowed in the API:

process namespaces (in process graphs) are not restricted wrt characters
user IDs are allowed to be: ^[\w\-\.~]+$

In #348 it became obvious that the user ID is likely to be used in process namespaces, but with #348 they are also meant to be part of the URI path. While you can use URI encoding, it's not ideal to allow all characters in the path segment, especially / is an issue here (e.g. due to https://github.com/OAI/OpenAPI-Specification/issues/892).

Some things we've seen in implementations:

in the VITO implementation an "e-mail address" like construct (the EGI id) is used as namespace and user ID(?), which would not be valid right now.
In JS we do parsing of formulas, which includes processes and their namespaces. Thus symbols from math are problematic in namespaces (and we already run into inconsistencies with the -). The format is usually process@namespace.

I think the user ID and process namespace should cover the same range of allowed characters. While we can probably extend the allowed user IDs without issues in the API, but it might lead to issues when you store files under a folder with the user ID. We should probably restrict the namespace, but we can't enforce it directly in the process graphs in 1.x. Thus we could implicitly do so through #348 (and then it only applies if you implement the endpoints) and only later also apply it to the process graph.

The following set of characters could be allowed for both: ^[\w\-\.~\@]+$ (the : is not allowed in Windows filenames)

soxofaan commented 1 year ago

regarding

in the VITO implementation an "e-mail address" like construct (the EGI id) is used as namespace and user ID(?), which would not be valid right now.

@SerRichard @LukeWeidenwalker : In the EODC implementation you use a user id like us-28h3d94d-c534-4cb0-a4ec-7ae3k2 (e.g. as returned by GET /me). I guess you use an internal mapping to/from EGI user ids for that (or is it some kind of direct hashing/encoding?). Apart from adhering to ^[\w\-\.~]+$ , do you know if there is/was any other reason not to use the EGI user id directly? (e.g. to have ability to link multiple EGI user ids to single account?)

SerRichard commented 1 year ago

@soxofaan Both assumptions for these questions would be the reasons I'd offer for why it was set up like this. Currently we are undergoing a refactor for the api implementation on our side. This is actually a workflow we are likely to keep in place, do you know any obvious reasons not to do this? Though, from #348 my understanding is a backend has some internal id decoupled from the external ids given by providers?

@m-mohr Is there a reason the process namespaces can't be defined by the user?

I'm thinking using an id would probably lead to all back-ends effectively having different values for "/processes/{namespace}/{process_id}". Would it not also be an option to have an endpoint for creating a namespace? With requests going through the aggregator, then only namespaces that have not been used by any backend can be considered available. To me this approach could be simpler programmatically than the aggregator having to locate and merge diverging namespaces from different back-ends. I think this suggestion requires more work on side of definitions, but it could be viable?

m-mohr commented 1 year ago

The aim was to keep the specification slim. A whole new set of CRUD endpoints for process namespaces is not necessarily lightweight, especially if you factor federation in.

But federation was not actually thought of when we were defining this. Back in the days, it was simply a 1:1 mapping on a single back-end so this approach made sense. By the way, /processes/{namespace} is just part of an old PR as of now, it's not yet part of the API at all (except for the namespace attribute in process graphs itself).

Self-defining namespaces could still lead to conflicts in the aggregator though (IDs as well, but less likely). Currently, many people for example are still wokrking directly on VITO and if they define something that already has been defined on EODC for example, you get a conflict in the aggregator (assuming public access).

Open-EO / openeo-api

Allowed characters: User IDs and namespaces #478