CLARIAH Requirements for infrastructure and Software/Services

proycon commented 3 years ago

In this requirements branch we're working on requirements for the CLARIAH infrastructure and for Software/Services, as described in issue #4. This pull request tracks all changes. Feel free to comment either here or in issue #4. Very specific comments on the textual contents are better suited to be placed here, whereas generic observations are better suited for issue #4.

Feel free to simply push to this branch with your contributions (or an extra pull request if you prefer).

Note: this pull request is a work in progress (draft) tracking the relevant changes and should not be merged until ready.

ddeboer commented 3 years ago

Note: this pull request is a work in progress tracking the relevant changes, DO NOT MERGE until the WIP marker is removed from the subject.

You can convert this to a draft PR to prevent it from getting merged accidentally.

proycon commented 3 years ago

You can convert this to a draft PR to prevent it from getting merged accidentally.

Ha thanks, that was the option I was looking for but couldn't find :) I was already surprised and wondering whether GitHub had it at all (in GitLab it is much easier to find).

proycon commented 3 years ago

@ddeboer Thanks for the feedback, I have processed your suggestions.

proycon commented 3 years ago

This pull request should be ready enough for review now I think (still keeping draft status though). Feedback from everyone would be much appreciated.

jblom commented 2 years ago

@proycon, all: about the required maintainer. We could also include a recommendation for providing a CODEOWNERS file. What do you think? https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners

ddeboer commented 2 years ago

I agree with @jblom that we should require the maintainer to be documented. As we already require CodeMeta, and that has a maintainer predicate, we may want to prefer that over GitHub’s CODEOWNERS file. Of course we can still recommend that source code that is hosted on GitHub additionally has the CODEOWNERS file, because that offers some nice features such as auto-requesting PR reviewers.
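To make the CodeMeta route a bit more concrete, here is a minimal sketch of how a documented maintainer could be checked. It assumes the metadata sits in a `codemeta.json` whose `maintainer` property is either a single object or a list; the function name `find_maintainers` and the sample record are made up for illustration and not part of any CLARIAH tooling.

```python
import json


def find_maintainers(codemeta: dict) -> list[str]:
    """Return the names of the maintainers listed in a CodeMeta record.

    Assumption: `maintainer` holds a single object or a list of objects
    (JSON-LD allows both), or plain strings.
    """
    value = codemeta.get("maintainer", [])
    entries = value if isinstance(value, list) else [value]
    names = []
    for entry in entries:
        if isinstance(entry, dict):
            # Prefer a full name, fall back to given name + family name.
            name = entry.get("name") or " ".join(
                part
                for part in (entry.get("givenName"), entry.get("familyName"))
                if part
            )
            if name:
                names.append(name)
        elif isinstance(entry, str) and entry:
            names.append(entry)
    return names


# Hypothetical sample record, not taken from a real project.
SAMPLE = json.loads("""
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "example-tool",
  "maintainer": {"@type": "Person", "givenName": "Jane", "familyName": "Doe"}
}
""")

print(find_maintainers(SAMPLE))
```

An empty result would mean the requirement is not met; a CODEOWNERS file could still be recommended on top of this for GitHub-hosted code.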

proycon commented 2 years ago

@jblom @ddeboer Good points. Does the above commit address this sufficiently?

jblom commented 2 years ago

@proycon @ddeboer yes, looks great now, with all points addressed

proycon commented 2 years ago

So this requirement makes a lot of sense for applications geared towards academic users. But what about applications with other types of users? Case in point: NDE’s Solid Collection Registration System, in which small heritage institutions manage their collections. What would be the added value for this type of application in using SATOSA rather than directly connecting to some external IdP such as Auth0?

So should we:
generalise this requirement to be about OIDC-compatibility on the part of the application, so it can be connected to any OIDC IdP?
or document more clearly how to use CLARIAH auth with non-academic users (I guess BenG is an example of this)?

@ddeboer Very good point, I completely agree and have tried to raise this issue before: non-academic users that don't yet have an account in the federated authentication infrastructure need to be able to get (immediate) access as well. This is in fact part of the reason we held on to our own legacy registration system for the tools & services in Nijmegen. If we opt for a single CLARIAH-wide identity provider, I think it must have the additional option for new users to register (and immediately or after simple mail verification have an active account, that is, not hindered by a human in the loop). If participating services want to exclude such non-academic users, they can still do so based on authorization details. I don't know who's making the decisions for this, perhaps @janpieterk and @mmisworking can tell more?

menzowindhouwer commented 2 years ago

Non-academic users can use the CLARIN IdP, and register here: https://user.clarin.eu/user/register

proycon commented 2 years ago

Non-academic users can use the CLARIN IdP, and register here: https://user.clarin.eu/user/register

Thanks for the quick reply! I indeed knew about that one, but I think it had a human in the loop that needs to verify the registration, right? (At least it used to be that way when I registered long ago, and the text still hints at it: "After your registration is processed (normally within two working days)".) I think such a delay is not acceptable if a user wants to use a service: users expect to use it immediately or they lose interest and leave. (We used to have a similar verification stage in Nijmegen and got rid of it for the same reason.) I'd do it the other way round: give users access immediately after registration, but notify a human to keep an eye on registrations and revoke permissions (and possibly set IP bans etc.) if needed.
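As a toy sketch of the "other way round" model (activate first, review afterwards): everything here is hypothetical, the `Registry` and `Account` names are made up, and no real IdP API is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    email: str
    active: bool = True  # usable right after registration, no waiting period


@dataclass
class Registry:
    accounts: dict[str, Account] = field(default_factory=dict)
    review_queue: list[str] = field(default_factory=list)

    def register(self, email: str) -> Account:
        """Create an active account and queue it for later human review."""
        account = Account(email)
        self.accounts[email] = account
        self.review_queue.append(email)  # a human is notified, but does not block access
        return account

    def revoke(self, email: str) -> None:
        """Called by the reviewer when a registration turns out to be abusive."""
        self.accounts[email].active = False


registry = Registry()
user = registry.register("newuser@example.org")
print(user.active, registry.review_queue)
```

The point is only the ordering: the review step exists, but it happens after the user already has access.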

menzowindhouwer commented 2 years ago

We can propose this approach to CLARIN.eu; maybe they are willing to switch to this model. Can you propose it to accounts@clarin.eu? I think @dietervu also listens to that one ...

roelandordelman commented 2 years ago

For Media Suite, the CLARIN IdP route is often used to give users without a university account a temporary login. However, apart from the manual step at CLARIN, we also have to whitelist the person with the CLARIN account. In practice this manual operation is not a problem. In fact, in my opinion it would often be a requirement. For example, at NISV only scholars are allowed access to NISV data, so the CLARIN users should belong to an "academic" user group. Also, we only provide temporary access via CLARIN (by enabling removal after a certain period of time). As another example, other collection owners could grant access per individual user and individual collection, e.g., known professor X is allowed to access collection Y.

Whether there is a CLARIN IdP or something else, there will be a need for manually assigning access levels to individuals based on their credentials (and/or even membership of a CLARIAH organisation?). So I would not be in favour of the model @proycon describes (turning it around). Ideally, a request from a non-academic user for access to a collection should be distributed via a CLARIAH-wide service to a local operator at a collection owner, who checks the request and grants it or not based on pre-defined criteria (established in a large agreement). Some of these criteria may be handled automatically though. For example, members of an organisation that is a member of CLARIAH but not academic are granted access automatically via their IdP.

I think it is also key that access is blocked/granted at exactly the right spot. E.g., NISV metadata can be searched via Media Suite by anyone, whereas viewing content or analysing metadata via Jupyter Notebooks is restricted. So the blocking/granting part should be placed at the viewing level (or in environments where users are working with JNs), as is the case in WP5 currently.
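A rough sketch of what blocking/granting at the viewing level could look like. This is not how Media Suite implements it; the `User` type, the action names and the `"academic"` group label are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class User:
    name: str
    groups: frozenset = frozenset()
    access_until: Optional[date] = None  # temporary access; None means no expiry


def may_perform(user: Optional[User], action: str, today: date) -> bool:
    """Decide per action, so that search stays open and viewing is restricted."""
    if action == "search":
        return True  # metadata search is open to anyone, logged in or not
    if user is None:
        return False  # viewing and notebooks require a login
    if "academic" not in user.groups:
        return False  # group membership is assigned manually by an operator
    if user.access_until is not None and today > user.access_until:
        return False  # temporary access has expired
    return True


guest = User("guest", frozenset({"academic"}), access_until=date(2022, 6, 30))
print(may_perform(None, "search", date(2022, 7, 1)))
print(may_perform(guest, "view", date(2022, 6, 1)))
print(may_perform(guest, "view", date(2022, 7, 1)))
```

The check is made where content is viewed or analysed, not at login, which is what keeps search usable for everyone.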

proycon commented 2 years ago

Yes, I completely understand the need for proper checks, especially in case of sensitive data, but as you said, there is a need for manually assigning access for such services anyway. I'm not saying we shouldn't do that, it's just that in some cases you might not want it and right now that's impossible. So my concern is with the services that don't need much authorization but only need some simple authentication.

For example, the CLST RUN services attract a fair number of outside non-academic users, including private individuals and even commercial parties, who just want to try out the service. They are mostly processing services (like the ASR) and users bring their own data, so we don't have much to protect there. So as long as the demand on our resources is pretty insignificant, we're fine with anyone trying our services. When people come in hordes and overwhelm the servers we'd probably reconsider ;) but right now we're happy with every user that finds our stuff useful. An activation barrier with human verification would hinder people from trying out the service (people have short attention spans and lose interest quickly anyway, I'd do the same).

I suppose if CLARIAH/CLARIN doesn't provide this function, which is fair enough of course, the other option we have is to rely on an extra identity provider to accommodate such users, or to set up and manage our own. That does pose some extra technical challenges, and the user will then have to explicitly choose whether to use the CLARIN IdP or whatever else we provide.
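To illustrate the "user has to choose" part, here is a minimal sketch of a service that knows about two identity providers. It assumes both speak OIDC (as suggested earlier in the thread); the issuer URLs are placeholders and the names `IdentityProvider`, `PROVIDERS` and `choose_provider` are made up. The only standard bit is the discovery path defined by OpenID Connect Discovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityProvider:
    label: str   # shown to the user on the login page
    issuer: str  # OIDC issuer URL


def discovery_url(idp: IdentityProvider) -> str:
    """Location of the provider metadata as defined by OpenID Connect Discovery."""
    return idp.issuer.rstrip("/") + "/.well-known/openid-configuration"


# Hypothetical issuers, placeholders only.
PROVIDERS = {
    "federated": IdentityProvider("Institutional / CLARIN login", "https://idp.example.org/"),
    "open": IdentityProvider("Self-registered account", "https://open-idp.example.org/realm"),
}


def choose_provider(key: str) -> IdentityProvider:
    """Look up the provider the user picked on the login page."""
    try:
        return PROVIDERS[key]
    except KeyError:
        raise ValueError(f"unknown identity provider: {key!r}") from None


print(discovery_url(choose_provider("open")))
```

Every extra entry in that table is another login option to explain to users and another configuration to maintain, which is the extra complexity meant above.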

Btw, the discussion deviates a bit from @ddeboer's original point (I guess we should have made a separate issue), which was whether we want to require all services in our infrastructure to use the CLARIAH/CLARIN authentication service (which I think we do). Adding additional IdPs would not violate this either (but avoiding them has my preference, as that's simpler to implement).

proycon commented 2 years ago

Is the path docs/requirements/ still accurate after the move to the clariah-plus repo? We now have top-level use-cases/ so should we make requirements/ top-level too?

No, this will have to be resolved/rebased when we merge this into the main branch. Perhaps the time has come to accept these proposals in the main branch and continue work from there; they have been open long enough and discussion seems to have stagnated a bit. What do you think? (also @roelandordelman)

ddeboer commented 2 years ago

Yeah, let’s merge this and do any follow-up work in subsequent PRs.

proycon commented 2 years ago

This PR is now merged, see the contents here https://github.com/CLARIAH/clariah-plus/tree/main/requirements

CLARIAH / clariah-plus

CLARIAH Requirements for infrastructure and Software/Services #5