Formally adopting Software & Infrastructure requirements (request for comment)

proycon commented 2 years ago

Last year we started formulating software & infrastructure requirements for CLARIAH. We'd now like to take this to the next phase: adopt these requirements and actively prescribe them for all our tool providers. This will have a direct impact on which tools can be part of the CLARIAH infrastructure. The requirements are meant to provide clarity for all and facilitate interoperability. This decision will be made by the Technical Committee (or possibly also the CLARIAH board), but before that we want to make sure we all agree on the contents. We therefore want to give everybody a renewed opportunity to comment on the current texts; feel free to suggest any amendments. We plan to discuss these software & infrastructure requirements in a session on the next Tech Day (May 25th), so please prepare any feedback you may have by then, ideally by simply responding to this issue.

Additionally, the FAIR Tool Discovery track is producing "Software Metadata Requirements" on which we also want to request feedback (although this is not entirely complete yet and a work in progress). The FAIR Datasets track (and IG Curation) may also come with some data-specific requirements at a later stage.

The two main documents under discussion:

The additional metadata requirements still in development:

Software Metadata Requirements

jblom commented 2 years ago

@proycon @ddeboer I've read through the software and service requirements and the infrastructure requirements once again, and the only thing I'm not really sure about is why all configuration MUST be in environment variables.

I take it, this is taken from: https://12factor.net/config

Organizing configurations in e.g. .env files is not as convenient as using e.g. yml files or something else that supports more structure. Also the risk of committing .env files to GH is just as high as committing a config file.

Also I don't really see how env variables solve the problem with differences (of configs) between deployments/environments:

Either one deploys to environment x with different values for the app's env vars OR one deploys to environment x with different values of the app's config file... (seems the same to me)

Note that I assume the settings in the config are always the same regardless of each environment/deployment, it's just the values of those settings that MIGHT be different in different deployments/environments

proycon commented 2 years ago

Good point. That's indeed something that might require further clarification, and it may be worded a bit too strongly as it currently stands. I think the most important point here is that the configuration MUST be separated from the code base itself, and MUST be separated from the container image as a whole (i.e. the same container image can be deployed in multiple places, rather than having to be rebuilt for every specific deployment). Any deployment-specific details can be easily tweaked by setting environment variables, as these are very easy to pass to a container.
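To make the separation concrete, here is a minimal sketch of an application reading its deployment-specific values from environment variables at start-up. The variable names (`APP_DATABASE_HOST`, `APP_DATABASE_PASSWORD`, `APP_BASE_URL`) and the `DeploymentSettings` class are hypothetical and not taken from the requirements documents; they only illustrate the pattern.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentSettings:
    """Values expected to differ per deployment (hypothetical example set)."""
    database_host: str
    database_password: str
    base_url: str


# Hypothetical variable names, chosen for illustration only.
REQUIRED_VARIABLES = ("APP_DATABASE_HOST", "APP_DATABASE_PASSWORD", "APP_BASE_URL")


def load_settings(environ=os.environ) -> DeploymentSettings:
    """Read deployment-specific values from the environment.

    Fails early with a clear message if a required variable is missing,
    so a misconfigured container stops at start-up rather than later.
    """
    missing = [name for name in REQUIRED_VARIABLES if name not in environ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return DeploymentSettings(
        database_host=environ["APP_DATABASE_HOST"],
        database_password=environ["APP_DATABASE_PASSWORD"],
        base_url=environ["APP_BASE_URL"],
    )
```

Because nothing deployment-specific is baked into the image, the same image can be started in another environment with a different set of variables.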

> Organizing configurations in e.g. .env files is not as convenient as using e.g. yml files or something else that supports more structure.

I see your point, yes, and I agree that these environment variables are not a substitute for more structured configuration files. There may be a fair degree of complexity in application-specific configuration files; the idea is definitely not to translate all of that to plain environment variables and do away with the configuration files entirely. They exist for good reasons, after all. The idea is just to have some sort of universal abstraction for the main deployment-specific parameters, where environment variables are the simplest commonly shared solution that offers a high level of granularity.

You can of course simply pass an entire configuration file to the container at run-time, with the configuration in YAML, JSON, TOML, XML, INI or whatever the application prefers. The disadvantage here is loss of granularity. If you set, for instance, a database password in a YAML file, then changing it means replacing the entire YAML file. If you have that database password defined at some central place in your infrastructure (because multiple applications share the same password, for instance), you ideally want it propagated to that YAML file automatically, so you only have to define it once and not duplicate it. So the ideal solution would be to build your final YAML configuration using a simple templating mechanism at container run-time (something as simple and lightweight as envsubst will do the job).
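As a rough illustration of that templating step, the sketch below renders a YAML template from environment variables using `string.Template` from the Python standard library. This is a stand-in for envsubst, not envsubst itself: unlike envsubst, `substitute` raises a `KeyError` for an undefined variable instead of silently inserting an empty string. The template contents and variable names are hypothetical.

```python
import os
from string import Template

# Hypothetical template; in practice this would be a file shipped in the image.
CONFIG_TEMPLATE = """\
database:
  host: ${APP_DATABASE_HOST}
  password: ${APP_DATABASE_PASSWORD}
logging:
  level: info
"""


def render_config(template_text: str, environ=os.environ) -> str:
    """Fill ${NAME} placeholders from the environment.

    Raises KeyError if a referenced variable is not set, which differs
    from envsubst (envsubst would substitute an empty string).
    """
    return Template(template_text).substitute(environ)


if __name__ == "__main__":
    # At container start-up the rendered text would be written to the
    # location where the application expects its configuration file.
    print(render_config(CONFIG_TEMPLATE))
```

The structured file stays under the application's control, while the operator only deals with individual key/value pairs.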

> Either one deploys to environment x with different values for the app's env vars OR one deploys to environment x with different values of the app's config file... (seems the same to me)

Yes, but what if a value is shared between environments and applications? It's a matter of granularity, and also a matter of providing 'operators' with a single paradigm to specify configuration key/value pairs, an abstraction over the actual configuration files.

Does this make some sense? I'll think about how we can clarify this in the text.

@ddeboer I also wonder what your perspective on this is. (same for @mmisworking but I think he's on holiday).

proycon commented 2 years ago

@jblom @ddeboer I just proposed a change to the text in https://github.com/CLARIAH/clariah-plus/pull/103

ddeboer commented 2 years ago

@proycon That absolutely makes sense: while env vars are not ideal for highly structured (hierarchical) configuration values, they offer a universal interface (enabling decoupling) between infrastructure on the one hand and application software on the other. If applications depend on specific configuration file structures, that distinction gets muddled.

@jblom Thanks for your feedback. Just to be clear: only configuration values that differ between environments (such as database connection secrets) need to be provided by the infrastructure. Configuration that is environment-independent can just be checked into your software application repository.
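One way to picture this split is the sketch below: environment-independent defaults live in the repository, and the infrastructure overrides only the values that differ per environment. The keys, the `APP_` prefix and the function name are assumptions made for the example, not something the requirements prescribe.

```python
import os

# Environment-independent defaults, checked into the repository (hypothetical keys).
DEFAULTS = {
    "page_size": "50",
    "log_format": "json",
    "database_host": "localhost",
}

# Hypothetical naming convention: key "database_host" maps to APP_DATABASE_HOST.
ENV_PREFIX = "APP_"


def effective_config(environ=os.environ) -> dict:
    """Return the defaults, overridden by any matching environment variables."""
    config = dict(DEFAULTS)
    for key in DEFAULTS:
        env_name = ENV_PREFIX + key.upper()
        if env_name in environ:
            config[key] = environ[env_name]
    return config
```

For example, a production deployment might set only `APP_DATABASE_HOST`, leaving `page_size` and `log_format` at their checked-in values.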

proycon commented 2 years ago

Some discussion points raised by @tvermaut:

There is some inconsistency in the vocabulary used in the two documents: the software requirements strictly use RFC 2119, while the infrastructure requirements are worded slightly differently (due to the different origins of the documents). This needn't be a problem, though.
SR 13.7 (regarding privacy and trackers) may conflict with IR-23, which weakly suggests CDNs. I propose scrapping the point about CDNs from the IR (that's more of a software thing anyway).
There's a fair degree of duplication between the documents (on purpose), but perhaps some of the points in SR-2 overlap a bit too much with the more specific software metadata requirements, which are in a separate document.

proycon commented 1 year ago

Closing this now; these were adopted by the board a while back.

CLARIAH / clariah-plus

Formally adopting Software & Infrastructure requirements (request for comment) #102