fmigneault commented 1 year ago

Summary

The intent of this issue is to document the recommended approach for future users, so that we reduce the number of unexpected override combinations they might attempt and then ask us to support.

Description

Using EXTRA_CONF_DIRS, it is possible to apply additional configurations on top of DEFAULT_CONF_DIRS. The activated components are then the union DEFAULT_CONF_DIRS | EXTRA_CONF_DIRS. However, this still forces the user to have, at least, the DEFAULT_CONF_DIRS set of components enabled. Technically, it would be perfectly possible to override DEFAULT_CONF_DIRS to start the instance with an even smaller subset than the proposed default services.

The recommended approach to do so should be better documented in: https://github.com/bird-house/birdhouse-deploy/blob/2b344d3383d8c0cd3f4cc4a0bf4266e3bae06a47/birdhouse/env.local.example#L45-L47
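As a rough illustration of the super-set behaviour described above, the sketch below merges the two lists while dropping duplicates. The `merge_conf_dirs` helper is hypothetical and is not how birdhouse-deploy resolves components; the directory values are examples only.

```shell
# Hypothetical helper (not part of birdhouse-deploy): print the union of
# two whitespace-separated lists of component directories, keeping the
# first occurrence of each entry and preserving order.
merge_conf_dirs() {
    seen=""
    for dir in $1 $2; do
        case " ${seen} " in
            *" ${dir} "*) ;;    # already listed, skip the duplicate
            *) seen="${seen:+${seen} }${dir}" ;;
        esac
    done
    printf '%s\n' "${seen}"
}

# Example values only; a real env.local would define its own lists.
# Here DEFAULT_CONF_DIRS is overridden with a smaller subset.
DEFAULT_CONF_DIRS='
  ./config/proxy
  ./config/magpie
'
EXTRA_CONF_DIRS='
  ./components/weaver
  ./config/magpie
'

ALL_CONF_DIRS="$(merge_conf_dirs "${DEFAULT_CONF_DIRS}" "${EXTRA_CONF_DIRS}")"
echo "${ALL_CONF_DIRS}"
```

The duplicated `./config/magpie` entry appears only once in the merged list, which is the behaviour one would expect from a set union.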

Also, there might be a need to document a "minimal set" of dependencies (i.e.: what must absolutely be defined for the instance to start without error). Notably, proxy (to have nginx) and "some service" that overrides the root location would be needed. Maybe more? https://github.com/bird-house/birdhouse-deploy/blob/2b344d3383d8c0cd3f4cc4a0bf4266e3bae06a47/birdhouse/env.local.example#L216-L219

This "minimal set" could be simply documented, and it would be up to the new node maintainers to keep them in DEFAULT_CONF_DIRS, or we could be more proactive and make another MINIMAL_CONF_DIRS variable (or some other method...?).
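One way the more proactive option could look is sketched below, assuming a MINIMAL_CONF_DIRS variable that holds bare component names. The `missing_minimal_components` function and the placeholder values are hypothetical; nothing like this exists in birdhouse-deploy at the time of writing.

```shell
# Hypothetical check (not part of birdhouse-deploy): list the components
# named in the minimal set that are absent from the enabled directories.
# Components are compared by directory basename, so ./config/proxy and
# ./components/proxy both count as "proxy".
missing_minimal_components() {
    minimal="$1"
    enabled="$2"
    missing=""
    for required in ${minimal}; do
        found=no
        for dir in ${enabled}; do
            if [ "${dir##*/}" = "${required}" ]; then
                found=yes
                break
            fi
        done
        if [ "${found}" = no ]; then
            missing="${missing:+${missing} }${required}"
        fi
    done
    printf '%s\n' "${missing}"
}

# Placeholder minimal set, for illustration only.
MINIMAL_CONF_DIRS='proxy magpie'
DEFAULT_CONF_DIRS='
  ./config/proxy
  ./config/thredds
'

MISSING="$(missing_minimal_components "${MINIMAL_CONF_DIRS}" "${DEFAULT_CONF_DIRS}")"
if [ -n "${MISSING}" ]; then
    echo "Missing required components: ${MISSING}"
fi
```

Such a check could warn rather than fail, leaving node maintainers in control of what they enable.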

mishaschwartz commented 1 year ago

Some things we've discussed in the past:

it should be possible to have a "data-only" or "compute-only" node if desired
every node should have a catalog (stac) even if it's not hosting any data itself
authentication/authorization should be mandatory (even if the node maintainer decides to open up permissions)

My initial thought, then, is that the following should be part of the minimal components:

proxy
magpie
twitcher
stac

Then we also have the question of cowbird, even though it is not strictly necessary in all cases, it might be worth it to add it to the list just to keep the node set-up consistent. I'm not sure about this though

tlvu commented 1 year ago

Then we also have the question of cowbird, even though it is not strictly necessary in all cases, it might be worth it to add it to the list just to keep the node set-up consistent. I'm not sure about this though

From my limited understanding of what Cowbird does, I think it allows Jupyter users to share notebooks with other Jupyter users via dynamically created symlinks on the recipient side. Therefore, it is not such a "generic" component. I might be wrong, please correct me on this.

Back to the primary point about documenting the minimal set, I would just put them in DEFAULT_CONF_DIRS and document how to override it to keep it simple.

Should we finalize the move of everything under config/ to components/ before officially documenting this to avoid incompatible path changing in the config?

fmigneault commented 1 year ago

Cowbird syncs permissions of corresponding resources between services. For example, the wpsoutputs files accessible through THREDDS, the same files downloaded through /wpsoutputs HTTP endpoint, and exposing them to the relevant user-workspace in Jupyter as applicable.

Considering that many services and files assumed public access before, Cowbird didn't accomplish much more, which is why it may not have felt that important so far. However, as soon as you toggle that public switch, things are no longer properly accessible if Cowbird is not involved.

fmigneault commented 1 year ago

Back to the primary point about documenting the minimal set, I would just put them in DEFAULT_CONF_DIRS and document how to override it to keep it simple.

Should we finalize the move of everything under config/ to components/ before officially documenting this to avoid incompatible path changing in the config?

Sounds good.

fmigneault commented 1 year ago

Another item discussed in today's executive committee meeting was to also consider weaver part of the minimum set that forms a DACCS node.

The reasoning behind this is mostly that it is the only current service in birdhouse-deploy that can perform federated operations, namely, dispatching processing steps to various DACCS nodes in a network, according to the specified data-source URLs.

This goes well with STAC, once https://github.com/bird-house/birdhouse-deploy/pull/297 is completed, which can offer a federated catalogue to search for data over the DACCS network, provided that the STAC populator behind it gradually syncs available metadata between the nodes.

It is expected in the long run that any WPS output produced by Weaver (and therefore all other WPS birds of each node since it can wrap their processing monitoring) would be inserted into the STAC catalogue.

mishaschwartz commented 1 year ago

If we say that weaver is required then we're essentially saying that you can't have a data-only node.

I am in favour of saying that weaver is required if there are any wps services so that weaver can wrap their processes.

mishaschwartz commented 1 year ago

Maybe a better way of thinking about this is in terms of component dependencies. Here is a sketch of what I imagine we should do with the stack:

Minimal components required in every deployment:

proxy
magpie/twitcher
stac

Cowbird is required if:

both jupyterhub and weaver are enabled: to create/manage the user workspace
both thredds and geoserver are enabled: to sync permissions

If any WPS or weaver components are enabled, then thredds is required (to serve wps outputs).

If any WPS components are enabled, then weaver is required to wrap their services.

fmigneault commented 1 year ago

both jupyterhub and weaver are enabled: to create/manage the user workspace

Jupyter by itself is sufficient to have cowbird active. It could provide a public wpsoutputs directory from any WPS service.

If any WPS or weaver components are enabled, then thredds is required (to serve wps outputs).

This is not true. The WPS outputs are accessible by themselves on /wpsoutputs. THREDDS adds another way to access them.

mishaschwartz commented 1 year ago

Ok so to update my understanding of component dependencies:

weaver is required if:
- there are any WPS services
cowbird is required if:
- jupyterhub is enabled OR
- thredds and geoserver are enabled
thredds is always optional (nothing depends on thredds)

Am I missing anything?
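The rules as summarized above can be sketched as a small check. The helper names are hypothetical, and the list of WPS services (finch, raven, hummingbird) is an assumption made for the example; this is not an existing birdhouse-deploy feature.

```shell
# Hypothetical helpers (not part of birdhouse-deploy) encoding the rules
# listed above. Components are given as bare names, e.g. "thredds".
has_component() {
    # $1: name to look for, $2: whitespace-separated enabled components
    case " $(echo $2) " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

implied_components() {
    enabled="$1"
    implied=""
    # Assumption: finch, raven and hummingbird are the WPS services.
    for wps in finch raven hummingbird; do
        if has_component "${wps}" "${enabled}"; then
            implied="weaver"    # weaver wraps any WPS service
            break
        fi
    done
    # cowbird: jupyterhub alone, or thredds together with geoserver.
    if has_component jupyterhub "${enabled}" ||
       { has_component thredds "${enabled}" &&
         has_component geoserver "${enabled}"; }; then
        implied="${implied:+${implied} }cowbird"
    fi
    printf '%s\n' "${implied}"
}

IMPLIED="$(implied_components 'proxy magpie twitcher stac finch thredds')"
echo "${IMPLIED}"
```

In this example only weaver is implied, since thredds is enabled without geoserver and jupyterhub is absent.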

fmigneault commented 1 year ago

Cowbird could be needed for WPS outputs if WPS services are enabled as well. Possibly also if STAC is enabled to provide more syncs with other services, though it is not yet implemented.

mishaschwartz commented 1 year ago

@fmigneault The STAC one is interesting... what still needs to be implemented to integrate STAC with cowbird?

fmigneault commented 1 year ago

Some of the items published in STAC could be NetCDF files (or others) accessible through THREDDS. Therefore, both access to those files either through STAC API or THREDDS should be aligned.

tlvu commented 1 year ago

cowbird is required if:
* jupyterhub is enabled OR
* thredds and geoserver are enabled

cowbird required if jupyterhub enabled: Agreed

cowbird required if thredds and geoserver are enabled AND one of the WPS birds is enabled: If there is no WPS bird, nothing will write to the wpsoutputs/ dir, so there is no need for cowbird to sync any permissions.

Basically, data-only nodes do not need cowbird, unless I still do not fully understand what cowbird does.

mishaschwartz commented 1 year ago

Some of the items published in STAC could be NetCDF files (or others) accessible through THREDDS. Therefore, both access to those files either through STAC API or THREDDS should be aligned.

But it sounds like if we have a data-only node that is serving data through thredds, we still need cowbird. And even if it's not serving data through thredds, but is making the data available from the "secure-data-proxy", presumably we'd still want to align permissions between stac and those permissions as well, right?

fmigneault commented 1 year ago

cowbird required if thredds and geoserver are enabled AND one of the WPS birds is enabled: If there is no WPS bird, nothing will write to the wpsoutputs/ dir, so there is no need for cowbird to sync any permissions.

Cowbird is also needed to sync permissions between GeoServer and THREDDS, even without any WPS.

tlvu commented 1 year ago

Cowbird is also needed to sync permissions between GeoServer and THREDDS, even without any WPS.

Oh I missed this. Can you remind me what GeoServer is trying to access on Thredds and vice-versa?

If only one of GeoServer or Thredds, then no need for Cowbird right?

tlvu commented 1 year ago

thredds is always optional (nothing depends on thredds)

Could we say the same for GeoServer? Don't think something depends on GeoServer.

fmigneault commented 1 year ago

Can you remind me what GeoServer is trying to access on Thredds and vice-versa?

Some shapefiles/layers that could be shared within user-workspaces. Files under paths defined by this part of the config: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/components/cowbird/config/cowbird/config.yml.template#L45-L85

If only one of GeoServer or Thredds, then no need for Cowbird right?

Correct, unless some other service needs synchronization such as in https://github.com/bird-house/birdhouse-deploy/pull/360 for WPS outputs, or some other use cases we could come up with (e.g.: MLflow instance in the works with JupyterHub that could need some user-workspace file share as well).

thredds is always optional (nothing depends on thredds)

Could we say the same for GeoServer? Don't think something depends on GeoServer.

I believe this is the case.

mishaschwartz commented 1 year ago

I think we're going in circles here. Let me talk about this a different way:

STAC is always required and if a node provides any data then cowbird is required to synchronize permissions between STAC and the service that provides the data (Thredds, Geoserver, secure-data-proxy)
A node that provides any computation services will either provide:
- jupyterhub (requires cowbird)
- a WPS service or weaver (requires cowbird)

So in every configuration of the node, cowbird is required.

If you can think of a configuration where cowbird is not required, please let us know

fmigneault commented 1 year ago

STAC is always required and if a node provides any data then cowbird is required to synchronize permissions between STAC and the service that provides the data (Thredds, Geoserver, secure-data-proxy)

Yes. but 😅
STAC is not required to be limited to local data. It could technically only contain references to external sources (eg: CMIP6 dataset), which would not require Cowbird since there would not be any synchronization needed.

However, I believe this is an edge case, and it is safe to assume that STAC would also refer to local data provided by another service of the same instance.

Another situation that could make Cowbird unnecessary is if the instance is configured to be fully open with public access and user workspaces are not used (e.g. a data-only node without JupyterHub). Cowbird would only create redundant permissions for users that already have public access.

Given all that, having Cowbird running in the background shouldn't pose any issue even if those use cases are encountered.

mishaschwartz commented 1 year ago

Yes. but 😅

😆

Given all that, having Cowbird running in the background shouldn't pose any issue even if those use cases are encountered.

Yeah I agree

mishaschwartz commented 1 year ago

Ok so the proposed minimal subset of required components is now:

proxy
magpie
twitcher
stac
cowbird

tlvu commented 1 year ago

Ok so the proposed minimal subset of required components is now:
* proxy
* magpie
* twitcher
* stac
* cowbird

Agreed

mishaschwartz commented 1 year ago

So based on the discussion above, I believe we've decided on the following action items:

move everything under config/ to components/
change the DEFAULT_CONF_DIRS variable to contain: proxy, magpie, twitcher, stac, cowbird

I feel like this should all be done in one PR (or all in the same version update) since this will require all current deployments to make some manual changes to their env.local files

fmigneault commented 1 year ago

change the DEFAULT_CONF_DIRS variable to contain: proxy, magpie, twitcher, stac, cowbird

I'm not quite sure how to handle this one. On one hand, I like the idea of "default" being the minimal set of components that constitute a "valid" node. However, I prefer to preserve backward compatibility, which requires Jupyter to be available (over STAC) to avoid breaking CRIM's, Ouranos', and maybe other servers, that assume this is active by default.

mishaschwartz commented 1 year ago

I think that this is necessarily going to be a breaking change and will require current deployments to update EXTRA_CONF_DIRS to include components that used to be in DEFAULT_CONF_DIRS if they want to keep their deployment as is.

If we want to ease the transition, we can create a migration script that can be run to update env.local files to automatically update the relevant variables. We can even configure it to run as part of pavics-compose.sh so that it doesn't require any user intervention.

tlvu commented 1 year ago

If we want to ease the transition, we can create a migration script that can be run to update env.local files to automatically update the relevant variables. We can even configure it to run as part of pavics-compose.sh so that it doesn't require any user intervention.

But then the migration script will be enabled by default? So by default all the current enabled components will still be enabled even if they are not in DEFAULT_CONF_DIRS anymore?

If each of us has to manually edit each env.local, for each existing deployment, to activate this migration script, then I have an easier proposition:

Given that the dirs listed in EXTRA_CONF_DIRS are not required to exist, each of us can edit all the existing env.local files in advance and add the new paths. So the day the PR that renames them is merged, all the env.local files are already ready.

Basically, instead of editing each existing env.local to activate the migration script, edit each existing env.local to put all the new component names in advance.

That "rename PR" has to wait for each org to approve, saying "I have prepared all my env.local files already", before merging.

mishaschwartz commented 1 year ago

@tlvu

Sure, if you think that it's easier to coordinate all of the existing deployments that works too.

tlvu commented 1 year ago

Sure, if you think that it's easier to coordinate all of the existing deployments that works too.

I think it's just simpler: no migration script to write, and the same effort of searching and editing all existing env.local files of all existing deployments.

Otherwise, if the migration script is activated by default, it means the same components will still be deployed by default, which defeats the purpose of moving them out of DEFAULT_CONF_DIRS in the first place.

To further ease the editing of the various env.local files, a comment can be added to env.local.example for the variable EXTRA_CONF_DIRS, listing all dirs that would replicate the old deployment stack.

fmigneault commented 1 year ago

I think that this is necessarily going to be a breaking change

I disagree. This repository is not exclusively for DACCS/Marble nodes. I don't think there is any technical issue in this case that forces us to cause major/breaking changes.

If we want to ease the transition, we can create a migration script that can be run to update env.local files to automatically update the relevant variables.

I don't like this idea. The user should be in control of what they enable. This can cause undesired side-effects.

Given that the dirs listed in EXTRA_CONF_DIRS are not required to exist, each of us can edit all the existing env.local files in advance and add the new paths. So the day the PR that renames them is merged, all the env.local files are already ready.

Expanding on that, I think this is the key.

We only need to make sure that DEFAULT_CONF_DIRS is applied only when EXTRA_CONF_DIRS is not defined or is empty. In other words, export EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:-${DEFAULT_CONF_DIRS}}" should be set only after parsing env.local.

What we need to watch for is the order defined here: https://github.com/bird-house/birdhouse-deploy/blob/5c06b4bd3a1183bc767bcf1937d54015b25609af/birdhouse/read-configs.include.sh#L252-L254 We must make sure not to resolve export EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:-${DEFAULT_CONF_DIRS}}" before env.local has had the chance to be evaluated. Therefore, this definition cannot be directly in default.env.

If evaluated in the right order, existing instances that already override EXTRA_CONF_DIRS will remain intact even with the introduction of this default list of components.
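The ordering constraint can be illustrated with a minimal sketch. The function names and the simulated file contents are hypothetical stand-ins for default.env and env.local; only the parameter expansion itself is taken from the comment above.

```shell
# Stand-in for default.env: defines the default list only.
load_defaults() {
    DEFAULT_CONF_DIRS='./config/proxy ./config/magpie'
}

# Must run after env.local has been read, so that a user-provided
# EXTRA_CONF_DIRS is never replaced by the defaults.
apply_fallback() {
    EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:-${DEFAULT_CONF_DIRS}}"
}

# Case 1: env.local does not set EXTRA_CONF_DIRS.
unset EXTRA_CONF_DIRS
load_defaults
apply_fallback
CASE_UNSET="${EXTRA_CONF_DIRS}"

# Case 2: env.local sets EXTRA_CONF_DIRS before the fallback is applied.
unset EXTRA_CONF_DIRS
load_defaults
EXTRA_CONF_DIRS='./components/weaver'    # stand-in for sourcing env.local
apply_fallback
CASE_SET="${EXTRA_CONF_DIRS}"

echo "unset in env.local: ${CASE_UNSET}"
echo "set in env.local:   ${CASE_SET}"
```

Because `:-` substitutes the default when the variable is unset or empty, an explicitly empty EXTRA_CONF_DIRS in env.local would also fall back to the defaults.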

mishaschwartz commented 1 year ago

In other words, export EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:-${DEFAULT_CONF_DIRS}}" should be set only after parsing env.local.

I don't see how making DEFAULT_CONF_DIRS the default for EXTRA_CONF_DIRS helps in this case.

I think what @tlvu is suggesting is that we just ask everyone to copy or move some lines from DEFAULT_CONF_DIRS to EXTRA_CONF_DIRS before they update to the new version

mishaschwartz commented 1 year ago

Let me give an example so that we're sure we're talking about the same thing...

BEFORE:

export DEFAULT_CONF_DIRS='
  ./config/proxy
  ./config/canarie-api
  ./config/geoserver
  ./config/finch
  ./config/raven
  ./config/hummingbird
  ./config/thredds
  ./config/portainer
  ./config/magpie
  ./config/twitcher
  ./config/jupyterhub
'

export EXTRA_CONF_DIRS='
  ./components/monitoring
  ./components/cowbird
  ./components/weaver
'

AFTER:

export DEFAULT_CONF_DIRS='
  ./config/proxy
  ./config/magpie
  ./config/twitcher
  ./components/stac
  ./components/cowbird
'

export EXTRA_CONF_DIRS='
  ./config/canarie-api
  ./config/geoserver
  ./config/finch
  ./config/raven
  ./config/hummingbird
  ./config/thredds
  ./config/portainer
  ./config/jupyterhub
  ./components/monitoring
  ./components/weaver
'

A deployment that has the configuration in the BEFORE section can manually edit EXTRA_CONF_DIRS so that it looks like the one in the AFTER section without any major change to the services that their deployment offers
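To double-check that the two configurations above enable the same services, one could compare component names by directory basename. The `new_components` helper below is a hypothetical sketch, not part of birdhouse-deploy; the lists are the combined DEFAULT_CONF_DIRS and EXTRA_CONF_DIRS entries from the BEFORE and AFTER examples.

```shell
# Hypothetical helper: print component names (directory basenames) that are
# present in the second list but not in the first.
new_components() {
    old_names=" "
    for dir in $1; do
        old_names="${old_names}${dir##*/} "
    done
    added=""
    for dir in $2; do
        name="${dir##*/}"
        case "${old_names}" in
            *" ${name} "*) ;;    # already enabled before
            *) added="${added:+${added} }${name}" ;;
        esac
    done
    printf '%s\n' "${added}"
}

# DEFAULT_CONF_DIRS followed by EXTRA_CONF_DIRS, as in the BEFORE example.
BEFORE_DIRS='
  ./config/proxy ./config/canarie-api ./config/geoserver ./config/finch
  ./config/raven ./config/hummingbird ./config/thredds ./config/portainer
  ./config/magpie ./config/twitcher ./config/jupyterhub
  ./components/monitoring ./components/cowbird ./components/weaver
'
# DEFAULT_CONF_DIRS followed by EXTRA_CONF_DIRS, as in the AFTER example.
AFTER_DIRS='
  ./config/proxy ./config/magpie ./config/twitcher
  ./components/stac ./components/cowbird
  ./config/canarie-api ./config/geoserver ./config/finch ./config/raven
  ./config/hummingbird ./config/thredds ./config/portainer
  ./config/jupyterhub ./components/monitoring ./components/weaver
'

ADDED="$(new_components "${BEFORE_DIRS}" "${AFTER_DIRS}")"
REMOVED="$(new_components "${AFTER_DIRS}" "${BEFORE_DIRS}")"
echo "added:   ${ADDED:-none}"
echo "removed: ${REMOVED:-none}"
```

With these lists, the only difference is that stac becomes enabled; nothing that was previously active is dropped.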

tlvu commented 1 year ago

A deployment that has the configuration in the BEFORE section can manually edit EXTRA_CONF_DIRS so that it looks like the one in the AFTER section without any major change to the services that their deployment offers

Exactly! That's what I have in mind.

During the same edit of my various existing env.local files, I would even also add the ./components variant (e.g.: ./config/finch, ./components/finch) to be already forward-compatible for the rename!

I think what @tlvu is suggesting is that we just ask everyone to copy or move some lines from DEFAULT_CONF_DIRS to EXTRA_CONF_DIRS before they update to the new version

Exactly. I meant for all orgs to prepare all the various existing env.local config files in advance of the rename, not to change the code.
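The forward-compatible listing relies on entries that do not exist being skipped. A minimal sketch of that idea follows, assuming a loader that ignores missing directories; the `existing_conf_dirs` function and the temporary directory layout are hypothetical and only simulate a checkout made after the rename.

```shell
# Hypothetical filter (not part of birdhouse-deploy): keep only the entries
# that exist as directories, mimicking a loader that ignores missing paths.
existing_conf_dirs() {
    kept=""
    for dir in $1; do
        if [ -d "${dir}" ]; then
            kept="${kept:+${kept} }${dir}"
        fi
    done
    printf '%s\n' "${kept}"
}

# Simulate a checkout made after the rename: only ./components/finch exists.
WORKDIR="$(mktemp -d)"
mkdir -p "${WORKDIR}/components/finch"
cd "${WORKDIR}" || exit 1

# Both the old and the new location are listed in advance.
EXTRA_CONF_DIRS='
  ./config/finch
  ./components/finch
'

RESOLVED="$(existing_conf_dirs "${EXTRA_CONF_DIRS}")"
echo "${RESOLVED}"
```

Before the rename the same env.local would resolve to `./config/finch` instead, so the file does not need to change on the day the rename is merged.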

fmigneault commented 1 year ago

@tlvu
Good idea. I'll update the configs in a similar fashion for CI instances.

fmigneault commented 1 year ago

Backward/forward-compatible config/components locations have been applied for CI instances.

bird-house / birdhouse-deploy

Document how to employ a subset of components #357
