Open charles-turner-1 opened 3 days ago
@marc-white @rbeucher what are your opinions on this?
Sure, we could add a placeholder for user-built catalogs.
What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.
What do you think?
The main issue I see is that we have to 'pre-can' all of the information about the user's potential catalog - what guarantee do we have that this catalog information will match whatever the user comes up with?
Secondly, how would a user build their catalog? Would we need to provide updates to the existing access_nri_intake_catalog build scripts to add a "custom version" option?
Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?
What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.
I'm not sure this is technically feasible - the catalog content is read from the metacatalog.csv file, so we'd have to either find some way to 'modify' that in memory, or find some other way to append additional rows to an existing catalog.
The main issue I see is that we have to 'pre-can' all of the information about the user's potential catalog - what guarantee do we have that this catalog information will match whatever the user comes up with?
I was envisaging a situation where the user would populate user_def section - we'd just be providing them an entry point to access this catalog that is distinct from the default access_nri one. I'm pretty sure that in this use case, this shouldn't be an issue. Effectively, we would just implement the changes in #244 and then direct the user with how to populate this entry point with data.
Secondly, how would a user build their catalog? Would we need to provide updates to the existing access_nri_intake_catalog build scripts to add a "custom version" option?
I think we would just leave it up to the user to build their catalog however they see fit - e.g. modifying build_all.sh etc. in order to generate a catalog. The aim of this would just be to allow them to swap back and forth between catalogs easily, and ideally helping avoid accidental use of the wrong catalog. I was fiddling with intake_dataframe_catalog this morning & didn't realise I still had ~/.access_nri_intake_catalog/catalog.yaml set. I can see this becoming a footgun.
Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?
Yeah, this is a really good point. Perhaps it would be better to direct users to load custom catalogs with intake.open_dataframe_catalog(...) somewhere in the docs.
What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog.
I'm not sure this is technically feasible - the catalog content is read from the metacatalog.csv file, so we'd have to either find some way to 'modify' that in memory, or find some other way to append additional rows to an existing catalog.
I think it might actually be plausible to do this - I think we would just have to update the intake dataframe catalog driver to support multi-file catalogs. This would look something like
sources:
  access_nri:
    args:
      columns_with_iterables:
      - model
      - realm
      - frequency
      - variable
      mode: r
      name_column: name
      path:
      - /g/data/xp65/public/apps/access-nri-intake-catalog/{{version}}/metacatalog.csv
      - $MY_EPHEMERAL_CATALOG.csv
      yaml_column: yaml
    description: ACCESS-NRI intake catalog
    driver: intake_dataframe_catalog.core.DfFileCatalog
    metadata:
      storage: gdata/fs38+gdata/oi10+gdata/tm70
      version: '{{version}}'
    parameters:
      version:
        default: v0.1.3
        description: Catalog version
        type: str
Probably it would be quite a bit more involved than that to actually implement, but I think it should be doable.
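As a rough illustration of the "append additional rows" idea, here is a minimal sketch that merges the rows of several metacatalog-style CSV files in memory using only the standard library. This is not how intake_dataframe_catalog reads its files; the helper name combine_metacatalogs and the sample rows are hypothetical, and real metacatalogs carry more columns than shown.

```python
import csv
import io


def combine_metacatalogs(csv_texts):
    """Merge rows from several CSV documents that share one header.

    Hypothetical helper: each item in csv_texts is the full text of one
    metacatalog-style CSV file. Raises ValueError on a header mismatch.
    """
    header = None
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("metacatalog columns do not match")
        rows.extend(reader)
    return header, rows


# Made-up sample content standing in for the main and ephemeral catalogs.
main = "name,model,yaml\nexpt_a,ACCESS-OM2,a.yaml\n"
extra = "name,model,yaml\nmy_expt,ACCESS-OM3,mine.yaml\n"
header, rows = combine_metacatalogs([main, extra])
print([row["name"] for row in rows])
```

Checking that the headers match before merging is one way to catch an ephemeral catalog whose columns have drifted from the main one.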
I was envisaging a situation where the user would populate user_def section - we'd just be providing them an entry point to access this catalog that is distinct from the default access_nri one. I'm pretty sure that in this use case, this shouldn't be an issue. Effectively, we would just implement the changes in #244 and then direct the user with how to populate this entry point with data.
I suppose we could tell the user to grab the 'real' catalog.yaml, put it in their home area, then populate it with their catalog info under the user_def heading? We can't let them populate the live catalog.yaml on xp65, otherwise that will affect everyone.
Yup, this is what I had in mind.
I think we would just leave it up to the user to build their catalog however they see fit - e.g. modifying build_all.sh etc. in order to generate a catalog. The aim of this would just be to allow them to swap back and forth between catalogs easily, and ideally helping avoid accidental use of the wrong catalog. I was fiddling with intake_dataframe_catalog this morning & didn't realise I still had ~/.access_nri_intake_catalog/catalog.yaml set. I can see this becoming a footgun.
Yes, the ghost local catalog concern did make me think. Do you think it's worth throwing a warning of some kind if we load the local catalog, rather than the real one? That would at least slap most users in the face and remind them.
Yeah, absolutely.
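A minimal sketch of what such a warning could look like, assuming the check runs just before the catalog is loaded. The function name warn_if_local_catalog is hypothetical and the message wording is only a suggestion; the path is the one mentioned earlier in this thread.

```python
import warnings
from pathlib import Path

# User-level catalog location discussed in this thread.
USER_CATALOG = Path.home() / ".access_nri_intake_catalog" / "catalog.yaml"


def warn_if_local_catalog(user_catalog=USER_CATALOG):
    """Emit a UserWarning and return True if a local catalog file exists."""
    if Path(user_catalog).is_file():
        warnings.warn(
            f"Loading local catalog from {user_catalog} instead of the "
            "default ACCESS-NRI catalog.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Returning a boolean as well as warning would let calling code record which catalog was picked up, which may help when debugging a "ghost" local catalog.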
Thirdly, is this actually necessary to do within the access_nri_intake_catalog ecosystem? If we have users who are advanced enough to be able to generate their own 'catalog of catalogs', either using a tool we provide or manually via intake, couldn't they load their custom catalog directly, rather than give it an access_nri alias?
Yeah, this is a really good point. Perhaps it would be better to direct users to load custom catalogs with intake.open_dataframe_catalog(...) somewhere in the docs.
I think that approach would minimize confusion between the 'real' catalog and the user's own Frankenstein's monster version, especially once users start sharing with each other (see below).
What about adding a temporary source location to the main catalog at runtime? I’m thinking of our discussion with the Ocean team today—it sounds like their new experiments will have intake-ESM catalogs. It would be good to integrate these dynamically into the main catalog. ... I'm pretty convinced this isn't a great idea. Consider the following situation:
- Researcher creates custom add-on catalog that gets patched into intake.cat.access_nri
- Researcher generates a Jupyter notebook to do some analysis on their conjoined catalog
- Researcher hands Jupyter notebook down to PhD student to work on, but (because they're a dotty researcher-type like me) forgets that they have a custom catalog squashed into intake.cat.access_nri
- PhD student gets an 'experiment not found' error from the notebook, student pings us asking why an experiment is missing from the canonical catalog
- We spend an age trying to work out why something that was in the catalog isn't any more, until we figure out it was never in the catalog to begin with
Much better, I think, to keep a clear delineation between what is canonically in (and, by exclusion, what isn't in) intake.cat.access_nri.
Yeah, that's an excellent point.
Is your feature request related to a problem? Please describe.
This builds on the solution to #191 in #243.
With the changes introduced by #243, users are able to build & query their own catalogs by placing a catalog file in $HOME/.access_nri_intake_catalog/catalog.yaml, and this will be preferentially loaded over the default catalog at /g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml.
This default catalog looks something like (NB. using the old version numbering)
and a user defined catalog will look something like
where $DIR is a directory the user has placed their metacatalog in.
These changes represent a big step forward in terms of the user's ability to use bespoke catalogs. However, the architecture of intake is such that presently, if a user wished to compare catalogs/data obtained from catalogs, it would be necessary to:
$ mv ~/.access_nri_intake_catalog/catalog.yaml ~/._access_nri_intake_catalog/catalog.yaml
This might create issues for users who wish to compare their custom catalog with the default catalog, and it can be made easier.
Describe the feature you'd like
Intake allows a single catalog to describe multiple sources: i.e., the two catalogs above could be combined as
This would then allow the user to perform the following operations:
Doing so requires an additional entry point for the user_def catalog, & so we would additionally require the following changes in pyproject.toml:
and in src/access_nri_intake/data/__init__.py
Entry points are created at package build time & fixed, so realistically we would probably have to limit users to a single user defined catalog, unless we figure out a way to do some black magic to circumvent that limitation.
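To see which entry points are actually registered in a given environment, something like the sketch below could be used. It assumes the catalogs are registered under an entry point group named intake.catalogs, which is not stated in this issue; the helper name list_entry_points is hypothetical.

```python
from importlib.metadata import entry_points


def list_entry_points(group="intake.catalogs"):
    """Return the sorted names of installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # older mapping-style return value
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Output depends on which packages are installed in the environment.
print(list_entry_points())
```

Because this list comes from installed package metadata, a new name such as user_def would only appear after the package declaring it is reinstalled, which is the limitation described above.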
Describe alternatives you've considered
Leave as is - this might be an unnecessary addition.
Additional context
See #244 for sample implementation.