Integration with GCRcatalogs

stuartmcalpine commented 7 months ago

Can we replace aspects of GCRcatalogs with the registry?

For example, people would register the GCRcatalogs configuration files into the registry rather than hard coding them into the GCRcatalogs code.

Can we scrape these config files in a useful way to make ingesting them a bit easier?

JoanneBogart commented 6 months ago

I'm starting to formulate a design for this. But first I'd like to see changes in the way the production dataset is created and handled. It's a separate issue, but related in that dealing with these legacy datasets will be our first serious use of production. I think we need another db account, say production_rw, which is used to create the production db and is the only account to have write privileges there. reg_reader - and perhaps also reg_writer - should be given read access at creation time, just as reg_reader is given read privileges now.

At NERSC, production datasets will be under the existing shared area, /global/cfs/cdirs/lsst/shared. root_dir for those accessing only the production db would be /global/cfs/cdirs/lsst/shared. I guess we could make a symlink /global/cfs/cdirs/lsst/shared/production --> /global/cfs/cdirs/lsst/shared. For a non-production database, if it has default root_dir = /something/root_dir, there needs to be a symlink /something/root_dir/production --> /global/cfs/cdirs/lsst/shared/. (This is particularly clumsy because there already is a /global/cfs/cdirs/lsst/production. And although there isn't currently a /global/cfs/cdirs/lsst/shared/production, it doesn't seem quite right for us to usurp this path just for the registry. Maybe we should change the owner_type name, or at least the corresponding subdirectory name, from production to, e.g., dataregistry_production to avoid name conflicts.)
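A rough sketch of the symlink arrangement described above, in case it helps the discussion. The helper name ensure_production_link and its link_name parameter are hypothetical and not part of dataregistry; the demonstration runs in a scratch directory, not on the real NERSC paths.

```python
import os
import tempfile
from pathlib import Path


def ensure_production_link(root_dir, shared_area, link_name="production"):
    """Create root_dir/<link_name> --> shared_area unless it already exists.

    Hypothetical helper: the name and signature are assumptions for
    illustration only.
    """
    link = Path(root_dir) / link_name
    target = Path(shared_area)
    if link.is_symlink():
        # Keep an existing link, but refuse one that points somewhere else.
        if link.resolve() != target.resolve():
            raise RuntimeError(f"{link} already points to {link.resolve()}")
        return link
    if link.exists():
        # A real directory or file is in the way (the name conflict noted above).
        raise RuntimeError(f"{link} exists and is not a symlink")
    os.symlink(target, link, target_is_directory=True)
    return link


# Demonstration in a scratch directory standing in for the real file system.
with tempfile.TemporaryDirectory() as scratch:
    shared = Path(scratch) / "shared"
    root = Path(scratch) / "root_dir"
    shared.mkdir()
    root.mkdir()
    link = ensure_production_link(root, shared)
    link_ok = link.is_symlink() and link.resolve() == shared.resolve()
    # Calling it a second time is a no-op and returns the same path.
    same_link = ensure_production_link(root, shared) == link
```

Passing link_name="dataregistry_production" would give the renamed subdirectory suggested above without touching the rest of the logic.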

Assuming something like the above has been done, adding a new dataset for an existing dataset known to GCRCatalogs is straightforward for "simple" (explained below) catalogs: call dataset.register as usual with old_location set to None, name equal to the GCRCatalogs name (basename of its config file, not including ".yaml"), and access_API set to GCRCatalogs. The value for access_API_configuration for existing datasets is deducible. Values for some other parameters (including at least relative_path and description) might be retrievable from the config, but there is no guarantee since there is hardly anything fixed about the format of a GCRCatalogs config file. Since GCRCatalogs has no uniform way to specify dataset version, I think we (that is, a function called something like dataset.register_gcr_catalog) can by default set the version string to, e.g., '1.0.0' but allow the caller to override.

By a "simple" catalog I mean one whose config has a value which corresponds to relative_path. There are other kinds of catalogs: catalogs based on another config (e.g., only including a subset of the data), catalogs which are aliases for some other catalogs, and catalogs which are composites, essentially joining two or more simple catalogs. For the aliases we can probably just use our dataset_alias table. A catalog "based on" another catalog should be tractable. I'm not sure about the composites. Unfortunately for us they're quite useful so we'll have to come up with something.

Ideally GCRCatalogs.load_catalog(catalog_name) should then be able to look up the registry entry, retrieve the contents of access_API_configuration and go on its merry way.
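To make the recipe for "simple" catalogs concrete, here is a minimal sketch that only collects the values listed above into a dict. The function name gcr_registration_args is hypothetical, the config file name and contents in the demonstration are placeholders, and storing the config text as access_API_configuration is an assumption. How the dict maps onto the actual dataset.register call is left open.

```python
import tempfile
from pathlib import Path


def gcr_registration_args(config_path, version="1.0.0"):
    """Collect registration values for one GCRCatalogs config file.

    Hypothetical helper: it returns a plain dict and does not call
    dataregistry at all.
    """
    config_path = Path(config_path)
    name = config_path.name
    # GCRCatalogs name: basename of the config file, without ".yaml".
    # Strip only that suffix so dots inside version-like names survive.
    if name.endswith(".yaml"):
        name = name[: -len(".yaml")]
    return {
        "name": name,
        "version": version,  # default, caller may override
        "old_location": None,
        "access_API": "GCRCatalogs",
        # Assumption: the config text itself is what gets stored.
        "access_API_configuration": config_path.read_text(),
    }


# Demonstration with a placeholder config file.
with tempfile.TemporaryDirectory() as scratch:
    cfg = Path(scratch) / "example_catalog_v1.2.yaml"
    cfg.write_text("placeholder: true\n")
    args = gcr_registration_args(cfg)
    args_v2 = gcr_registration_args(cfg, version="2.0.0")
```

relative_path and description are deliberately left out of the dict, since there is no guarantee they can be recovered from the config.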

JoanneBogart commented 6 months ago

Noting that, as @stuartmcalpine suggested, some notion of "collection", similar to the concept in Rucio, may help with composite catalogs.

JoanneBogart commented 6 months ago

@yymao your thoughts on this issue would be most welcome!

JoanneBogart commented 5 months ago

To handle the various forms of catalog references GCRCatalogs supports we should add a couple features to dataregistry:

A way for dataset aliases to reference other aliases. We need another foreign key column in dataset_alias and some enhancements to the code registering dataset aliases. Each alias will refer either to a dataset or to another alias.
A way to make an entry in dataset which does little more than store the GCRCatalogs config file. Setting location_type to "dummy" might do.
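For the first feature, following an alias that may point at another alias could look roughly like the sketch below. It uses an in-memory dict in place of the dataset_alias table. The function name resolve_alias and the ("dataset", id) / ("alias", name) encoding are assumptions for illustration, not the actual schema.

```python
def resolve_alias(alias, alias_table):
    """Follow an alias chain until it reaches a dataset id.

    alias_table maps an alias name to either ("dataset", dataset_id)
    or ("alias", other_alias_name). Hypothetical in-memory stand-in
    for the dataset_alias table.
    """
    seen = set()
    current = alias
    while True:
        if current in seen:
            # Aliases pointing at aliases make loops possible; fail loudly.
            raise ValueError(f"alias cycle detected at {current!r}")
        seen.add(current)
        kind, target = alias_table[current]  # KeyError for an unknown alias
        if kind == "dataset":
            return target
        current = target


# Demonstration: "latest" -> "v2" -> dataset 42.
table = {
    "v2": ("dataset", 42),
    "latest": ("alias", "v2"),
}
resolved = resolve_alias("latest", table)

# A two-alias loop is rejected instead of spinning forever.
looped = {"a": ("alias", "b"), "b": ("alias", "a")}
try:
    resolve_alias("a", looped)
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

The cycle check matters only once aliases can reference other aliases; with the current one-level table it can never trigger.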

JoanneBogart commented 1 month ago

Pretty much everything was addressed by PR #133

LSSTDESC / dataregistry

Integration with GCRcatalogs #106