Closed stuartmcalpine closed 1 month ago
I'm starting to formulate a design for this. But first I'd like to see changes in the way the production dataset is created and handled. It's a separate issue, but related in that dealing with these legacy datasets will be our first serious use of production. I think we need another db account, say `production_rw`, which is used to create the production db and is the only account to have write privileges there. `reg_reader` - and perhaps also `reg_writer` - should be given read access at creation time, just as `reg_reader` is given read privileges now.
At NERSC, production datasets will be under the existing shared area, `/global/cfs/cdirs/lsst/shared`. `root_dir` for those accessing only the production db would be `/global/cfs/cdirs/lsst/shared`. I guess we could make a symlink `/global/cfs/cdirs/lsst/shared/production --> /global/cfs/cdirs/lsst/shared`. For a non-production database, if it has default `root_dir = /something/root_dir`, there needs to be a symlink `/something/root_dir/production --> /global/cfs/cdirs/lsst/shared/`.

(This is particularly clumsy because there already is a `/global/cfs/cdirs/lsst/production`. And although there isn't currently a `/global/cfs/cdirs/lsst/shared/production`, it doesn't seem quite right for us to usurp this path just for the registry. Maybe we should change the owner_type name -- or at least the corresponding subdirectory name -- from `production` to, e.g., `dataregistry_production` to avoid name conflicts.)
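The symlink layout described above could be set up along these lines. This is only a sketch: the `dataregistry_production` name is the tentative suggestion from the paragraph above, and the demo uses temporary directories rather than the real NERSC paths.

```python
import os
import tempfile

# Stand-ins for the real locations (hypothetical; see discussion above):
root_dir = tempfile.mkdtemp()   # plays the role of /something/root_dir
shared = tempfile.mkdtemp()     # plays the role of the shared production area

# Point <root_dir>/dataregistry_production at the shared production area,
# so a non-production registry resolves production datasets transparently.
link = os.path.join(root_dir, "dataregistry_production")
os.symlink(shared, link)

print(os.path.islink(link))                                 # True
print(os.path.realpath(link) == os.path.realpath(shared))   # True
```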
Assuming something like the above has been done, adding a new dataset for an existing dataset known to GCRCatalogs is straightforward for "simple" (explained below) catalogs: call `dataset.register` as usual with `old_location` set to None, `name` equal to the GCRCatalogs name (basename of its config file, not including ".yaml"), and `access_API` set to "GCRCatalogs". The value for `access_API_configuration` for existing datasets is deducible. Values for some other parameters (including at least `relative_path` and `description`) might be retrievable from the config, but there is no guarantee since there is hardly anything fixed about the format of a GCRCatalogs config file.

Since GCRCatalogs has no uniform way to specify dataset version, I think we (that is, a function called something like `dataset.register_gcr_catalog`) can by default set the version string to, e.g., '1.0.0' but allow the caller to override.
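The conventions above could look roughly like this. A minimal sketch only: `register_gcr_catalog` is a proposed name, and `gcr_register_kwargs` is a hypothetical helper assembling the keyword arguments it might pass on; the real `dataset.register` signature may differ.

```python
import os

def gcr_register_kwargs(config_path, version=None, **overrides):
    """Assemble kwargs for a hypothetical dataset.register_gcr_catalog,
    following the conventions proposed above (a sketch, not the real API)."""
    kwargs = {
        # name = basename of the GCRCatalogs config file, without ".yaml"
        "name": os.path.splitext(os.path.basename(config_path))[0],
        "old_location": None,        # the data is already in place
        "access_API": "GCRCatalogs",
        # GCRCatalogs has no uniform version notion: default, allow override
        "version": version or "1.0.0",
    }
    kwargs.update(overrides)
    return kwargs

kw = gcr_register_kwargs("/some/path/cosmoDC2_v1.1.4_small.yaml")
print(kw["name"], kw["version"])   # cosmoDC2_v1.1.4_small 1.0.0
```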
By a "simple" catalog I mean one whose config has a value corresponding to `relative_path`. There are other kinds of catalogs: catalogs based on another config (e.g., only including a subset of the data), catalogs which are aliases for some other catalogs, and catalogs which are composites, essentially joining two or more simple catalogs. For the aliases we can probably just use our `dataset_alias` table. A catalog "based on" another catalog should be tractable. I'm not sure about the composites. Unfortunately for us they're quite useful, so we'll have to come up with something.
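For the alias case, the lookup would amount to following a chain of `dataset_alias` entries until a concrete dataset is reached. A pure-Python sketch, with a plain dict standing in for the `dataset_alias` table:

```python
def resolve_alias(name, aliases, max_depth=32):
    """Follow a chain of aliases (each referring to a dataset or to
    another alias) until a concrete dataset name is reached.
    `aliases` is a dict standing in for the dataset_alias table."""
    seen = set()
    while name in aliases:
        if name in seen or len(seen) >= max_depth:
            raise ValueError(f"alias cycle or chain too long at {name!r}")
        seen.add(name)
        name = aliases[name]
    return name

aliases = {"latest": "v2_alias", "v2_alias": "my_catalog_v2"}
print(resolve_alias("latest", aliases))   # my_catalog_v2
```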
Ideally `GCRCatalogs.load_catalog(catalog_name)` should then be able to look up the registry entry, retrieve the contents of `access_API_configuration`, and go on its merry way.
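That lookup might go something like the sketch below. The `registry_lookup` callable and the entry keys are assumptions standing in for a real registry query; only the `access_API` / `access_API_configuration` column names come from the discussion above.

```python
def config_via_registry(catalog_name, registry_lookup):
    """Sketch of how GCRCatalogs.load_catalog might consult the data
    registry: look up the entry by name and, if it was registered with
    access_API == "GCRCatalogs", return its stored configuration for
    the existing loader to consume."""
    entry = registry_lookup(catalog_name)
    if entry is None or entry.get("access_API") != "GCRCatalogs":
        raise ValueError(f"{catalog_name!r} was not registered for GCRCatalogs")
    return entry["access_API_configuration"]

# Toy stand-in for a registry query:
fake_registry = {
    "my_catalog": {
        "access_API": "GCRCatalogs",
        "access_API_configuration": {"subclass_name": "dummy"},
    },
}
print(config_via_registry("my_catalog", fake_registry.get))
```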
Noting that, as @stuartmcalpine suggested, some notion of "collection", similar to the concept in Rucio, may help with composite catalogs.
@yymao your thoughts on this issue would be most welcome!
To handle the various forms of catalog references GCRCatalogs supports, we should add a couple of features to dataregistry:

- `dataset_alias`: some enhancements to the code registering dataset aliases. Each alias will refer either to a dataset or to another alias.
- `dataset`: an entry which does little more than store the GCRCatalogs config file. Setting `location_type` to "dummy" might do.

Pretty much everything was addressed by PR #133
Can we replace aspects of GCRCatalogs with the registry?
For example, people would register the GCRCatalogs configuration files into the registry rather than hard-coding them into the GCRCatalogs code.
Can we scrape these config files in a useful way to make ingesting them a bit easier?
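A best-effort scrape could look like the sketch below. Since, as noted above, hardly anything is fixed about the GCRCatalogs config format, every key is optional and the key names tried here are guesses, not a schema.

```python
def scrape_gcr_config(config):
    """Best-effort extraction of registry fields from a GCRCatalogs-style
    config dict. The candidate key names are guesses; missing keys are
    simply skipped rather than treated as errors."""
    scraped = {}
    # A path-like value, if present, could seed relative_path:
    for key in ("base_dir", "root_dir", "catalog_root_dir"):
        if key in config:
            scraped["relative_path"] = config[key]
            break
    if "description" in config:
        scraped["description"] = config["description"]
    return scraped

print(scrape_gcr_config({"base_dir": "cosmodc2/v1", "description": "test"}))
# {'relative_path': 'cosmodc2/v1', 'description': 'test'}
```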