Closed alexrichey closed 1 week ago
It's also worth noting the functionality in the edm-data-operations repo: this scrapes all datasets in edm-publishing/datasets and looks for differences between staging and production folders, and then has functionality for them to promote datasets from staging to production (and off they go from there to wherever).
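As a rough illustration of the staging-versus-production comparison described above, here is a minimal sketch. It is not the code in edm-data-operations; the function names and the `{relative_path: checksum}` listing format are assumptions made for the example.

```python
from typing import Dict, List


def diff_staging_production(
    staging: Dict[str, str], production: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {relative_path: checksum} listings for one dataset.

    The listing format is hypothetical; a real tool would build it by
    listing the staging and production folders in object storage.
    """
    staging_paths = set(staging)
    production_paths = set(production)
    return {
        # Files present in staging only.
        "added": sorted(staging_paths - production_paths),
        # Files present in production only.
        "removed": sorted(production_paths - staging_paths),
        # Files present in both folders whose checksums differ.
        "changed": sorted(
            path
            for path in staging_paths & production_paths
            if staging[path] != production[path]
        ),
    }


def needs_promotion(diff: Dict[str, List[str]]) -> bool:
    """A dataset is a promotion candidate if any difference was found."""
    return any(diff.values())
```

A dataset whose diff is empty in all three categories would simply be skipped; anything else would be surfaced for review before promotion.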
- Data Engineering should move their products under /datasets
I like it. And to @fvankrieken's mention of the edm-data-operations repo, maybe we could use that repo for the GIS QA issues proposed in RFC: DE<>GIS Dataset Review.
If we did that, we could standardize how all datasets produced by GDE are QAd!
But if it helps us start the migration, maybe we could put all of ours in a new /de_datasets
Also, one wrinkle... I suppose ours are products, and what's under datasets are actual datasets. (In the sense that a product can contain multiple datasets.)
- To keep the folder structure consistent, existing version folders under a product should be moved under a package folder, leaving (typically) three total top-level folders: staging, production, package.
this feels like a mix of GIS, DE, and AE things that might be tough to all have at the same level, but good to all be in a dataset's folder. also leaves out draft?
Also, one wrinkle... I suppose ours are products, and what's under datasets are actual datasets. (In the sense that a product can contain multiple datasets.)
which of our products have multiple datasets? I'm reluctant to lean into the word product or say DE's outputs are in a class of their own. when we talk about these things that we all make, we say datasets (CPDB, PLUTO, MIH, etc.). it doesn't seem so bad to align with how we talk about them
update: PLUTO Change File is an example of a distinct dataset within the PLUTO product
Before any final decisions are made, I'd like both the GIS team and DE to sit down and walk through the different proposals. I think the /datasets folder should remain for just the final, post-QA data that is used for publishing. Whether that ends up including a package folder for each dataset within /datasets, or we create a completely new space (original proposal) and eventually point AE to the equivalent staging/production folders there, can be decided at our meeting.
Closing after discussion with GIS + DE.
We're going to add a new folder to DE called product_datasets, which is where we'll put all packaged files for all datasets distributed to Socrata. The existing datasets folder is really more of an eventual distribution target (and really should be thought of as something like application_datasets for our purposes).
The new structure will be, "product_datasets" / {product} / "package" / {version} / {dataset} / dataset files
Like so, using Facilities and LION as examples.
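A minimal sketch of how a path in this layout could be assembled, assuming the segments are joined with forward slashes. The helper name `package_path` and the version, dataset, and file names in the usage note are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath

# Top-level folder named in the decision above.
ROOT = "product_datasets"


def package_path(
    product: str, version: str, dataset: str, filename: str
) -> PurePosixPath:
    """Build product_datasets/{product}/package/{version}/{dataset}/{filename}."""
    for part in (product, version, dataset, filename):
        # Reject empty segments or segments that would add extra path levels.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return PurePosixPath(ROOT, product, "package", version, dataset, filename)
```

For instance, `package_path("facilities", "24v1", "facilities", "facilities.csv")` yields `product_datasets/facilities/package/24v1/facilities/facilities.csv`, where "24v1" and the file name are made up for the example.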
cc @damonmcc @fvankrieken
The current structure of edm-publishing is to have Data Engineering's products at the top level. Then GIS stores some of theirs underneath /datasets. /datasets is mostly used as a place for Application Engineering to grab files, hence the production and staging folders underneath the products. Those are the only two folders which are used by AE, though. All of the version folders are for historic purposes only.
I'm suggesting a few things:
1) Data Engineering should move their products under /datasets
2) To keep the folder structure consistent, existing version folders under a product should be moved under a package folder, leaving (typically) three total top-level folders: staging, production, package.
3) Certain datasets are duplicated. e.g. db- prefix from our products.
There is a problem of terminology. In our new vocabulary, something like Pluto isn't a dataset, it's a product. In fact, Pluto has multiple datasets (the main output, the corrections, etc.), each of which has multiple data_files. So it feels awkward to put these under a datasets/ folder when what we actually want is products/
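The second suggestion, moving existing version folders under a package folder, amounts to a key rewrite. The sketch below shows one way that mapping might look, assuming keys have the shape product/folder/... and that any second-level folder other than staging, production, or package is a version folder. The function name and the example keys are hypothetical.

```python
from pathlib import PurePosixPath

# Second-level folders that stay where they are. Other non-version
# folders (for example a draft folder) would need to be added here.
RESERVED = {"staging", "production", "package"}


def relocate_version_key(key: str) -> str:
    """Map product/{version}/... to product/package/{version}/...

    Keys already under staging, production, or package, and keys with
    no second-level folder, are returned unchanged.
    """
    parts = PurePosixPath(key).parts
    if len(parts) < 2 or parts[1] in RESERVED:
        return key
    product, rest = parts[0], parts[1:]
    return str(PurePosixPath(product, "package", *rest))
```

With this mapping, a key such as `db-facilities/24v1/facilities.csv` becomes `db-facilities/package/24v1/facilities.csv`, while `db-facilities/staging/facilities.csv` is left alone.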
Post-Edit: However, all this might take a while. As we start implementing distribution from packaging folders, I think it makes most sense to distribute from package folders underneath Data Engineering products, e.g. db-facilities/package/