NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

proto-RFC: S3 folder structure on edm-publishing #807

Closed alexrichey closed 1 week ago

alexrichey commented 3 weeks ago

The current structure of edm-publishing has Data Engineering's products at the top level. GIS stores some of theirs underneath /datasets, e.g. [image]

/datasets is mostly used as a place for Application Engineering to grab files, hence the production and staging folders underneath the products. Those are the only two folders which are used by AE, though. All of the version folders are for historic purposes only.

I'm suggesting a few things:

  1. Data Engineering should move their products under /datasets.
  2. To keep the folder structure consistent, existing version folders under a product should be moved under a package folder, leaving (typically) three total top-level folders: staging, production, package.
  3. Certain datasets are duplicated. e.g.

There is a problem of terminology. In our new vocabulary, something like Pluto isn't a dataset, it's a product. In fact, Pluto has multiple datasets (the main output, the corrections, etc.), each of which has multiple data_files. So it feels awkward to put these under a datasets/ folder when what we actually want is products/
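To make the vocabulary concrete, here is a minimal sketch of the product > dataset > data_file hierarchy as Python dataclasses. The class names (`Product`, `Dataset`, `DataFile`) and the example file names are assumptions for illustration only, not existing code in this repo.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    # A single file belonging to a dataset.
    name: str


@dataclass
class Dataset:
    # A dataset groups one or more data files.
    name: str
    data_files: list[DataFile] = field(default_factory=list)


@dataclass
class Product:
    # A product (e.g. Pluto) groups one or more datasets.
    name: str
    datasets: list[Dataset] = field(default_factory=list)


# Hypothetical example: the file names below are made up for illustration.
pluto = Product(
    name="pluto",
    datasets=[
        Dataset("main", [DataFile("pluto.csv"), DataFile("pluto.shp")]),
        Dataset("corrections", [DataFile("corrections.csv")]),
    ],
)
```

Under this model, a top-level `products/` folder would hold one folder per `Product`, which is why `datasets/` reads awkwardly as the parent.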

Post-Edit: However, all this might take a while. As we start implementing distribution from packaging folders, I think it makes the most sense to distribute from:

  1. package folders underneath Data Engineering products. e.g. db-facilities/package/
  2. the version folders in /products/ for GIS data.

fvankrieken commented 3 weeks ago

It's also worth noting the functionality in the edm-data-operations repo: it scans all datasets in edm-publishing/datasets, looks for differences between the staging and production folders, and has functionality for them to promote datasets from staging to production (and off they go from there to wherever).
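For anyone unfamiliar with that workflow, the comparison step could be sketched roughly as below. This is not the code from edm-data-operations: the function name, the `datasets/{dataset}/{staging|production}/{file}` key layout, and the use of a per-file fingerprint (such as an ETag) are all assumptions made for illustration.

```python
def datasets_needing_promotion(
    objects: dict[str, str], prefix: str = "datasets/"
) -> list[str]:
    """Return dataset names whose staging files differ from production.

    `objects` maps an object key to a content fingerprint (e.g. an ETag).
    Keys are assumed to look like datasets/{dataset}/{folder}/{file}.
    """
    staging: dict[str, dict[str, str]] = {}
    production: dict[str, dict[str, str]] = {}
    for key, fingerprint in objects.items():
        if not key.startswith(prefix):
            continue  # e.g. DE products at the top level of the bucket
        parts = key[len(prefix):].split("/", 2)
        if len(parts) != 3:
            continue  # not of the form {dataset}/{folder}/{file}
        dataset, folder, filename = parts
        if folder == "staging":
            staging.setdefault(dataset, {})[filename] = fingerprint
        elif folder == "production":
            production.setdefault(dataset, {})[filename] = fingerprint
    # A dataset needs promotion when its staging files are not identical
    # to its production files (including when production is empty).
    return sorted(
        name
        for name, files in staging.items()
        if files != production.get(name, {})
    )
```

Version folders are ignored here on purpose, since only staging and production matter for promotion.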

damonmcc commented 3 weeks ago

  1. Data Engineering should move their products under /datasets

I like it. And to @fvankrieken's mention of the edm-data-operations repo, maybe we could use that repo for the GIS QA issues proposed in RFC: DE<>GIS Dataset Review.

If we did that, we could standardize how all datasets produced by GDE are QA'd!

But if it helps us start the migration, maybe we could put all of ours in a new /de_datasets

alexrichey commented 3 weeks ago

Also, one wrinkle... I suppose ours are products, and what's under datasets are actual datasets. (In the sense that a product can contain multiple datasets.)

damonmcc commented 3 weeks ago

  2. To keep the folder structure consistent, existing version folders under a product should be moved under a package folder, leaving (typically) three total top-level folders: staging, production, package.

this feels like a mix of GIS, DE, and AE things that might be tough to all have at the same level, but good to all be in a dataset's folder. also leaves out draft?

damonmcc commented 3 weeks ago

Also, one wrinkle... I suppose ours are products, and what's under datasets are actual datasets. (In the sense that a product can contain multiple datasets.)

which of our products have multiple datasets? I'm reluctant to lean into the word product or say DE's outputs are in a class of their own. when we talk about these things that we all make we say datasets (CPDB, PLUTO, MIH, etc.), doesn't seem so bad to align with how we talk about them

update: PLUTO Change File is an example of a distinct dataset within the PLUTO product

croswell81 commented 2 weeks ago

Before any final decisions are made, I'd like both the GIS team and DE to sit down and walk through the different proposals. I think the /datasets folder should remain for just the final, post-QA data that is used for publishing. Whether that ends up including a package folder for each dataset within /datasets, or we create a completely new space (original proposal) and eventually point AE to the equivalent staging/production folders there, can be decided at our meeting.

alexrichey commented 1 week ago

Closing after discussion with GIS + DE.

We're going to add a new folder to DE called product_datasets, which is where we'll put all packaged files for all datasets distributed to Socrata. The existing datasets folder is really more of an eventual distribution target (and really should be thought of as something like application_datasets for our purposes).

The new structure will be: product_datasets/{product}/package/{version}/{dataset}/{dataset files}

Like so, using Facilities and LION as examples: [image]
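As a rough illustration of the agreed layout, a key for a packaged file could be built like this. The helper name `package_key`, the validation rule, and the version, dataset, and file names in the usage example are hypothetical; only the folder order comes from the structure described above.

```python
from pathlib import PurePosixPath


def package_key(product: str, version: str, dataset: str, filename: str) -> str:
    """Build product_datasets/{product}/package/{version}/{dataset}/{filename}."""
    for part in (product, version, dataset, filename):
        # Reject empty segments or embedded slashes so the depth stays fixed.
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return str(
        PurePosixPath("product_datasets", product, "package", version, dataset, filename)
    )


# Hypothetical version, dataset, and file names, for illustration only.
example = package_key("db-facilities", "24v1", "facilities", "facilities.csv")
```

Here `example` is `product_datasets/db-facilities/package/24v1/facilities/facilities.csv`. Building keys in one place keeps the folder depth consistent across products.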

cc @damonmcc @fvankrieken