RFC: DE<>GIS Dataset Review Process + Internal Publishing Versions

alexrichey commented 3 weeks ago

Problem Statement

When DE has finished a build, we've often encountered some combination of the following problems:

It's not clear to GIS (or Amanda) what build of a dataset should be reviewed.
The status of a dataset in the publication lifecycle (being built, awaiting review, in review, approved) isn't clear.
Files for datasets that have been reviewed are potentially (maybe even often) overwritten by subsequent builds.
There's not clear lineage backwards from the published / distributed datasests (e.g. on Socrata / Bytes) back to their raw files.
There's no review history for subsequent builds and reviews of a dataset.
- e.g. say we're in our third QA cycle of PLUTO 24v1.1, where should we look to know why the second build failed QA?

Proposed Solution

For all of our products, we should add a subfolder under the version to indicate the draft publication version. The current state looks like this:

current: edm-publishing / db-pluto / publish / 24v1.1 / dataset files
proposed: edm-publishing / db-pluto / publish / 24v1.1 / draft publication version / dataset files

The draft publication version will be composed of an integer version, and a summary to describe the the build, similar to the summary line of a git commit. A list of builds versions could look like this:

1-initial-build
2-add-new-corrections
3-fix-zoning

I suggest an integer version instead of a timestamp because we don't really care when the draft was published, whereas the integer corresponds to something that we do care about. e.g. if we're in round three of PLUTO publishing, and you see that the last draft publication is 6-fix-the-issue then you immediately know something is wrong.

Draft Publication Github Issues

Our Publishing Github Action will create a Github Issue for every published build version. Decisions, discussions, etc should be documented on that issue. They should all be linked back to a parent Issue for a build of a dataset.

The Issue for the draft publication should use Github Labels to indicate the status. A list of statuses might be:

Ready
In QA
Passed
Failed

Perhaps we can auto-add all of GIS as an Assignee

Implementation Details (Technical)

in DO: Migrate everything in the publish folder to this new scheme.
in GHA: Modify our Publish action to accept a changes summary field, which will be used to generate the draft publication version. The integer part will be inferred from existing versions on DO.
in dcpy: The draft publication version concept needs to be added to the edm publish connector. Publish functionality should refuse to overwrite existing data.

Implementation Details (Nontechnical)

Within this new framework, everything under a products draft folder should be considered deletable.
Need to work with GIS to change procedures.

Other Considerations

I think it's nearly time (if not actually time) to do away with the latest folders. It's a convenient hack, but has the liability of being potentially out of sync with actual latest versions. As part of this, we could help GIS migrate off. We could either supply them python code to infer last build version, or add a REST endpoint to the QAQC app to redirect to the DO location.
We could certainly use some utilities around tidying up our edm publications bucket. e.g. CLI tooling to flag or move "misplaced" items, e.g. edm-publishing / db-pluto / 23v2 which should be (presumably) under the draft folder.
Regarding what we call the datasets in our publishing folders, is there a better term than draft publications? publication drafts? They really are just drafts of what we'll eventually publish.
What repo should Draft Publication Issues live in?

Example Workflow for GIS (Copied from @sf-dcp's comment)

Suppose we're building PLUTO v24.1.

DE creates parent issue for a dataset build manually, lets call it PLUTO v24.1. DE kicks off a build.
DE reviews internally, and publishes via the Github action (no changes to process so far)
GHA generates new Issue PLUTO v24.1 1-initial build (child issue), tagged as Ready and GIS team is added to the issue
GIS updates the child issue's tag In review: Their QA review looks good. They put tag Passed for the child issue Their QA review asks for changes --> They put tag Failed for the child issue --> They note required changes in the child issue --> we repeat the process for generating consecutive child issues.

Whiteboarding

sf-dcp commented 3 weeks ago

Nice write-up! Thank you for doing it. Quick q: we can still use draft/ dir for DE internal testing, correct? And when we intend to build-build, we will be using publish/?

sf-dcp commented 3 weeks ago

I'm also a bit unclear about GH issues... It sounds like the workflow will be as follows:

Create parent issue for a dataset build manually, lets call it PLUTO v24.1
Automatically generated issue PLUTO v24.1 1-initial build (child issue) --> We add tag Ready --> GIS added to the issue --> GIS updates the child issue's tag In review:
- Their QA review looks good. They put tag Passed for the child issue
- Their QA review asks for changes --> They put tag Failed for the child issue --> They note required changes in the parent issue --> we repeat the process for generating consecutive child issues.

Does that sound right?

alexrichey commented 3 weeks ago

Nice write-up! Thank you for doing it. Quick q: we can still use draft/ dir for DE internal testing, correct? And when we intend to build-build, we will be using publish/?

Thank you, and yes. /draft would be for DE only.

alexrichey commented 3 weeks ago

I'm also a bit unclear about GH issues... It sounds like the workflow will be as follows: ... Does that sound right?

@sf-dcp Yes - thank you, I was a little light on the details around the GIS workflow. Your distillation is very clear though - will copy that over to the main description above. That's exactly what I'd envisioned.

alexrichey commented 3 weeks ago

@sf-dcp Ah, the one difference I'd envisioned is that GIS would note required changes in the child issue, not the parent. Updated main text with your example flow.

fvankrieken commented 3 weeks ago

I really have basically no notes on this write-up. This all sounds great, and makes complete sense for how we generally interact with these once they're in this review state. My main note is that GIS support in terms of data access is a bit of a footnote, and that it really needs to be a key part of this, we can't have a result be that they have to struggle manually with a new folder structure.

Counterpoint would be to just keep "latest" for now. But seems like while we're touching all this stuff will be the best time to get work on a longer-term solution

damonmcc commented 3 weeks ago

@alexrichey This all sounds great!

Regarding what we call the datasets in our publishing folders, is there a better term than draft publications? publication drafts? They really are just drafts of what we'll eventually publish.

I guess if these don't live in the DE-only draft folder, I'm hesitant to call them any type of draft

revisions? Also builds is the term we seem to naturally use when we're talking. so maybe draft builds, publication builds?

damonmcc commented 3 weeks ago

It seems like the verb "publish" here means "DE declares a certain build is ready for QA." So once a build passes QA, which verbs are next. I guess "package and "distribute"?

I really like "stage" as a verb for "DE declares a certain build is ready for QA", but wouldn't mind "publish" if we're happy with it only meaning that.

alexrichey commented 3 weeks ago

Yeah, here "publish" means something like "internally published by DE." I'll need to think more about this terminology. Suggestions welcome. "Staged" is sensible, esp in the implication that not everything that's been staged will make it into production. It does have mostly connotations with other systems (ie Git, CI/CD) and I'm not sure how I feel about that.

fvankrieken commented 3 weeks ago

Might also be good to distance from staging in that it has a very specific meaning for datasets in edm-publishing/datasets - staging for qa a bit more specifically in web apps, a step down the line from GIS

fvankrieken commented 3 weeks ago

Unless we want to align those!

alexrichey commented 2 weeks ago

I hate to say it, but I think the term we're actually looking for here is draft. So perhaps what we want is:

instead of saving builds in drafts/, we should save them in a builds/ folder.
when ready for QA, we promote/publish a build into a draft. (this creates the Github Issue via automation)
drafts will have draft versions (e.g. 3-add-extra-correction) and will never be overwritten.
We can always assume the draft with the largest version number is what was published. (or we can continue to copy over into the publish folder)

damonmcc commented 2 weeks ago

very down to use draft for a build we've promoted for QA

when it passes QA, do we then promote a draft to a release?

fvankrieken commented 2 weeks ago

I hate to say it, but I think the term we're actually looking for here is draft. So perhaps what we want is:

I also hate to agree but I think you're right. We've sort of circled around to the original "staging" idea, we're just adding something before "draft" instead of after.

So in this, "builds" are very disposable (good), but what about "drafts"? To both of your last points, do we clean them out after promoting a draft? (to publish, promote, or release folder.) Or is the final draft exactly that, and we distribute from there?

There's something nice about drafts being more permanent - we can always look back at what happened throughout the course of QA. At least for some period of time (6 months? Year? 2 further publications? Maybe just forever, our s3 costs are pretty cheap). And while part of me likes the simplicity of the final draft being the real "final draft", promoting in some way still has a certain amount of clarity that makes things like GIS endpoints simple. A counterpoint to that though is this latest "republishing" of pluto - I'm not sure (in general) how we'd best want to capture a republishing (and redistribution) in this scheme

damonmcc commented 2 weeks ago

And while part of me likes the simplicity of the final draft being the real "final draft", promoting in some way still has a certain amount of clarity that makes things like GIS endpoints simple. A counterpoint to that though is this latest "republishing" of pluto - I'm not sure (in general) how we'd best want to capture a republishing (and redistribution) in this scheme

I guess if our act of promoting a "final drat" generates metadata somewhere and we later have to promote a new draft and overwrite the thing at the final endpoint, we'll have a record of the republishing/redistribution

Hard to say where the best place for that record would be though (ignoring a DB option for now since we use json files which I still really like). Probably shouldn't be a file in a draft folder, seems like everything in there should be the result of upstream operations on builds.

Maybe a file in the publish folder? To have a running log of of all published drafts?

alexrichey commented 2 weeks ago

do we clean them out after promoting a draft?

I was thinking no - a draft being the immutable thing that we hand to GIS for QA, which is linked to a Github Issue.

A counterpoint to that though is this latest "republishing" of pluto

I think you're right. The /publish folder feels like the quickest path to marking drafts as official. The other option is some database solution, or using GH issues (a poor man's database, I guess)

But using /publish, say we re-publish PLUTO, and each cycle has a few rounds of QA. The drafts folder might look like:

db-pluto/drafts/24v2

1-initial-build (QA fails)
2-fix-bug (QA fails)
3-corrections (QA Pass) --> this one is published to db-pluto/publish/24v2/1

A month later:
4-redo-something (QA Pass) --> db-pluto/publish/24v2/2

So we end up with every intermediate in /drafts, and two versions in /publish

fvankrieken commented 2 weeks ago

That all sounds great to me

damonmcc commented 2 weeks ago

@alexrichey

when there are multiple subfolders in /publish, how should we distribute? default to the highest number?

and sounds like these /publish folders will replace our idea of an edm-distributions bucket? or should that still be where we distribute from by copying from /publish? the latter seems nice so that we can have an endpoint that doesn't have these multi-version subfolders

alexrichey commented 2 weeks ago

@damonmcc In this scheme, we're still going to keep the packaging folder under the product. So to package, you'll specify a version (and potentially sub-version) of a dataset, (e.g. PLUTO 24v2 version 2) that'll get dropped into a /package folder with the same version. Then distribution will point at /package. Unless we just want to package everything up under publish.

And yes, this does mean we get rid of edm-distributions.

One other consideration is that in the case of republishing, we'd want to add some versioning scheme. Maybe something like 24v2-r2. (not something we have to figure out here)

NYCPlanning / data-engineering