Closed alexrichey closed 2 weeks ago
Nice write-up! Thank you for doing it. Quick q: we can still use draft/ dir for DE internal testing, correct? And when we intend to build-build, we will be using publish/?
I'm also a bit unclear about GH issues... It sounds like the workflow will be as follows:
PLUTO v24.1
PLUTO v24.1 1-initial build (child issue) --> We add tag Ready --> GIS added to the issue --> GIS updates the child issue's tag In review:
    Passed for the child issue
    Failed for the child issue --> They note required changes in the parent issue --> we repeat the process for generating consecutive child issues.
Does that sound right?
> Nice write-up! Thank you for doing it. Quick q: we can still use draft/ dir for DE internal testing, correct? And when we intend to build-build, we will be using publish/?
Thank you, and yes. /draft would be for DE only.
> I'm also a bit unclear about GH issues... It sounds like the workflow will be as follows: ... Does that sound right?
@sf-dcp Yes - thank you, I was a little light on the details around the GIS workflow. Your distillation is very clear though - will copy that over to the main description above. That's exactly what I'd envisioned.
@sf-dcp Ah, the one difference I'd envisioned is that GIS would note required changes in the child issue, not the parent. Updated main text with your example flow.
I really have basically no notes on this write-up. This all sounds great, and makes complete sense for how we generally interact with these once they're in this review state. My main note is that GIS support in terms of data access is a bit of a footnote, and it really needs to be a key part of this. We can't have the result be that they have to struggle manually with a new folder structure.
Counterpoint would be to just keep "latest" for now. But it seems like while we're touching all this stuff will be the best time to get to work on a longer-term solution.
@alexrichey This all sounds great!
Regarding what we call the datasets in our publishing folders, is there a better term than draft publications? publication drafts? They really are just drafts of what we'll eventually publish.
I guess if these don't live in the DE-only draft folder, I'm hesitant to call them any type of draft. revisions? Also, builds is the term we seem to naturally use when we're talking, so maybe draft builds, publication builds?
It seems like the verb "publish" here means "DE declares a certain build is ready for QA." So once a build passes QA, which verbs are next? I guess "package" and "distribute"?
I really like "stage" as a verb for "DE declares a certain build is ready for QA", but wouldn't mind "publish" if we're happy with it only meaning that.
Yeah, here "publish" means something like "internally published by DE." I'll need to think more about this terminology. Suggestions welcome. "Staged" is sensible, esp. in the implication that not everything that's been staged will make it into production. It does mostly carry connotations from other systems (i.e. Git, CI/CD) and I'm not sure how I feel about that.
Might also be good to distance from staging in that it has a very specific meaning for datasets in edm-publishing/datasets - staging for qa a bit more specifically in web apps, a step down the line from GIS. Unless we want to align those!
I hate to say it, but I think the term we're actually looking for here is draft. So perhaps what we want is:
[...] drafts/, we should save them in a builds/ folder.
[...] build into a draft. (This creates the Github Issue via automation.)
drafts will have draft versions (e.g. 3-add-extra-correction) and will never be overwritten.
[...] publish folder)

very down to use draft for a build we've promoted for QA
when it passes QA, do we then promote a draft to a release?
> I hate to say it, but I think the term we're actually looking for here is draft. So perhaps what we want is:
I also hate to agree but I think you're right. We've sort of circled around to the original "staging" idea, we're just adding something before "draft" instead of after.
So in this, "builds" are very disposable (good), but what about "drafts"? To both of your last points, do we clean them out after promoting a draft? (to publish, promote, or release folder.) Or is the final draft exactly that, and we distribute from there?
There's something nice about drafts being more permanent - we can always look back at what happened throughout the course of QA. At least for some period of time (6 months? Year? 2 further publications? Maybe just forever, our s3 costs are pretty cheap). And while part of me likes the simplicity of the final draft being the real "final draft", promoting in some way still has a certain amount of clarity that makes things like GIS endpoints simple. A counterpoint to that though is this latest "republishing" of pluto - I'm not sure (in general) how we'd best want to capture a republishing (and redistribution) in this scheme
> And while part of me likes the simplicity of the final draft being the real "final draft", promoting in some way still has a certain amount of clarity that makes things like GIS endpoints simple. A counterpoint to that though is this latest "republishing" of pluto - I'm not sure (in general) how we'd best want to capture a republishing (and redistribution) in this scheme
I guess if our act of promoting a "final draft" generates metadata somewhere and we later have to promote a new draft and overwrite the thing at the final endpoint, we'll have a record of the republishing/redistribution.
Hard to say where the best place for that record would be though (ignoring a DB option for now since we use json files which I still really like). Probably shouldn't be a file in a draft folder, seems like everything in there should be the result of upstream operations on builds.
Maybe a file in the publish folder? To have a running log of all published drafts?
> do we clean them out after promoting a draft?
I was thinking no - a draft being the immutable thing that we hand to GIS for QA, which is linked to a Github Issue.
> A counterpoint to that though is this latest "republishing" of pluto
I think you're right. The /publish folder feels like the quickest path to marking drafts as official. The other option is some database solution, or using GH issues (a poor man's database, I guess).
But using /publish, say we re-publish PLUTO, and each cycle has a few rounds of QA. The drafts folder might look like:
db-pluto/drafts/24v2
3-corrections (QA Pass) --> this one is published to db-pluto/publish/24v2/1
A month later:
So we end up with every intermediate in /drafts, and two versions in /publish.
That all sounds great to me
@alexrichey when there are multiple subfolders in /publish, how should we distribute? default to the highest number?
and sounds like these /publish folders will replace our idea of an edm-distributions bucket? or should that still be where we distribute from by copying from /publish? the latter seems nice so that we can have an endpoint that doesn't have these multi-version subfolders
@damonmcc In this scheme, we're still going to keep the packaging folder under the product. So to package, you'll specify a version (and potentially sub-version) of a dataset (e.g. PLUTO 24v2 version 2) that'll get dropped into a /package folder with the same version. Then distribution will point at /package. Unless we just want to package everything up under publish.
And yes, this does mean we get rid of edm-distributions.
One other consideration is that in the case of republishing, we'd want to add some versioning scheme. Maybe something like 24v2-r2. (not something we have to figure out here)
Problem Statement
When DE has finished a build, we've often encountered some combination of the following problems:
Proposed Solution
For all of our products, we should add a subfolder under the version to indicate the draft publication version. The current state looks like this:
[...] dataset files
[...] draft publication version/dataset files
The draft publication version will be composed of an integer version and a summary to describe the build, similar to the summary line of a git commit. A list of build versions could look like this:
I suggest an integer version instead of a timestamp because we don't really care when the draft was published, whereas the integer corresponds to something that we do care about. E.g. if we're in round three of PLUTO publishing, and you see that the last draft publication is 6-fix-the-issue, then you immediately know something is wrong.
Draft Publication Github Issues
Our Publishing Github Action will create a Github Issue for every published build version. Decisions, discussions, etc should be documented on that issue. They should all be linked back to a parent Issue for a build of a dataset.
The Issue for the draft publication should use Github Labels to indicate the status. A list of statuses might be:
Perhaps we can auto-add all of GIS as an Assignee
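As a rough illustration of how the status labels could be kept consistent, here is a minimal Python sketch. The label names (Ready, In review, Passed, Failed) are taken from the example flow discussed in this thread. The transition rules and the `move_status` helper are assumptions for illustration only, and nothing here calls the Github API.

```python
# Status labels from the example flow in this thread.
# The allowed transitions below are an assumption, not an agreed rule.
ALLOWED = {
    "Ready": {"In review"},
    "In review": {"Passed", "Failed"},
    "Passed": set(),
    "Failed": set(),
}


def move_status(labels, new_status):
    """Return the issue's labels with its single status label swapped for new_status."""
    current = [label for label in labels if label in ALLOWED]
    if len(current) != 1:
        raise ValueError(f"expected exactly one status label, found {current}")
    if new_status not in ALLOWED[current[0]]:
        raise ValueError(f"cannot move from {current[0]!r} to {new_status!r}")
    # Non-status labels (e.g. a product label) are left untouched.
    return (set(labels) - {current[0]}) | {new_status}
```

With these rules, a draft issue labeled Ready can only move to In review, and a Failed draft stays Failed: the fix goes into a new draft with its own issue.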
Implementation Details (Technical)
[...] publish folder to this new scheme.
[...] Publish action to accept a changes summary field, which will be used to generate the draft publication version. The integer part will be inferred from existing versions on DO.
dcpy: The draft publication version concept needs to be added to the edm publish connector.
Publish functionality should refuse to overwrite existing data.
Implementation Details (Nontechnical)
[...] draft folder should be considered deletable.
Other Considerations
[...] latest folders. It's a convenient hack, but has the liability of being potentially out of sync with actual latest versions. As part of this, we could help GIS migrate off. We could either supply them Python code to infer last build version, or add a REST endpoint to the QAQC app to redirect to the DO location.
[...] edm-publishing/db-pluto/23v2 which should be (presumably) under the draft folder.
[...] draft publications? publication drafts? They really are just drafts of what we'll eventually publish.
Example Workflow for GIS (Copied from @sf-dcp's comment)
Suppose we're building PLUTO v24.1.
[...] Ready and GIS team is added to the issue
Whiteboarding
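To make the version inference from Implementation Details (Technical) concrete, here is a minimal Python sketch. It assumes draft publication versions are folder names of the form `<integer>-<summary>` (as in 3-add-extra-correction) and that the existing folder names have already been listed from DO. The function names and the slug rule are hypothetical and are not part of dcpy.

```python
import re

# Assumed naming rule: "<integer>-<slugified summary>", e.g. "3-add-extra-correction".
_VERSION_RE = re.compile(r"^(\d+)-(.+)$")


def slugify(summary):
    """Lowercase the summary and join its alphanumeric runs with hyphens."""
    slug = "-".join(re.findall(r"[a-z0-9]+", summary.lower()))
    if not slug:
        raise ValueError("changes summary must contain letters or digits")
    return slug


def draft_number(folder):
    """Return the integer prefix of a draft folder name, or None if it has none."""
    match = _VERSION_RE.match(folder)
    return int(match.group(1)) if match else None


def next_draft_version(existing, changes_summary):
    """Build the next version name; the integer is one past the highest existing one."""
    numbers = [n for n in map(draft_number, existing) if n is not None]
    return f"{max(numbers, default=0) + 1}-{slugify(changes_summary)}"


def latest_draft(existing):
    """Return the folder with the highest integer prefix, or None if there is none."""
    numbered = [(draft_number(f), f) for f in existing if draft_number(f) is not None]
    return max(numbered)[1] if numbered else None


def check_no_overwrite(existing, version):
    """Refuse to publish into a folder name that already exists."""
    if version in existing:
        raise FileExistsError(f"draft {version!r} already exists")
```

For example, with existing folders `["1-initial-build", "2-fix-bbl", "latest"]`, `next_draft_version(existing, "Add extra correction")` gives `"3-add-extra-correction"` and `latest_draft(existing)` gives `"2-fix-bbl"`. Comparing the prefixes as integers rather than strings keeps 10 ahead of 9, and a helper like `latest_draft` is one possible shape for the Python code that could replace the latest folders.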