NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Output folders, dev environments, etc #98

fvankrieken closed 8 months ago

fvankrieken commented 10 months ago

@alexrichey and I had a bit of a chat today and figured we should start logging thoughts and figure out what we want long term.

Discussion stemmed from questions around the logic that currently uses the git CLI to get the branch name to use as the export folder for a build, and how to make this work both in CI and locally. @alexrichey brought up the point that we could move to making these explicit environment variables: in my .env I could have output_folder=fvankrieken or, if I really wanted something more fine-grained, output_folder=fvankrieken_incorporate_pff, etc. Essentially this offers more control and decouples us from git slightly (including being able to push to "main" without running code on main, and potentially using inputs in CI for this as well).
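To make the idea concrete, here is a minimal sketch of how an explicit variable could take precedence over the git lookup. The function name `resolve_output_folder` is hypothetical and not existing dcpy code; the variable name `output_folder` is taken from the example above, and the fallback simply shells out to `git rev-parse`, which may or may not match what the current logic does.

```python
import os
import subprocess


def resolve_output_folder(env=None):
    """Return the export folder name for a build (hypothetical helper).

    An explicit `output_folder` variable wins; otherwise fall back to
    the current git branch name.
    """
    env = os.environ if env is None else env
    explicit = env.get("output_folder", "").strip()
    if explicit:
        return explicit
    # Fallback: ask the git CLI for the checked-out branch.
    # On a detached HEAD (common in CI) this prints "HEAD", which is
    # one reason an explicit variable may be preferable there.
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.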

@damonmcc curious if you have immediate thoughts

fvankrieken commented 10 months ago

Ties into similar thoughts around schemas/dbs too

fvankrieken commented 10 months ago

Bit of discussion in pff pr here

damonmcc commented 10 months ago

glad to have an issue for this! some thoughts:

alexrichey commented 10 months ago

I like all these ideas, especially the default of {dataset}/{branch}/{version}/ except with some other concept subbed in for branch. @damonmcc's suggestion of build_environment makes a lot of sense to me.

Also, happy with the current compromise in which we let folks insert intermediate paths to do something like {dataset}/{build_environment}/{your/weird/path/here}/{version}/, though maybe we lock that down for certain build_environments like latest or staging.
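A rough sketch of what that path logic might look like as a standalone function, assuming the layout above. Everything here is illustrative: `build_output_path`, `LOCKED_ENVIRONMENTS`, and the choice to raise `ValueError` are hypothetical names and decisions, not what dcpy currently does.

```python
# Hypothetical: environments where intermediate paths are not allowed.
LOCKED_ENVIRONMENTS = {"latest", "staging"}


def build_output_path(dataset, build_environment, version, intermediate=""):
    """Build `{dataset}/{build_environment}/[intermediate/]{version}/`."""
    # Drop empty segments so "a//b/" and "/a/b" both normalize to a/b.
    parts = [p for p in intermediate.split("/") if p]
    if any(p in (".", "..") for p in parts):
        raise ValueError("intermediate path may not contain '.' or '..'")
    if parts and build_environment in LOCKED_ENVIRONMENTS:
        raise ValueError(
            f"intermediate paths are not allowed for '{build_environment}'"
        )
    return "/".join([dataset, build_environment, *parts, version]) + "/"
```

For example, `build_output_path("pff", "fvankrieken", "v1", "your/weird/path/here")` yields `pff/fvankrieken/your/weird/path/here/v1/`, while the same intermediate path with `build_environment="staging"` raises.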

alexrichey commented 10 months ago

Also, if we extract the path code out of dcpy.edm.publishing.upload, that would be a very self-contained, unit testable bit of code (with some fun edge cases!) we could hand over to @DeaBardhoshi or @athursland

damonmcc commented 9 months ago

noting some links shared in the DE teams channel

we should definitely consider this in our storage/versioning discussions: https://docs.digitalocean.com/products/spaces/how-to/versioning/ https://docs.aws.amazon.com/AmazonS3/latest/userguide/versioning-workflows.html