NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Environment norms/standardization #18

Closed fvankrieken closed 9 months ago

fvankrieken commented 1 year ago

Another brainstorming issue. Prior to our monorepo, we were generally moving towards dev containers. Do we want one main dev container that we use for all parts of the repo? This also then ties in to environments for github actions - are we pulling from a custom built docker image? Running setup.sh scripts?

Some thoughts on 1 vs multiple dev container definitions

Pros of 1 dev container definition/Cons of multiple:

Cons (or some pros of having multiple container definitions)

Random thoughts around answers

@damonmcc

fvankrieken commented 1 year ago

And on top of that, a bunch of small important details

fvankrieken commented 1 year ago

Couple tangential notes on gdal -

fvankrieken commented 1 year ago

The series of issues at the end of this PR sort of summarizes the problem

https://github.com/NYCPlanning/data-engineering/pull/15

Attempting to get fgdb working for pluto. I could use a base image and install gdal from the command line, but that's a more manual, longer process than using a published docker image. But the published docker images don't come with postgres and other utils, so I'm going in circles over what I get from the chosen image vs what I install from the command line. A little bit of a me issue here, I'll get it settled and it'll be fine. But still, I'm thinking that we should just publish some images that we like and try to use just those, with minimal configuration outside of those image definitions

damonmcc commented 1 year ago

noting that Postgres 14 may be the best unifying version (link)

fvankrieken commented 1 year ago

Random small note that I think makes sense to add here, on the topic of environment and mc setup

Simplest way to handle .env stuff in the dev container might just be using the env_file option in docker compose. Then devcontainer.json could specify a "setup.sh" script to run at startup (like we've often done), which adds mc spaces using the secrets. I would say we then have the convention that "setup.sh" is something that's supposed to be called both at dev container startup and as a first step of any cloud build
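A rough sketch of the "read .env, then register mc aliases" part of that idea, in Python rather than shell so it's easy to test. It only builds the `mc alias set` command and does not run it. The variable names (`S3_ENDPOINT`, `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`) and the alias name `spaces` are made up for illustration and are not the names this repo actually uses.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mc_alias_command(env, alias="spaces"):
    """Build (but do not run) an `mc alias set` invocation.

    The env variable names here are hypothetical placeholders.
    """
    return [
        "mc", "alias", "set", alias,
        env["S3_ENDPOINT"],
        env["S3_ACCESS_KEY_ID"],
        env["S3_SECRET_ACCESS_KEY"],
    ]
```

A setup script could pass the result to `subprocess.run(..., check=True)`, so the same step behaves identically at dev container startup and in a cloud build.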

Side note, in that docker link they mention secrets as being a more appropriate way to store secrets than an envfile/env variables. So maybe worth going that route - just seems a tad more manual (expects secrets to be declared individually, with values being in their own files), so would need to work on ergonomics - maybe generating files from .env as a "pre-build script" or whatever the term is in devcontainer.json
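For the ergonomics piece, here is a minimal sketch of generating one file per secret from the contents of a .env file. Each file could then be referenced from a `secrets:` entry with `file:` in the compose file. The lowercase file naming and the output directory are assumptions, and where this would be hooked in (for example, a devcontainer.json hook such as `initializeCommand`) is left open.

```python
from pathlib import Path


def write_secret_files(env_text, out_dir):
    """Write one file per KEY=VALUE entry and return {key: file path}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Assumed convention: the file is named after the lowercased key.
        target = out / key.lower()
        target.write_text(value.strip())
        target.chmod(0o600)  # keep secret files readable by the owner only
        written[key] = target
    return written
```

The generated directory would need to be gitignored, same as .env itself.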

damonmcc commented 12 months ago

noting some thoughts on the various docker images we maintain and where they are located (link)

damonmcc commented 11 months ago

noting that Postgres 14 may be the best unifying version (link)

our persistent DB called EDM-DATA appears to be Postgres 15.3. I checked by running the query select version();. I'm not aware of this causing any issues when changing a build script from using a temporary Postgres 14 DB to using EDM-DATA, but seems good to know

if we eventually want/need to do it, Max recently rolled the version of EDM-DATA forward and back a few times, and I remember it not being too bad
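If it ever becomes useful to check for this kind of mismatch automatically, the string returned by `select version();` can be parsed. A small sketch, assuming the string starts with the usual `PostgreSQL <major>.<minor>` prefix. The function names are hypothetical and the sample strings in the usage note are illustrative.

```python
import re


def parse_pg_version(version_string):
    """Extract (major, minor) from the output of `select version();`."""
    match = re.search(r"PostgreSQL (\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    return major, minor


def same_major(version_a, version_b):
    """True when two version strings share the same major version."""
    return parse_pg_version(version_a)[0] == parse_pg_version(version_b)[0]
```

For example, `same_major("PostgreSQL 14.9 on ...", "PostgreSQL 15.3 on ...")` is false, which a build script could log as a warning rather than treat as a failure.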

fvankrieken commented 10 months ago

This issue is maybe a bit out of date - we could create a new issue for this, but last outstanding issue for near future in my mind is different python environments per recent back and forth on PR, brought up by @alexrichey.

Basically, would be good to start specifying multiple sets of requirements for different environments

Anything else? Should also think a bit about pinning versions. Ideally, we pin versions across these environments for consistency of running code, but then subset for installation. Not sure easiest way to do this, or if there's any machinery that we should make around this.
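One possible shape for the "pin everywhere, subset for installation" idea: keep a single mapping of pinned versions and generate each environment's requirements from the subset of packages it needs. This is only a sketch. The function name is hypothetical, and pip's constraints files (`pip install -r requirements.txt -c constraints.txt`) may already cover this without any custom machinery.

```python
def pinned_subset(pins, wanted):
    """Return sorted 'name==version' lines for the packages in `wanted`.

    `pins` maps every package used anywhere in the repo to one pinned version.
    `wanted` lists the packages a single environment needs.
    """
    missing = sorted(set(wanted) - set(pins))
    if missing:
        raise KeyError(f"no pinned version for: {', '.join(missing)}")
    return [f"{name}=={pins[name]}" for name in sorted(set(wanted))]
```

Each environment's requirements file would then be the output of `pinned_subset` for its own package list, so two environments can never disagree on a version.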

A related note - at this point, are docker images still the way we want to go with this? There's a nice simplicity for use, but just want to check in if anyone has other thoughts

damonmcc commented 9 months ago

@fvankrieken related to the docker maintenance work, is now a good time to create a Docker Hub account using the DataEngineering_DL email? I think we're still using former employees' credentials to push images and such

fvankrieken commented 9 months ago

We are, but it'd have to be associated with the nycplanning org - I've asked @AmandaDoyle for details on that

fvankrieken commented 9 months ago

Also remember that we have a new one - ITD-GDE-DE_DL

damonmcc commented 9 months ago

🎊🎊🎊