NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
14 stars 0 forks source link

DevDB Continued Refresh #485

Closed fvankrieken closed 4 months ago

fvankrieken commented 4 months ago

Relatively straightforward. Waiting @damonmcc's version pinning of devdb tests

Commits

  1. Move majority of version.env vars into recipe, and grab previous version during planning
  2. update readme
  3. use parquet for a single dataset
  4. more error handling in looking up past versions
alexrichey commented 4 months ago

Nice, though I do wonder if we're conflating template variables and env variables. Wondering if we might either: 1) just set the recipe's env variables into the environment at the beginning of a build. Might makes sense to have some dcpy functionality to set env vars, for various reasons 2) or just call these vars instead of env, and don't bother putting them in the os.environ

In any case, I do really like capturing vars in the recipe, esp if we upload the recipe file to edm-publishing, just to make it explicit what went into the build.

Thoughts @damonmcc @sf-dcp ?

fvankrieken commented 4 months ago

Nice, though I do wonder if we're conflating template variables and env variables. Wondering if we might either:

  1. just set the recipe's env variables into the environment at the beginning of a build. Might makes sense to have some dcpy functionality to set env vars, for various reasons
  2. or just call these vars instead of env, and don't bother putting them in the os.environ

In any case, I do really like capturing vars in the recipe, esp if we upload the recipe file to edm-publishing, just to make it explicit what went into the build.

Thoughts @damonmcc @sf-dcp ?

Yeah separating env and template vars seems like a good distinction. Especially since env often contains secrets - not that bash env vars are making it into the recipe's env field, but just seems good to keep that distinction explicit and make sure nobody makes some unfortunate assumptions

fvankrieken commented 4 months ago

Nice, though I do wonder if we're conflating template variables and env variables. Wondering if we might either:

  1. just set the recipe's env variables into the environment at the beginning of a build. Might makes sense to have some dcpy functionality to set env vars, for various reasons
  2. or just call these vars instead of env, and don't bother putting them in the os.environ

In any case, I do really like capturing vars in the recipe, esp if we upload the recipe file to edm-publishing, just to make it explicit what went into the build. Thoughts @damonmcc @sf-dcp ?

Yeah separating env and template vars seems like a good distinction. Especially since env often contains secrets - not that bash env vars are making it into the recipe's env field, but just seems good to keep that distinction explicit and make sure nobody makes some unfortunate assumptions

There are multiple cases where "input" vars still need to be made part of the bash env for the rest of the build, so maybe we should move all of that to template vars, and then export the entire vars field, rather than the targeted way it's being done for version and version_prev for devdb in the github action in this pr (and pluto as well, though currently no changes in this PR)

alexrichey commented 4 months ago

@fvankrieken I kinda lean towards

so maybe we should move all of that to template vars, and then export the entire vars field

Just for consistency across the entire build process. But I don't have a strong opinion, and I'm happy if you want to get this in and we can let it percolate.

fvankrieken commented 4 months ago

Going to make that env/var tweak, then re-run devdb build before merging

fvankrieken commented 4 months ago

build all on the latest commit of this branch. FacDB is failing due to change in schema of source dataset that @sf-dcp is working on.

I also left out FacDB of commit 6 because I didn't want to collide with anything @sf-dcp is doing