Closed fvankrieken closed 2 months ago
Other commits:
Work moving forwards (after this PR) will largely be focused on moving some logic from extract
to connectors
(hence the name of this branch). But that seems a bit distinct from the meat of this, and this would be nice to get merged if we want to move in this direction
Love it! This all feels like the right direction. Re: the question of extract vs ingest, I would think that ingest
is the combination of three steps: extract, transform, and archive. So maybe it makes sense to have that as the top level?
I think my only nit is maybe having lifecycle.build (singular) rather than builds? There were a few things that seems like they could be moved to different spots (e.g. Socrata servers) but I don't think we need to nail that down now.
This type of refactoring also makes me quite glad to have such a nice test suite
Love it! This all feels like the right direction. Re: the question of extract vs ingest, I would think that
ingest
is the combination of three steps: extract, transform, and archive. So maybe it makes sense to have that as the top level?I think my only nit is maybe having lifecycle.build (singular) rather than builds? There were a few things that seems like they could be moved to different spots (e.g. Socrata servers) but I don't think we need to nail that down now.
This type of refactoring also makes me quite glad to have such a nice test suite
The builds
thing annoys me too, but build
is a special python folder generally so i think we'd see issues (it's gitignored right now)
@sf-dcp would still like your input, but going to get this merged for now, and we can iterate on how we want to architect extract/ingest code
Great reorg, @fvankrieken! Now it feels intuitive where everything lives. Agree on moving from extract
to ingest
. Don't love having builds
instead of build
, would be nice to replace
Most of this is commit 1
Organization
These modules are listed in order of availability. So
lifecycle
can referenceconnectors
, which can referenceutils
, and all can referencemodels
models
Definition of simple types. The logic here is that we can separate out these sort-of data/metadata definitions so that they can be referenced without worrying about circular references. Specifically - when we dump config.json in edm-recipes, this is a model that really is specific to
extract
- it's the definition of a dataset being ingested/extracted into our pipelines.connectors.edm.recipes
is involved with putting and fetching these, but really is a level down, logic-wise. Our librarian (see connectors section below). So this way,recipes
connector can at least know that it's returning anextract
object, without really knowing whatextract
really is (and not being able to access any of the actualextract
code. Just model definitions)utils
Fairly self-explanatory. More pure code
connectors
I think also decently self explanatory. Apis for interacting with entities, like our recipes data lake, edm-publishing, third party apis, etc. @alexrichey has described the thought of librarian code - each connector being able to push data, pull data, move data around. Doesn't need to know entirely the specifics of what's being moved, just how it's being moved
lifecycle
Business-logic! The highest-level definition of our processes that manage the "lifecycle" of our data products (and input datasets). Currently just consists of build/extract
library
Its own submodule until it gets put out to pasture
Thoughts on naming
lifecycle
This was @alexrichey 's thought. I like it - thinking about the lifecycle of our data. Other random thoughts (not that I think these are better - just want to give fodder for brainstorming)
lifecycle.extract
Replacing
library
. I liked extract at first, until I made a file within it calledingest
that was really doing, very specifically,extraction
. Really, this code is meant to do a few thingsOverall, I do feel like "ingest" describes this most completely, since the word "extract" is more specific to pulling something out of something else, and "archive" is just that, the act of archival. Where "ingest" means to take in. However, extract is a bit more of an industry standard term.
Possible candidates for this module
Organization of models
Right now, this fairly simply follows the organization of the rest of the modules. This makes sense - there are models specific to certain connectors. But there are also likely more "generic" types that can live at the top level. Open to brainstorming here