Pull out models in dcpy

fvankrieken commented 3 months ago

Most of this is commit 1

Organization

These modules are listed in order of availability. So lifecycle can reference connectors, which can reference utils, and all can reference models

models

Definition of simple types. The logic here is that we can separate out these sort-of data/metadata definitions so that they can be referenced without worrying about circular references. Specifically - when we dump config.json in edm-recipes, this is a model that really is specific to extract - it's the definition of a dataset being ingested/extracted into our pipelines. connectors.edm.recipes is involved with putting and fetching these, but really is a level down, logic-wise. Our librarian (see connectors section below). So this way, recipes connector can at least know that it's returning an extract object, without really knowing what extract really is (and not being able to access any of the actual extract code. Just model definitions)

utils

Fairly self-explanatory. More pure code

connectors

I think also decently self explanatory. Apis for interacting with entities, like our recipes data lake, edm-publishing, third party apis, etc. @alexrichey has described the thought of librarian code - each connector being able to push data, pull data, move data around. Doesn't need to know entirely the specifics of what's being moved, just how it's being moved

lifecycle

Business-logic! The highest-level definition of our processes that manage the "lifecycle" of our data products (and input datasets). Currently just consists of build/extract

library

Its own submodule until it gets put out to pasture

Thoughts on naming

lifecycle

This was @alexrichey 's thought. I like it - thinking about the lifecycle of our data. Other random thoughts (not that I think these are better - just want to give fodder for brainstorming)

mart (just kidding)
business
operations
orchestration

lifecycle.extract

Replacing library. I liked extract at first, until I made a file within it called ingest that was really doing, very specifically, extraction. Really, this code is meant to do a few things

extract data from source - that is, quite literally, fetch it based on how its source and file format are defined. I would sort of like to call this subcomponent "extract"
convert to parquet (a transformation of sorts, also a "load" of sorts in some way (getting into a consistent analytical format)
transform data, with limited scope - reproject, clean data. Calling this "preprocessing" for now
archive in s3 (a load of sorts)

Overall, I do feel like "ingest" describes this most completely, since the word "extract" is more specific to pulling something out of something else, and "archive" is just that, the act of archival. Where "ingest" means to take in. However, extract is a bit more of an industry standard term.

Possible candidates for this module

extract
ingest
archive
library
collect/collection

Organization of models

Right now, this fairly simply follows the organization of the rest of the modules. This makes sense - there are models specific to certain connectors. But there are also likely more "generic" types that can live at the top level. Open to brainstorming here

fvankrieken commented 2 months ago

Other commits:

Use enums in pydantic until serialization. There's some boilerplate here, at some point I want to put more time into figuring out how to do this a bit more gracefully, and make this behavior default for some class we define that extends BaseModel. But I'll save that for a follow up. I'm very happy about this functionality - we never have to stop and think about whether an enum is in fact an enum or its value in python, it's simply an enum. And then it's serialized to its string. Seems much more optimal to me than treating it as string upon object creation
update refs to dcpy
"
comment out something in extract while I figure out how to handle sources/connectors more gracefully
rename file metadata in extract to be more relevant - it's not really parquet metadata, it simply is a description of the format of the input file. This is fairly relevant to extract code, so maybe it can stay there, but it is a tad more general so I moved it up for now

fvankrieken commented 2 months ago

Work moving forwards (after this PR) will largely be focused on moving some logic from extract to connectors (hence the name of this branch). But that seems a bit distinct from the meat of this, and this would be nice to get merged if we want to move in this direction

alexrichey commented 2 months ago

Love it! This all feels like the right direction. Re: the question of extract vs ingest, I would think that ingest is the combination of three steps: extract, transform, and archive. So maybe it makes sense to have that as the top level?

I think my only nit is maybe having lifecycle.build (singular) rather than builds? There were a few things that seems like they could be moved to different spots (e.g. Socrata servers) but I don't think we need to nail that down now.

This type of refactoring also makes me quite glad to have such a nice test suite

fvankrieken commented 2 months ago

Love it! This all feels like the right direction. Re: the question of extract vs ingest, I would think that ingest is the combination of three steps: extract, transform, and archive. So maybe it makes sense to have that as the top level?

I think my only nit is maybe having lifecycle.build (singular) rather than builds? There were a few things that seems like they could be moved to different spots (e.g. Socrata servers) but I don't think we need to nail that down now.

This type of refactoring also makes me quite glad to have such a nice test suite

The builds thing annoys me too, but build is a special python folder generally so i think we'd see issues (it's gitignored right now)

fvankrieken commented 2 months ago

@sf-dcp would still like your input, but going to get this merged for now, and we can iterate on how we want to architect extract/ingest code

sf-dcp commented 2 months ago

Great reorg, @fvankrieken! Now it feels intuitive where everything lives. Agree on moving from extract to ingest. Don't love having builds instead of build, would be nice to replace

NYCPlanning / data-engineering