NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Library - Archive Raw Data #499

Closed fvankrieken closed 8 months ago

fvankrieken commented 10 months ago

Parts of this could overlap with #498, mainly in that some of this work will end up as metadata changes. But I think there are some specific metadata tweaks that align well with this change, while #498 is meant to be more comprehensive and expansive, as opposed to a tweak of what we currently have.

I see a few components of this

This maybe should be put on hold until we fully move to parquet, just for cleanliness. The work doesn't need to wait though, just the full operationalizing of it.

fvankrieken commented 8 months ago

Rescoping this issue a bit per discussion on 2/28.

For now, this python code should live in dcpy/extract (maybe in a file called ingest.py, if we don't feel like that's a loaded term). The logic in this file should (a rough sketch follows the list)

  1. Take in a yml template (from library), specifically the `source` section
  2. Based on source type, dump raw data. This should go in a folder in edm-recipes, say we decide on `raw_datasets` (don't love it, but need something just for the sake of writing this out). So for `bpl_libraries`, a dump would be in `edm-recipes/raw_datasets/bpl_libraries/{timestamp}/{filename}`. For datasets where we're not getting a file per se (socrata, or a json response), probably default to `{datasetname}.{extension}`. I think we'll want to do a `latest` folder, though not 100% sure.
  3. Return the path to this raw data, be it a string s3 path or some sort of object (a la `dcpy.connectors.edm.recipes.dataset`)
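
A minimal sketch of what that could look like, assuming boto3 and requests for the actual i/o; the names here (`RAW_FOLDER`, `ingest`, the template fields) are placeholders for illustration, not settled API:

```python
# Hypothetical sketch of dcpy/extract/ingest.py; all names here are
# placeholders for illustration, not settled API.
from datetime import datetime, timezone
from pathlib import Path

import boto3
import requests
import yaml

BUCKET = "edm-recipes"
RAW_FOLDER = "raw_datasets"  # placeholder name per the discussion above


def ingest(template_path: Path) -> str:
    """Read a library yml template, dump the raw source data to s3,
    and return the s3 path of the dump."""
    template = yaml.safe_load(template_path.read_text())
    source = template["source"]  # assumes the template has a 'source' section
    dataset = template["name"]
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")

    # Dispatch on source type; only a web file source is sketched here.
    if "url" in source:
        resp = requests.get(source["url"])
        resp.raise_for_status()
        filename = source["url"].rstrip("/").split("/")[-1]
        key = f"{RAW_FOLDER}/{dataset}/{timestamp}/{filename}"
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=resp.content)
    else:
        raise NotImplementedError(f"source type not handled: {list(source)}")

    return f"s3://{BUCKET}/{key}"
```

Returning a richer object instead of the string path would just mean wrapping that return value.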

This will obviously require revamping the `source` section of a dataset's template, including getting more specific than "path" and revising what comes from a "script". These roughly include

For all of these, we need to think about what validations happen and what inputs are needed (and how they can actually be ingested, of course: s3 can use our utils, a web file can use requests, etc). My personal thought would be to find an example of each source type and create a pared-down template for use while developing, which we can attempt to reconcile with library templates later. A sketch of what such pared-down source models might look like is below.
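
A hypothetical set of pared-down source models with per-type validation, assuming pydantic; the field names and type discriminators are illustrative, not a settled schema:

```python
# Hypothetical pared-down source models for development; field names and
# the type discriminators are illustrative, not settled schema.
from typing import Literal, Union

from pydantic import BaseModel, Field, HttpUrl


class S3Source(BaseModel):
    type: Literal["s3"]
    bucket: str
    key: str


class WebFileSource(BaseModel):
    type: Literal["web_file"]
    url: HttpUrl


class SocrataSource(BaseModel):
    type: Literal["socrata"]
    domain: str
    dataset_id: str  # socrata's four-by-four id, e.g. "abcd-1234"


class ParedDownTemplate(BaseModel):
    name: str
    # Validation dispatches on the 'type' field of the source block
    source: Union[S3Source, WebFileSource, SocrataSource] = Field(
        discriminator="type"
    )
```

A pared-down yml for, say, a web file source would then just be a `name` plus a `source` block with `type: web_file` and a `url`.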

fvankrieken commented 8 months ago

Potentially, the version should also be determined at this step (and logged with, say, a config.json like in data library; a sketch of that is below). For the #396 datasets from GIS, it seems easier to figure this out at the same time as archival rather than later.
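
For example, a minimal sketch of logging that alongside the dump; the config.json fields shown here are guesses at what we'd want, not a settled schema:

```python
# Hypothetical sketch: write a config.json next to the raw dump so the
# determined version is logged at archival time. Fields are illustrative.
import json
from datetime import datetime, timezone

import boto3

BUCKET = "edm-recipes"
RAW_FOLDER = "raw_datasets"  # same placeholder name as above


def log_config(dataset: str, timestamp: str, version: str, source: dict) -> None:
    config = {
        "dataset": dataset,
        "version": version,  # e.g. determined from the source at archival time
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }
    key = f"{RAW_FOLDER}/{dataset}/{timestamp}/config.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=json.dumps(config, indent=2)
    )
```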