NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Library - Archive Raw Data #499

Closed fvankrieken closed 8 months ago

fvankrieken commented 10 months ago

Parts of this could overlap with #498, mainly in that some of this work will end up as metadata changes. But I think there are some specific metadata tweaks that align well with this change, while #498 is meant to be more comprehensive and expansive, as opposed to a tweak of what we currently have.

I see a few components of this

This maybe should be put on hold until we fully move to parquet, just for cleanliness. The work doesn't need to wait though, just the full operationalizing of it.

fvankrieken commented 8 months ago

Rescoping this issue a bit per discussion on 2/28.

For now, this python code should live in dcpy/extract (maybe in a file called ingest.py, if we don't feel like that's a loaded term). The logic in this file should (a rough sketch follows the list)

  1. Take in a yml template (from library), specifically the `source` section
  2. Based on source type, dump raw data. This should go in a folder in edm-recipes, say we decide on `raw_datasets` (don't love it, but need something just for the sake of writing this out). So for `bpl_libraries`, a dump would be in `edm-recipes/raw_datasets/bpl_libraries/{timestamp}/{filename}`. For datasets where we're not getting a file per se (socrata, or a json response), probably default to `{datasetname}.{extension}`. I think we'll want to do a `latest` folder, though not 100% sure.
  3. Return the path to this raw data, be it a string s3 path or some sort of object (a la `dcpy.connectors.edm.recipes.dataset`)
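
A minimal sketch of what that could look like, assuming boto3 and requests for the actual i/o; the names here (`RAW_FOLDER`, `ingest`, the template fields) are placeholders for illustration, not settled API:

```python
# Hypothetical sketch of dcpy/extract/ingest.py; all names here are
# placeholders for illustration, not settled API.
from datetime import datetime, timezone
from pathlib import Path

import boto3
import requests
import yaml

BUCKET = "edm-recipes"
RAW_FOLDER = "raw_datasets"  # placeholder name per the discussion above


def ingest(template_path: Path) -> str:
    """Read a library yml template, dump the raw source data to s3,
    and return the s3 path of the dump."""
    template = yaml.safe_load(template_path.read_text())
    source = template["source"]  # assumes the template has a 'source' section
    dataset = template["name"]
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")

    # Dispatch on source type; only a web file source is sketched here.
    if "url" in source:
        resp = requests.get(source["url"])
        resp.raise_for_status()
        filename = source["url"].rstrip("/").split("/")[-1]
        key = f"{RAW_FOLDER}/{dataset}/{timestamp}/{filename}"
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=resp.content)
    else:
        raise NotImplementedError(f"source type not handled: {list(source)}")

    return f"s3://{BUCKET}/{key}"
```

Returning a richer object instead of the string path would just mean wrapping that return value.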

This will obviously require revamping the `source` section of a dataset's template, including getting more specific than "path" and revising what comes from a "script". These roughly include

For all of these, we need to think about what validations happen and what inputs are needed (and how they can actually be ingested, of course: s3 can use our utils, a web file can use requests, etc). My personal thought would be to find an example of each source type and create a pared-down template for use while developing, which we can attempt to reconcile with library templates later. A sketch of what such pared-down source models might look like is below.
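
A hypothetical set of pared-down source models with per-type validation, assuming pydantic; the field names and type discriminators are illustrative, not a settled schema:

```python
# Hypothetical pared-down source models for development; field names and
# the type discriminators are illustrative, not settled schema.
from typing import Literal, Union

from pydantic import BaseModel, Field, HttpUrl


class S3Source(BaseModel):
    type: Literal["s3"]
    bucket: str
    key: str


class WebFileSource(BaseModel):
    type: Literal["web_file"]
    url: HttpUrl


class SocrataSource(BaseModel):
    type: Literal["socrata"]
    domain: str
    dataset_id: str  # socrata's four-by-four id, e.g. "abcd-1234"


class ParedDownTemplate(BaseModel):
    name: str
    # Validation dispatches on the 'type' field of the source block
    source: Union[S3Source, WebFileSource, SocrataSource] = Field(
        discriminator="type"
    )
```

A pared-down yml for, say, a web file source would then just be a `name` plus a `source` block with `type: web_file` and a `url`.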

fvankrieken commented 8 months ago

Potentially, the version should also be determined at this step (and logged with, say, a config.json like in data library; a sketch of that is below). For the #396 datasets from GIS, it seems easier to figure this out at the same time as archival rather than later.
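
For example, a minimal sketch of logging that alongside the dump; the config.json fields shown here are guesses at what we'd want, not a settled schema:

```python
# Hypothetical sketch: write a config.json next to the raw dump so the
# determined version is logged at archival time. Fields are illustrative.
import json
from datetime import datetime, timezone

import boto3

BUCKET = "edm-recipes"
RAW_FOLDER = "raw_datasets"  # same placeholder name as above


def log_config(dataset: str, timestamp: str, version: str, source: dict) -> None:
    config = {
        "dataset": dataset,
        "version": version,  # e.g. determined from the source at archival time
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }
    key = f"{RAW_FOLDER}/{dataset}/{timestamp}/config.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=json.dumps(config, indent=2)
    )
```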