Rescoping this issue a bit per discussion on 2/28.
For now, this python code should live in dcpy/extract (maybe in a file called `ingest.py`, if we don't feel like that's a loaded term). The logic in this file should source a dataset and dump the raw output in a new s3 section, `raw_datasets` (don't love the name, but we need something just for the sake of writing out). So for bpl_libraries, a dump would be in `edm-recipes/raw_datasets/bpl_libraries/{timestamp}/{filename}`. For datasets where we're not getting a file per se (socrata, or a json response), probably default to `{datasetname}.{extension}`. I think we'll want to do a `latest` folder too, though not 100% sure (similar to `dcpy.connectors.edm.recipes.dataset`).
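
A rough sketch of that layout in code, assuming plain boto3 rather than whatever s3 utils we already have (`BUCKET`, `RAW_FOLDER`, `raw_dataset_key`, and `archive_raw` are all hypothetical names):

```python
from datetime import datetime, timezone
from pathlib import Path

import boto3

BUCKET = "edm-recipes"  # assumed bucket name, from the folder layout above
RAW_FOLDER = "raw_datasets"


def raw_dataset_key(dataset: str, version: str, filename: str) -> str:
    """Build the s3 key for a raw dump, e.g.
    raw_datasets/bpl_libraries/20240301T120000/bpl_libraries.json"""
    return f"{RAW_FOLDER}/{dataset}/{version}/{filename}"


def archive_raw(dataset: str, local_file: Path, latest: bool = True) -> str:
    """Upload a raw file to a timestamped folder, optionally mirroring to latest/."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = raw_dataset_key(dataset, timestamp, local_file.name)
    s3 = boto3.client("s3")
    s3.upload_file(str(local_file), BUCKET, key)
    if latest:
        s3.upload_file(str(local_file), BUCKET, raw_dataset_key(dataset, "latest", local_file.name))
    return key
```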
This will obviously take a revamping of how we define the source (and version) of a dataset, including getting more specific than "path" and revising what is coming from a "script". These roughly include sources like `edm-inbox` (s3), web files, and api responses (socrata, etc). @damonmcc thoughts?

For all of these, we need to think about what validations happen and what inputs are needed (and how they can actually be ingested, of course: s3 can use our utils, a web file can use requests, etc). My personal thought would be to find an example of each source type and create a pared-down template for use when developing, which we can attempt to reconcile with library templates later. Something like the sketch below.
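
To be concrete about what "pared down" might mean, a minimal sketch with made-up names (`Source`, `type`, `location`, and `fetch` are illustrative, not the library template schema):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

import boto3
import requests


@dataclass
class Source:
    """Hypothetical pared-down source spec - one variant per source type."""

    type: Literal["edm_inbox", "web_file", "api"]
    location: str  # s3 key for edm_inbox, url for web_file/api


def fetch(source: Source, destination: Path) -> Path:
    """Pull raw bytes to a local file, dispatching on source type."""
    if source.type == "edm_inbox":
        # s3 sources could go through our utils; plain boto3 shown here
        boto3.client("s3").download_file("edm-inbox", source.location, str(destination))
    else:
        # web files and api responses can both go through requests
        resp = requests.get(source.location, timeout=60)
        resp.raise_for_status()
        destination.write_bytes(resp.content)
    return destination
```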
Potentially, version should also be determined at this step (and logged with, say, a config.json like in data library). For the #396 datasets from GIS, it seems like it's easier to try to figure this out at the same time as archival rather than later.
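
For what that config.json could hold, a minimal sketch (the fields are guesses at what we'd want logged, not the data library schema; `write_ingest_config` is a hypothetical name):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_ingest_config(folder: Path, dataset: str, version: str, source: dict) -> Path:
    """Dump a config.json next to the raw file, roughly like data library does."""
    config = {
        "dataset": dataset,
        "version": version,  # e.g. pulled from source metadata or the run date
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "source": source,  # the resolved source spec, for reproducibility
    }
    path = folder / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path
```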
There's a bit of overlap here with #498, mainly in that some of this will end up as metadata changes. But I think there are some specific metadata tweaks that align well with this change, while #498 is meant to be more comprehensive and expansive, as opposed to a tweak of what we currently have.
I see a few components of this: pulling the raw data from each source type, archiving it under `raw_datasets`, revamping the source section of templates, and determining version/validations at ingest.
This maybe should be holstered until we fully move to parquet, just for cleanliness. The work doesn't need to wait, though; it's just fully operationalizing it that should.