NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Expand ingest preprocessing for script source coverage #774

Closed - fvankrieken closed this 1 month ago

fvankrieken commented 2 months ago

Resolves #633; follows up on #763

Bit of context first

Motivation/Scope

#763 gave a very basic framework for defining preprocessing steps in yml and validating/running them in python. Pandas was chosen as the tool for transformations here because of its ease of data cleaning relative to sql, though I could be convinced to go a different route. The current framework is the Preprocessor class in ingest.transform, which defines the valid preprocessing functions to call.
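For anyone skimming without the code open, the shape of that framework is roughly the following. This is a minimal sketch, not the actual class: apart from step names like rename_columns and filter_rows that come up later in this PR, the method signatures and argument names are illustrative.

```python
import pandas as pd


class Preprocessor:
    """Sketch only: each yml-defined step names a method on this class plus
    keyword args; transform looks the method up and applies it to the dataframe."""

    def rename_columns(self, df: pd.DataFrame, column_map: dict[str, str]) -> pd.DataFrame:
        # e.g. {"OldName": "new_name"} straight from the yml template
        return df.rename(columns=column_map)

    def filter_rows(self, df: pd.DataFrame, column: str, value) -> pd.DataFrame:
        # keep only rows where a column equals a given value
        return df[df[column] == value]
```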

For this PR, I wanted to take most/all of the library datasets that have custom "script" sources (which in reality have very normal sources much like our other datasets - api calls, local files, etc), but actually rely on their "source" script to perform some sort of preprocessing.

The first 6 commits here are more generic ingest stuff, code cleanup, etc. After that, each commit adds one or more templates from library to ingest, along with the appropriate yml and any new preprocessors needed to correctly spit out what's needed. Going commit-by-commit will definitely be easiest, and I think I've sorted out most of the cases where I might otherwise be overwriting changes in later commits.

Most of the logic is concerned with dcpy.lifecycle.ingest.transform - the bottom function reads the initial parquet file in as a dataframe, then applies preprocessing steps as defined in the yml input. There are some steps that are a little more "extract"-focused as well.
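In other words, the tail end of transform boils down to something like this (again a sketch, reusing the Preprocessor idea above: the real function's signature, config models, and validation live in dcpy.lifecycle.ingest, and preprocess/input_path/steps here are placeholder names):

```python
from pathlib import Path

import pandas as pd


def preprocess(input_path: Path, output_path: Path, steps: list[dict]) -> None:
    df = pd.read_parquet(input_path)  # parquet dumped by the extract step
    proc = Preprocessor()             # dispatch idea from the sketch above
    for step in steps:                # steps parsed (and validated) from the yml template
        df = getattr(proc, step["name"])(df, **step.get("args", {}))
    df.to_parquet(output_path, index=False)  # single parquet artifact for the next stage
```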

Happy to sit down and chat about anything if it'd be helpful

Commit summaries

General

  1. Bit of a bug - for consistency, it's simplest to just move a local file into the tmp/staging/etc dir that external files get downloaded to
  2. Also a bug - Path objects don't serialize nicely by default, as it turns out
  3. Cleanup of logic around where file outputs live in extract. Currently, for simplicity/flexibility, each big step dumps out a parquet file, but there was too much assumed logic around where these files were. I moved the default location-setting logic to runtime (in run.py), and the other functions now require the path as an arg. Just from testing and running it myself, this seemed to make more sense. Would love thoughts on this, as it's maybe still a tad clunky
  4. Bunch of little tweaks to transform - mainly around what arguments are getting shuttled along, taking paths as inputs rather than returning them as outputs. Fairly in line with commit 3
  5. Two things, which are tough to disentangle because I made this one commit instead of two
    • moving some validation code out of ingest.configure, where it doesn't belong
    • a custom recursive isinstance function. isinstance doesn't work for parameterized generics, and there are a few that would be nice to check - Literals, types of lists/dicts, etc. Union needed custom handling too, because isinstance does work on Unions but wouldn't work if one of the member types is a parameterized generic (rough sketch of the idea below, after this list)
  6. There's a test that validates all yml templates in ingest; ideally it should also validate all processing steps defined in them. Moved this test, since we're getting a little outside the configure logic
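Since item 5 is the least self-explanatory: the recursive isinstance check is conceptually along these lines. A hedged sketch, not the code in this PR - the function name and the exact set of handled generics are illustrative.

```python
import types
from typing import Any, Literal, Union, get_args, get_origin

# covers both typing.Union[X, Y] and the X | Y syntax (types.UnionType on 3.10+)
_UNION_TYPES = {Union, getattr(types, "UnionType", Union)}


def matches_type(value: Any, annotation: Any) -> bool:
    origin = get_origin(annotation)
    if origin is None:              # plain class - str, Path, a pydantic model, etc.
        return isinstance(value, annotation)
    if origin is Literal:           # Literal["csv", "json"] - value must be one of the options
        return value in get_args(annotation)
    if origin in _UNION_TYPES:      # recurse into each union member
        return any(matches_type(value, arg) for arg in get_args(annotation))
    if origin in (list, set):       # parameterized containers, e.g. list[str]
        (item_type,) = get_args(annotation)
        return isinstance(value, origin) and all(matches_type(i, item_type) for i in value)
    if origin is dict:              # e.g. dict[str, int]
        key_type, value_type = get_args(annotation)
        return isinstance(value, dict) and all(
            matches_type(k, key_type) and matches_type(v, value_type) for k, v in value.items()
        )
    return isinstance(value, origin)  # anything else: fall back to the unparameterized origin
```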

Dataset-specific

Links here will generally be to the python scripts in library/script

  1. add template for dcp_sfspd and filter_rows
  2. dob_cofos. This needed use of the computed version in the jinja template, renaming of columns, setting values of columns, and getting my first pass at append_prev to actually work. It also adds the generic pd.Series function: essentially, users can specify a string like "map" or "str.split" that names a callable Series function (df["column"].str.split(...)), along with keyword arguments to said function (see the sketch after this list). This also adds logic to validate calls to that function. Due to weirdness around what inspect returns for these functions, I needed to rework the validation function a bit, but all tests pass. For these functions, types cannot be validated, but at least the arguments (their presence, and whether any are missing) are
  3. dpr_capitalprojects, nypl_libraries, bpl_libraries. These don't add any preprocessing logic, but this commit adds a basic implementation for reading a json source into a dataframe. The columns thing feels a little hacky, but I was hitting errors trying to dump the "raw" dataframe to parquet - which we don't technically need, but it's still nice as a troubleshooting aid
  4. dob_now permits and applications. These also didn't need scripting logic, but did need an s3 source. Maybe this gets replaced with local file upload? If we're archiving raw data, I'm not sure we need to load to private s3 before archiving the raw data in s3
  5. dob bis applications and permits. These didn't have script sources, but they were our only two datasets with sql specified in gdal options, so that logic needed to be recreated. Tweaks rename_columns and adds a function to clean column names
  6. fisa commitments and detailed budget data. Needed the ability to specify column names when reading in raw data, as well as deduplication logic for the detailed budget data, since it appends to prev a la cofos
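For the generic pd.Series step mentioned in item 2, the dispatch is basically this kind of thing (hypothetical sketch: apply_series_method, its argument names, and the absence of real validation are all simplifications of what the PR actually does):

```python
import operator

import pandas as pd


def apply_series_method(df: pd.DataFrame, column: str, function: str, **kwargs) -> pd.DataFrame:
    # "function" is a dotted attribute path on the Series, e.g. "map" or "str.split";
    # attrgetter resolves nested accessors like df["col"].str.split
    method = operator.attrgetter(function)(df[column])
    df[column] = method(**kwargs)
    return df


# e.g. the equivalent of df["house_number"].str.strip(), driven entirely by yml-supplied strings:
# df = apply_series_method(df, column="house_number", function="str.strip")
```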
fvankrieken commented 1 month ago

TODO before review

fvankrieken commented 1 month ago

I need to add tests for commit 8 (the pd.Series logic), but I don't anticipate needing to change non-test code, as I've done a lot of use-case testing. Given that, I would appreciate input and review!

fvankrieken commented 1 month ago

I have currently managed to cover our use cases (I think!), but I also think it's perfectly fine to have a specific preprocessing function called process_cofos or something like that if need be

alexrichey commented 1 month ago

> I have currently managed to cover our use cases (I think!), but I also think it's perfectly fine to have a specific preprocessing function called process_cofos or something like that if need be

Gotcha - where does that type of thing live?

fvankrieken commented 1 month ago

> > I have currently managed to cover our use cases (I think!), but I also think it's perfectly fine to have a specific preprocessing function called process_cofos or something like that if need be
>
> Gotcha - where does that type of thing live?

Could be in the preprocessor object along with the other functions. Not the most elegant for now

Or a function there that points to other dcpy code