Closed fvankrieken closed 1 month ago
TODO before review
I need to add tests for commit 8 (pd Series logic), but I don't anticipate needing to change non-test code as I've done a lot of use-case testing. Given that, would appreciate input and review!
I have currently managed to cover our use cases (I think!), but I also think it's perfectly fine to have a specific preprocessing function called `process_cofos` or something like that if need be
Gotcha - where does that type of thing live?
Could be in the preprocessor object along with the other functions. Not the most elegant for now
Or a function there that points to other dcpy code
Resolves #633, follows up on #763
Bit of context first
Motivation/Scope
763 gave a very basic framework for defining preprocessing steps in yml and validating/running them in python. Pandas was chosen as the tool for transformations here because of its ease of data cleaning relative to sql, though I could be convinced to go a different route. The current framework is this `Preprocessor` class in `ingest.transform` that defines the valid preprocessing functions to call.

For this PR, I wanted to take most/all of the library datasets that have custom "script" sources (which in reality have very normal sources much like our other datasets - api calls, local files, etc), but actually rely on their "source" script to perform some sort of preprocessing.
The first 6 commits here are more generic ingest stuff, code cleanup, etc - then after that, each commit is adding one or multiple templates from library to ingest, and adding the appropriate yml and new preprocessors as needed to correctly spit out what's needed. Going commit-by-commit will definitely be easiest, and I think I've sorted out most of the cases where I might be overwriting changes in future commits.
Most of the logic is concerned with `dcpy.lifecycle.ingest.transform` - the bottom function reads in the initial parquet file as a dataframe, then applies preprocessing steps as defined in yml input. There are some that are a little more "extract"-focused as well.

Happy to sit down and chat about anything if it'd be helpful
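As a minimal sketch of that shape - hypothetical names throughout, not the actual `dcpy.lifecycle.ingest.transform` API - steps parsed from yml are dispatched to methods on a preprocessor object:

```python
import pandas as pd

class Preprocessor:
    """Illustrative registry of valid preprocessing functions."""

    def rename_columns(self, df: pd.DataFrame, name_map: dict) -> pd.DataFrame:
        return df.rename(columns=name_map)

    def drop_columns(self, df: pd.DataFrame, columns: list) -> pd.DataFrame:
        return df.drop(columns=columns)

def run_processing_steps(df: pd.DataFrame, steps: list[dict]) -> pd.DataFrame:
    """Apply each named step in order; unknown step names fail loudly."""
    proc = Preprocessor()
    for step in steps:
        func = getattr(proc, step["name"], None)
        if func is None:
            raise ValueError(f"'{step['name']}' is not a valid preprocessing function")
        df = func(df, **step.get("args", {}))
    return df
```

In the real module the dataframe would come from reading the initial parquet file (e.g. `pd.read_parquet`), and the step list from the validated yml template.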
Commit summaries
General
- `Path` objects don't serialize nicely by default, as it turns out
- `run.py`), and the other functions require it as an arg. Just with testing, running it myself, this seemed to make more sense. Would love thoughts around this, as it maybe is still a tad clunky
- `transform` - mainly around what arguments are getting shuttled along, taking paths as inputs rather than returning as outputs. Fairly in line with commit 3
- `ingest.configure`, where it doesn't belong
- `isinstance` function. `isinstance` doesn't work for parameterized generics, and there are a few that would be nice to check - literals, types of lists/dicts, etc. Needed a custom line for `Union` too, because `isinstance` does work on them, but wouldn't work if one of the member types is a parameterized generic
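A rough sketch of that kind of check, using `typing.get_origin`/`get_args` (illustrative only, not the actual validation code; note that 3.10-style `int | str` unions would need an extra `types.UnionType` case):

```python
from typing import Literal, Union, get_args, get_origin

def matches_type(value, annotation) -> bool:
    """isinstance-style check that also handles parameterized generics."""
    origin = get_origin(annotation)
    if origin is None:
        # Plain class (int, str, ...) - isinstance works directly
        return isinstance(value, annotation)
    if origin is Union:
        # Recurse per member, since a member may itself be parameterized
        return any(matches_type(value, arg) for arg in get_args(annotation))
    if origin is Literal:
        return value in get_args(annotation)
    if not isinstance(value, origin):
        return False
    args = get_args(annotation)
    if origin is list:
        return all(matches_type(item, args[0]) for item in value)
    if origin is dict:
        key_t, val_t = args
        return all(
            matches_type(k, key_t) and matches_type(v, val_t)
            for k, v in value.items()
        )
    return True
```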
- `ingest`. That ideally should also validate all processing steps defined. Moved this since we're getting a little out of the configure logic

Dataset-specific
Links here will generally be to the python scripts in `library/script`.

- `append_prev` to actually work. Additionally added the generic pd.Series function. Essentially, allow users to specify a string like "map" or "str.split" that is a callable series function (`df["column"].str.split(...)`), and kw arguments to said function. Also adds logic to validate calls to this function. Due to weirdness around what `inspect` returns for these functions, I needed to rework the validation function a bit. But all tests are passing. For these functions, types cannot be validated, but at least the arguments (presence of, whether they're missing) are validated
- `sql` specified in gdal options, so that logic needs to be recreated. Tweaks `rename_columns` and adds a function to clean column names
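As a rough illustration of the generic pd.Series mechanic (hypothetical helper name; the real code additionally validates the requested call against the function's signature via `inspect`):

```python
import operator

import pandas as pd

def apply_series_method(
    df: pd.DataFrame, column: str, function: str, **kwargs
) -> pd.DataFrame:
    """Resolve a dotted method path like "map" or "str.split" on a Series
    and call it with the provided keyword arguments."""
    method = operator.attrgetter(function)(df[column])
    df[column] = method(**kwargs)
    return df
```

So `apply_series_method(df, "boro", "str.split", pat=",")` mirrors writing `df["boro"].str.split(pat=",")` by hand.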