Open shippy opened 6 years ago
Interesting... I hadn't considered 1. Do you have any proposed APIs to support splitting the pipeline in two? I'm not quite sure what it would look like...
I did hit pain point 2 when I was using engarde more. Not sure how best to handle it either.
Hm :) Perhaps engarde.decorators.sieve? In my head, it would maybe look like this:
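import pandas as pd
import engarde.decorators as ed

@ed.sieve  # proposed decorator; does not exist in engarde
@ed.verify_all(rational)  # rational: a user-defined row-wise check
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

trains_good, trains_bad = unload()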
sieve would have to catch all assertions, extract the indices of the rows that contain the error, and return a tuple of data frames. This might not make sense for all checks, but I think it makes sense for a lot of them?
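For discussion, here is a minimal sketch of how such a splitter might behave. It is not engarde code: the sieve below is a hypothetical decorator factory that takes the row-wise check directly instead of catching an AssertionError, because once the inner check has raised, the full frame is no longer available to an outer decorator. The rational check and the price column are made-up stand-ins.

```python
import functools

import pandas as pd


def sieve(check):
    """Hypothetical decorator factory (not part of engarde).

    `check` takes a DataFrame and returns a boolean Series aligned
    to its index; True means the row passed.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            mask = check(df)
            # Return (passing rows, failing rows) instead of raising
            return df[mask], df[~mask]
        return wrapper
    return decorate


def rational(df):
    # Illustrative row-wise check: prices must be positive
    return df["price"] > 0


@sieve(rational)
def unload():
    # Small in-memory frame standing in for the CSV download
    return pd.DataFrame({"price": [10.0, -1.0, 5.0]}, index=["a", "b", "c"])


trains_good, trains_bad = unload()
```

This only works for checks that can be expressed as a per-row boolean mask, which matches the caveat that it might not make sense for all checks.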
I find that I often require two things from the same assumption-checking code:
Alternatively, get a single data frame with a column that indicates whether they passed the check.
I understand the original intention of engarde is to fail early, and it does provide some tools for (2), but there are two particular pain points:
verify_all returns a dataframe in AssertionError.args[1]. In others, it is less so: none_missing returns a list of (index, column) tuples, which all have to be passed to pandas.DataFrame.loc separately.
Can engarde be used for my use case, or is that too far away from engarde's philosophy?
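To make the second pain point concrete, here is a rough sketch of how a list of (index, column) tuples could be collapsed into a single row selection, and how the same information could feed a pass/fail column. The frame, the bad_cells list, and the passed_check column name are illustrative assumptions, not output captured from engarde.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, None, 3.0], "y": ["a", "b", None]},
    index=[10, 11, 12],
)

# Assumed shape of the reported failures: (index, column) tuples
bad_cells = [(11, "x"), (12, "y")]

# Collect the unique row labels so .loc is called once, not per tuple
bad_rows = pd.Index([idx for idx, _ in bad_cells]).unique()

failed = df.loc[bad_rows]
passed = df.drop(index=bad_rows)

# Single-frame alternative: flag each row instead of splitting
flagged = df.assign(passed_check=~df.index.isin(bad_rows))
```

Splitting and flagging carry the same information, so which one is preferable mostly depends on whether the downstream pipeline wants two frames or one.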