ZaxR / bulwark

Bulwark is a package for convenient property-based testing of pandas dataframes.
GNU Lesser General Public License v3.0
223 stars 26 forks source link

Might we add an optional message against the checks? #47

Open ianozsvald opened 5 years ago

ianozsvald commented 5 years ago

Consider ck.has_no_nans(df_modeling_data); where I only want to cause an error if my dataframe contains NaNs before it goes into scikit-learn.

It would be informative to add an error message that is reported to the user, something like ck.has_no_nans(df_modeling_data, message="scikit-learn requires no NaNs to be present"); so we have some context along with our check.

Might it be sensible to add optional messages to the checks? The benefit would be the accumulation of documentation about the expected output of transformations and the expected state of ingested data which otherwise might be forgotten or written in comments which may not keep up with the checks themselves.

ZaxR commented 5 years ago

I like the idea of overwriting the error message, and it's been on my mind. Currently I'm more focused on enhancing the default error messaging (see issue #13), so I'd appreciate your input there as well, but this would be a nice feature.

The biggest challenge in my mind would be creating a common interface to overriding the error message across checks, some of which call other checks (has_no_nans calls has_no_x under the hood) and/or have different error messages depending on state (see has_columns).

Thoughts? And would you be willing to take on adding this capability?

ianozsvald commented 5 years ago

I probably won't help with this as it isn't a blocker for me but I might know folk who will. I run a mailing list for data scientists, I'll direct them at this in the next issue (early next week) and hopefully some will opt to help.

ianozsvald commented 5 years ago

Thinking out loud - a "reason for" message (e.g. a because= or reason=) would propagate separately to the diagnostic/failure message (your msg). If one method calls another then the reason-for message can just be passed along. To my mind these two strings serve different purposes. We don't always need the reason-for message, we do always need the diagnostic.