Closed jnmclarty closed 9 years ago
Obviously, I haven't integrated the logic into the decorators.
Hi Jeffrey, thanks for the PR.
The project is very much alpha at the moment. It seemed like a good test case for the new pipe method in pandas.
Re this PR, something like this would absolutely be useful. I'm going to think a bit about the API, since I can see it being applied to pretty much every function. I think your suggestion is the right way to go though.
I started trying to use a few of the functions with my first implementation method, and it started to feel awkward using parameters to pass the slice object.
within_range(df, slice(-1), items)
or something akin to within_range(df, items, sl=slice(-1))
...so, I created an OO approach. Both 6486e65 and 30ef13d pass test_within_range, and both have strengths/weaknesses.
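For readers following along, here is a minimal sketch of the keyword variant mentioned above. It is not the code in either commit: the body of within_range, the meaning of items (assumed to map column name to a (lower, upper) pair), and the positional interpretation of sl are all assumptions made for illustration.

```python
import pandas as pd


def within_range(df, items, sl=None):
    """Illustrative only: items maps column -> (lower, upper); sl selects rows by position."""
    # With no slice given, the whole frame is checked.
    subset = df if sl is None else df.iloc[sl]
    for col, (lower, upper) in items.items():
        # Series.between is inclusive on both ends by default.
        if not subset[col].between(lower, upper).all():
            raise AssertionError("column %r outside (%r, %r)" % (col, lower, upper))
    # Return the untouched frame so the function can sit in a pipeline.
    return df


df = pd.DataFrame({"a": [1, 2, 3, 99]})

# slice(-1) selects every row except the last, so the trailing 99 is not checked.
result = within_range(df, {"a": (0, 10)}, sl=slice(-1))
```

The slice travels as just another argument here, which is the part that started to feel awkward once several checks needed it.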
I'm going to sleep on it...let me know if you have any better ideas, for the API.
Edit: fixed specific commit.
...thinking more about the API...
I think we can break the checks into bins based on the slicing input required, and then toggle the various output settings globally-esque using an instance of CheckSet or something.
Each of the following could optionally come with args, kwargs.
SO...
I'm wondering if we couldn't create something clever along the lines of...
def acheck(df, slc, slc_d, *args, ix=None, iloc=None, ix_d=None, iloc_d=None, **kwargs):
where...
Ok, @TomAugspurger you're going to want to have a look at the API I've got started here.
The checks are slightly harder to write/read, but everything else is, IMO, pretty slick.
TODO: Upgrade all the existing checks, tweak docs.
Take a look at my "examples.py"...a tad sloppy, it's right in the module folder for now.
Oh, and included is a way to slice the frame for the check, and the way to slice for the derive.
"Derive" is the slice of the frame to use for calculating relative constants.
So, one could pass a frame with 3 years of history, calculate a standard deviation using the "derive" slice of say, year 1, 2, and 11 months, then check only the trailing month.
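A rough sketch of that check/derive split, under stated assumptions: the function name within_n_std_sliced, its signature, and the use of positional (iloc) slices are hypothetical and are not taken from the PR.

```python
import numpy as np
import pandas as pd


def within_n_std_sliced(df, check_slc, derive_slc, n=3):
    """Hypothetical check: bounds come from derive_slc, tested rows come from check_slc."""
    # The "derive" slice supplies the relative constants (mean and std per column).
    derive = df.iloc[derive_slc]
    mean, std = derive.mean(), derive.std()
    # Only the "check" slice is compared against those constants.
    subset = df.iloc[check_slc]
    ok = ((subset - mean).abs() <= n * std).all().all()
    if not ok:
        raise AssertionError("values outside mean +/- n*std of the derive slice")
    return df


# Three years of monthly history.
idx = pd.date_range("2012-01-01", periods=36, freq="MS")
df = pd.DataFrame({"x": np.tile([9.0, 10.0, 11.0], 12)}, index=idx)

# Derive from the first 35 months, then check only the trailing month.
result = within_n_std_sliced(df, slice(-1, None), slice(None, -1), n=3)
```

Because only one row is compared, the check cost stays small no matter how long the history used for the derive step is.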
...it'll be less confusing, if all checks use the same function signature.
Sorry it took me a bit longer to get back to this, although I have had time to stew things over.
I'm going to propose something a bit different from what you've got started (thanks for that btw, it really is helpful to see your examples). My PR would look something like...
def check_slice(df=None, check=None, loc=None):
    subset = df.loc[loc]
    check(subset)
    return df
So using your example
In [2]: ind = pd.date_range('2010', '2015', freq='A')
In [3]: adf = pd.DataFrame({'one' : range(5), 'two' : [ i ** 2 for i in range(5)]}, index=ind)
In [4]: adf.ix[4,'two'] = pd.np.NaN
In [5]: adf
Out[5]:
            one  two
2010-12-31    0    0
2011-12-31    1    1
2012-12-31    2    4
2013-12-31    3    9
2014-12-31    4  NaN
In [12]: check_slice(adf, none_missing, (slice(None), ['one']))
Out[12]:
            one  two
2010-12-31    0    0
2011-12-31    1    1
2012-12-31    2    4
2013-12-31    3    9
2014-12-31    4  NaN
For use in pipelines, I'd suggest the user partially apply the function
In [18]: from engarde import checks
In [19]: from functools import partial
In [20]: even_monotonic = partial(check_slice, check=checks.is_monotonic,
                                  loc=(slice(None, None, 2), slice(None)))
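To make the pipeline use concrete, here is a self-contained sketch. It repeats check_slice from above and swaps in a local stand-in check, is_increasing, so that it runs without engarde installed; that stand-in and the sample data are assumptions, not part of either library.

```python
from functools import partial

import pandas as pd


def check_slice(df=None, check=None, loc=None):
    # Run the check on the selected subset, but hand back the full frame.
    subset = df.loc[loc]
    check(subset)
    return df


def is_increasing(df):
    """Stand-in check (not engarde's): every column must be non-decreasing."""
    for name in df.columns:
        if not df[name].is_monotonic_increasing:
            raise AssertionError("column %r is not monotonic" % name)
    return df


# Partially apply everything except the frame, which pipe supplies.
even_monotonic = partial(check_slice, check=is_increasing,
                         loc=(slice(None, None, 2), slice(None)))

df = pd.DataFrame({"one": [0, 9, 1, 8, 2], "two": [0, 5, 4, 3, 16]})

# Rows 0, 2 and 4 are increasing, so the check passes and the full frame flows on.
result = df.pipe(even_monotonic).assign(three=lambda d: d["one"] + d["two"])
```

The partial leaves df as the only missing argument, which is exactly what DataFrame.pipe passes in.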
It looks like your PR has a bit more functionality that my simple attempt here may not cover. I really want to keep this library simple since as you've seen I'm not the most responsive steward ;)
Anyway, I'm going to play with each of these for a bit and see how they feel. I'd be curious to hear your thoughts.
The check_slice idea won't work as elegantly with the framework I need to integrate this with.
none_missing is an example where check_slice doesn't require data from the rest of the dataframe, but for the rest of the checks (e.g. check_slice(df, within_std_dev, n=2)) you end up needing to inject the results of the calculation via arguments. Blah.
...yah, my PR as is does build out a bit more functionality than you touched on. Actually, quite a bit.
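A small sketch of the injection problem described above, under assumptions: within_bounds is a hypothetical check written for this illustration, and the two-standard-deviation band is arbitrary. Since the check only ever sees the subset, the statistics have to be computed outside and passed in.

```python
from functools import partial

import pandas as pd


def check_slice(df=None, check=None, loc=None):
    subset = df.loc[loc]
    check(subset)
    return df


def within_bounds(df, lower, upper):
    """Hypothetical check that needs its bounds supplied from outside."""
    if not ((df >= lower) & (df <= upper)).all().all():
        raise AssertionError("values outside the supplied bounds")
    return df


df = pd.DataFrame({"x": [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.5]})

# The caller has to compute the constants from the history first...
history = df.loc[:5]  # .loc is label-inclusive, so this is rows 0 through 5
mean, std = history["x"].mean(), history["x"].std()

# ...and then inject them into the check through arguments.
check_recent = partial(within_bounds, lower=mean - 2 * std, upper=mean + 2 * std)
result = check_slice(df, check=check_recent, loc=(slice(6, None), slice(None)))
```

Every check that depends on the rest of the frame would need this extra bookkeeping at the call site.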
Despite me not demoing it, the pipe functionality should still work, with everything I had built.
This might be where we fork, because I need the functionality I built ASAP. Glad we chatted, I don't think it was a waste for either of us. I'll give the hat-tip to your project in my eventual docs, and will be watching what you do. Maybe we merge one day.
Amazingly, if you exclude our slicing implementation, our users could swap out our libraries interchangeably. With just an import statement change.
It's likely buggy, and it's late...but this is where I'm taking my fork:
https://github.com/jnmclarty/validada
Hope it's okay with you. Let me know if there are any licensing issues. I don't know exactly what I need to do.
Cool. I'll keep my eye on that!
Licenses should be compatible. If you do need me to re-license it as something other than MIT, let me know.
Hi Tom!
Nice project you started here. What are your plans with it? This fits in nicely with a project I'm working on.
I'm curious if you would be inclined to add a slicing layer to the framework? This PR is just to illustrate what it might look like. I haven't even tried to run this code. I want to bounce it off you, ask if you're even interested in contributors, etc., before I spend too much time on it.
It would enable checking "recent" data against the values in the "pre-recent", and it would also be a massive speed boost for those kinds of checks (as opposed to checking the entire thing). I prepared a second PR, before I had this idea, with a few one-off examples where it might be appropriate.