engarde-dev / engarde

A library for defensive data analysis.
MIT License

ENH: Start of adding slicing #5

Closed jnmclarty closed 9 years ago

jnmclarty commented 9 years ago

Hi Tom!

Nice project you started here. What are your plans for it? This fits in nicely with a project I'm working on.

I'm curious whether you would be inclined to add a slicing layer to the framework. This PR is just to illustrate what it might look like; I haven't even tried to run this code. I want to bounce it off you, ask if you're even interested in contributors, etc., before I spend too much time on it.

It would enable checking "recent" data against the values in the "pre-recent" data, and it would also be a massive speed boost for those kinds of checks (as opposed to checking the entire thing). Before I had this idea, I prepared a second PR with a few one-off examples where it might be appropriate.

jnmclarty commented 9 years ago

Obviously, I haven't integrated the logic into the decorators.

TomAugspurger commented 9 years ago

Hi Jeffrey, thanks for the PR.

The project is very much alpha at the moment. It seemed like a good test case for the new pipe method in pandas.

Re this PR, something like this would absolutely be useful. I'm going to think a bit about the API, since I can see it being applied to pretty much every function. I think your suggestion is the right way to go though.

jnmclarty commented 9 years ago

I started trying to use a few of the functions with my first implementation method, and it started to feel awkward passing the slice object as a parameter.

within_range(df, slice(-1), items) or something akin to within_range(df, items, sl=slice(-1))

...so, I created an OO approach. Both 6486e65 and 30ef13d pass test_within_range, and both have strengths/weaknesses.

I'm going to sleep on it...let me know if you have any better ideas, for the API.

Edit: fixed specific commit.

jnmclarty commented 9 years ago

...thinking more about the API...

I think we can break the checks into bins based on the slicing input required, and then toggle the various output settings globally-esque using an instance of CheckSet or something.

Inputs

Each of the following could optionally come with args, kwargs.

  1. Data (+ implied slice to check) (+ implied slice to derive check)
  2. Data + explicit slice to check (+ implied slice to derive check) [+ indexing method]
  3. Data + explicit slice to derive check (+ implied slice to check) [+ indexing method]
  4. Data + explicit slice to check + explicit slice to derive check [+ indexing method]

Outputs

  1. AssertionException / Original Data (Implied Verified)
  2. CustomException / Original Data (Implied Verified)
  3. (Original Data, Bool) (Explicit Verified)
  4. (Original Data, ND-indexed Bools) (via Series/DataFrame/Panel)
  5. Bool
  6. Bool, Obj

SO...

I'm wondering if we couldn't create something clever along the lines of...

acheck(df, slc, slc_d, *args, ix=None, iloc=None, ix_d=None, iloc_d=None, **kwargs): where...

  1. we always type-check slc and slc_d, handle accordingly
  2. assume that Input number 3 is unlikely, and therefore people could use the appropriate *_d kwarg, for that case.
  3. assume that people will use these in factory methods, that remove the need for the slicing ugliness

jnmclarty commented 9 years ago

Ok, @TomAugspurger you're going to want to have a look at the API I've got started here.

The checks are slightly harder to write/read, but everything else is, IMO pretty slick.

TODO: Upgrade all the existing checks, tweak docs.

Take a look at my "examples.py"...a tad sloppy, it's right in the module folder for now.

jnmclarty commented 9 years ago

Oh, and included is a way to slice the frame for the check, and the way to slice for the derive.

"Derive" is the slice of the frame to use for calculating relative constants.

So, one could pass a frame with 3 years of history, calculate a standard deviation using the "derive" slice of say, year 1, 2, and 11 months, then check only the trailing month.
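A rough sketch of the derive/check split described above, in plain pandas. The function name `within_n_std_of_derived`, its parameters, and the sample data are hypothetical and only illustrate the idea; they are not the API from this PR:

```python
import numpy as np
import pandas as pd


def within_n_std_of_derived(df, check, derive, n=3):
    """Hypothetical check: derive mean/std from one slice, verify another."""
    derived = df.loc[derive]
    mean, std = derived.mean(), derived.std()
    checked = df.loc[check]
    # Every checked value must lie within n standard deviations of the derived mean.
    within = (checked - mean).abs().le(n * std, axis="columns").all().all()
    if not within:
        raise AssertionError("values outside the range derived from history")
    return df


# Three years of deterministic daily data (values stay within [-1, 1]).
idx = pd.date_range("2012-01-01", "2014-12-31", freq="D")
frame = pd.DataFrame({"x": np.sin(np.arange(len(idx)))}, index=idx)

derive = slice("2012-01-01", "2014-11-30")  # 2 years and 11 months
check = slice("2014-12-01", "2014-12-31")   # trailing month only
result = within_n_std_of_derived(frame, check, derive, n=3)
```

Only the trailing month is compared against the statistics, so the cost of the comparison scales with the checked slice rather than the whole frame.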

jnmclarty commented 9 years ago

...it'll be less confusing, if all checks use the same function signature.

TomAugspurger commented 9 years ago

Sorry it took me a bit longer to get back to this, although I have had time to stew things over.

I'm going to propose something a bit different than what you've got started (thanks for that btw, it really is helpful to see your examples). My PR would look something like...

def check_slice(df=None, check=None, loc=None):
    subset = df.loc[loc]
    check(subset)
    return df

So using your example

In [2]: ind = pd.date_range('2010', '2015', freq='A')

In [3]: adf = pd.DataFrame({'one' : range(5), 'two' : [ i ** 2 for i in range(5)]}, index=ind)

In [4]: adf.ix[4,'two'] = pd.np.NaN

In [5]: adf
Out[5]:
            one  two
2010-12-31    0    0
2011-12-31    1    1
2012-12-31    2    4
2013-12-31    3    9
2014-12-31    4  NaN

In [12]: check_slice(adf, none_missing, (slice(None), ['one']))
Out[12]:
            one  two
2010-12-31    0    0
2011-12-31    1    1
2012-12-31    2    4
2013-12-31    3    9
2014-12-31    4  NaN

For use in pipelines, I'd suggest the user partially apply the function

In [18]: from engarde import checks

In [19]: from functools import partial

In [20]: even_monotonic = partial(check_slice, check=checks.is_monotonic,
                                  loc=(slice(None, None, 2), slice(None)))
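
A self-contained sketch of the partial-application idea with `DataFrame.pipe`. To avoid depending on engarde here, `no_missing` is a hypothetical stand-in for a check such as `none_missing`, and the index is built explicitly rather than with a frequency alias:

```python
from functools import partial

import numpy as np
import pandas as pd


def check_slice(df, check, loc):
    """Run `check` on df.loc[loc] only, then return the full frame."""
    check(df.loc[loc])
    return df


def no_missing(df):
    """Stand-in check (hypothetical): fail if any value is NaN."""
    assert not df.isnull().values.any(), "missing values found"
    return df


ind = pd.to_datetime([f"{year}-12-31" for year in range(2010, 2015)])
adf = pd.DataFrame(
    {"one": range(5), "two": [float(i ** 2) for i in range(5)]}, index=ind
)
adf.loc[ind[4], "two"] = np.nan

# Bind the check and the slice up front so the result fits into a pipeline.
check_one = partial(check_slice, check=no_missing, loc=(slice(None), ["one"]))
out = adf.pipe(check_one)
```

Because only column 'one' is checked, the NaN in column 'two' does not trigger a failure and the original frame is passed along unchanged.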

It looks like your PR has a bit more functionality that my simple attempt here may not cover. I really want to keep this library simple since as you've seen I'm not the most responsive steward ;)

Anyway, I'm going to play with each of these for a bit and see how they feel. I'd be curious to hear your thoughts.

jnmclarty commented 9 years ago

The check_slice idea won't work as elegantly with the framework I need to integrate this with.

none_missing is an example where check_slice doesn't require data from the rest of the dataframe, but for the rest of the checks (e.g., check_slice(df, within_std_dev, n=2)) ...you end up needing to inject the results of the calculation via arguments. Blah.

...yah, my PR as is does build out a bit more functionality than you touched on. Actually, quite a bit.
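
To make the injection problem concrete, here is a minimal sketch. `within_n_std` is a hypothetical check, not engarde's within_std_dev, and `check_slice` is re-declared so the snippet stands alone:

```python
from functools import partial

import numpy as np
import pandas as pd


def check_slice(df, check, loc):
    """Run `check` on df.loc[loc] only, then return the full frame."""
    check(df.loc[loc])
    return df


def within_n_std(df, mean, std, n=2):
    """Hypothetical check: it only sees the slice, so mean/std must be passed in."""
    assert ((df - mean).abs() <= n * std).all().all(), "value outside range"
    return df


frame = pd.DataFrame({"x": np.sin(np.arange(100.0))})
history = frame.iloc[:-10]

# The statistics are computed outside the check and injected as arguments.
recent_ok = partial(
    within_n_std, mean=history["x"].mean(), std=history["x"].std(), n=3
)
out = check_slice(frame, recent_ok, slice(90, None))
```

The caller has to compute the historical statistics separately and thread them through, which is the extra plumbing being objected to.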

Despite me not demoing it, the pipe functionality should still work, with everything I had built.

This might be where we fork, cause I need this functionality I built ASAP. Glad we chatted, don't think it was a waste for either of us. I'll give the hat-tip to your project, in my eventual docs, and will be watching what you do. Maybe we merge one day.

Amazingly, if you exclude our slicing implementation, our users could swap out our libraries interchangeably. With just an import statement change.

jnmclarty commented 9 years ago

It's likely buggy, and it's late...but this is where I'm taking my fork:

https://github.com/jnmclarty/validada

Hope it's okay with you. Let me know if there are any licensing issues. I don't know exactly what I need to do.

TomAugspurger commented 9 years ago

Cool. I'll keep my eye on that!

Licenses should be compatible. If you do need me to re-license it as something other than MIT, let me know.