alteryx / woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.
https://woodwork.alteryx.com
BSD 3-Clause "New" or "Revised" License
145 stars 20 forks source link

Bug in `.ww.describe()` that stems from the computation of `percentile`: will either give KeyError or return the wrong value. #1872

Open JanetMatsen opened 1 month ago

JanetMatsen commented 1 month ago

The problem

There is a .iat missing in the definition of percentile that is used by .ww.describe() (see here).

Reproducible example, using woodwork==0.31.0

This demonstrates the bug:

import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df['new_index'] = [3, 4, 5, 6, 7]     #  Note that for this example, the 1st index value can't be 1.  Use any other integer. 
df.set_index('new_index', inplace=True)
import woodwork as ww
df.ww.init()
df.ww.describe()

The error is KeyError: 1 because it's trying to look up index 1 in the a Series. (Here I've set it to 3).

The bug comes from the woodwork definition of percentile for a pandas series:

import math
def percentile(N, percent, count):                                                                     
    """                                                                                                
    Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump

    Find the percentile of a list of values.                                                           

    Args:                                                                                              
        N (pd.Series): Series is a list of values. Note N MUST BE already sorted.                      
        percent (float): float value from 0.0 to 1.0.                                                  
        count (int): Count of values in series                                                         

    Returns:                                                                                           
        Value at percentile.                                                                           
    """                                                                                                
    k = (count - 1) * percent                                                                          
    f = math.floor(k)                                                                                  
    c = math.ceil(k)                                                                                   
    if f == c:                                                                                         
        return N[int(k)]        # <-- the stack overflow question was written for lists, and this line needs to be modified for pd.Series.                                                           
    d0 = N.iat[int(f)] * (c - k)                                                                       
    d1 = N.iat[int(c)] * (k - f)                                                                       
    return d0 + d1  

Run this example by setting series = df['a'] and running percentile() directly.

series = df['a']
percentile(N=series, percent=0.25, count=len(series))

Result:

In [3]: series = df['a']
   ...: percentile(N=series, percent=0.25, count=len(series))
   ...:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

Note that in cases when int(k) happens to be a value in the index it will give you an answer, but there is no reason it will actually correspond to the intended position!

If I'd had

df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df2['new_index'] = [3, 4, 5, 6, 1] 
df2.set_index('new_index', inplace=True) 

then percentile(N=df2['a'], percent=0.25, count=len(series)) gives 5 for the 25th percentile 😱

In [9]: df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
   ...: df2['new_index'] = [3, 4, 5, 6, 1]
   ...: df2.set_index('new_index', inplace=True)

In [10]: df2
Out[10]:
           a
new_index
3          1
4          2
5          3
6          4
1          5

In [11]: percentile(N=df2['a'], percent=0.25, count=len(series))
Out[11]: 5

The fix

The fix is quite simple. Add a .iat like the lines below have.

import math 

# Proposed fix (add .iat to the if f==c case):
def percentile(N, percent, count):                                                                     
    """                                                                                                
    Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump

    Find the percentile of a list of values.                                                           

    Args:                                                                                              
        N (pd.Series): Series is a list of values. Note N MUST BE already sorted.                      
        percent (float): float value from 0.0 to 1.0.                                                  
        count (int): Count of values in series                                                         

    Returns:                                                                                           
        Value at percentile.                                                                           
    """                                                                                                
    k = (count - 1) * percent                                                                          
    f = math.floor(k)                                                                                  
    c = math.ceil(k)                                                                                   
    if f == c:                                                                                         
        return N.iat[int(k)]      # Fix is just addint .iat                                                                         
    d0 = N.iat[int(f)] * (c - k)                                                                       
    d1 = N.iat[int(c)] * (k - f)                                                                       
    return d0 + d1  

Now it works:

In [5]: series = df['a']
   ...: percentile(N=series, percent=0.25, count=len(series))
Out[5]: 2
JanetMatsen commented 1 month ago

I tried to push a (one-line) fix to this, but don't seem to have permission to push.

(venv) ➜  woodwork git:(issue1872-bugfix_in_describe_percentile) ✗ 
git push origin issue1872-bugfix_in_describe_percentile
ERROR: Permission to alteryx/woodwork.git denied to JanetMatsen.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.