IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
BSD 2-Clause "Simplified" License

[BUG] python and sdc-compiled functions generate different output with same input #996

Closed dlee992 closed 2 years ago

dlee992 commented 2 years ago

Reporting a bug

In [25]: num_columns = 20
    ...: features = [f'col{i}' for i in range(num_columns)]
    ...: df = pd.DataFrame(np.random.rand(5, num_columns), columns=features)
    ...: target_col = 'col0'

In [26]: df
       col0      col1      col2      col3      col4      col5  ...    
0  0.847436  0.116855  0.782481  0.485027  0.027340  0.328801  ...  
1  0.482504  0.845380  0.753603  0.535273  0.243581  0.861275  ...  
2  0.190646  0.539439  0.901377  0.770925  0.908361  0.454777  ...  
3  0.355888  0.451189  0.672876  0.745438  0.576982  0.907190  ...  
4  0.535901  0.394481  0.118837  0.199040  0.557401  0.653302  ...  

[5 rows x 20 columns]

In [27]: def _modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt

In [28]: from numba import njit
    ...: @njit
    ...: def jit_modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt

In [29]: _modified_pipeline(df, target_col)
Out[29]: (1, 3, 4)

In [30]: jit_modified_pipeline(df, target_col)
<ipython-input-28-bbc0261853d0>:5: NumbaPerformanceWarning:
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.
Out[30]: (1, 2, 3)

As you can see, pure Python and the SDC-compiled function return different outputs for the same input.

Python 3.7.9 & numba 0.52.0 & sdc 0.38.0 & pandas 1.2.0

kozlov-alexey commented 2 years ago

@dlee992 Hi, thank you for the report! This bug is caused by an incorrect layout definition for SeriesType. Unfortunately, strided arrays/series are not well covered by our tests, but we will fix it shortly. You can work around it by creating the DF from a dictionary built from the column names and the transposed array (so that the native array layout of the columns is contiguous), i.e.

df = pd.DataFrame(dict(zip(features, np.random.rand(5, num_columns).transpose())))

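To illustrate what "strided" means here, the following is a minimal NumPy-only sketch (it does not use SDC or pandas, and the variable names `data`, `col` and `packed` are chosen for illustration). It assumes the usual C-ordered float64 array returned by np.random.rand, where a single column is a view whose elements are not adjacent in memory:

```python
import numpy as np

num_columns = 20
data = np.random.rand(5, num_columns)   # C-ordered float64: each row is contiguous

col = data[:, 0]                        # view of one column of the 2-D array
print(col.flags['C_CONTIGUOUS'])        # False: neighbours are num_columns * 8 bytes apart
print(col.strides)                      # (160,)

packed = np.ascontiguousarray(col)      # contiguous copy of the same values
print(packed.flags['C_CONTIGUOUS'])     # True
print(packed.strides)                   # (8,)
```

Code that assumes a contiguous layout but receives a view like `col` would step through memory with the wrong stride, which is one way results such as the ones above can differ.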
dlee992 commented 2 years ago

@kozlov-alexey, thanks, the workaround makes sense. I tested a bit more, as below:

def foo1(df):
    choose_col = 'col1'
    filter_series = df[choose_col].apply(lambda x: 0 if x < 0.5 else 1)
    filtered_sum = (df[target_col] * filter_series).sum()
    return filtered_sum

def foo2(df):
    lst  = ['col1', 'col2', 'col3']
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()

foo1 compiles and runs with @njit, but filtered_sum drifts slightly from the pure Python result, and the drift grows as the dataframe gets more rows. I am not sure whether this is expected (because of fastmath or other numerical optimizations) or a bug. foo2, on the other hand, fails to compile. Why does this happen?

# foo1 without and with njit, df has 10 rows
- [4.207398879779037, 10.0]
?                  ^

+ [4.207398879779036, 10.0]
?                  ^

# foo1 without and with njit, df has 10_000_000 rows
- [2501705.7589422013, 10000000.0]
?                 ^^^

+ [2501705.7589421924, 10000000.0]
?                ++ ^

# foo2 compilation error
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/", line 482, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/", line 423, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot request literal type.

File "", line 57:
def _modified_pipeline(df):
    <source elided>
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)

During: typing of intrinsic-call at /Users/.../sdc/tests/tests_ant/ (57)

Tested on the newest master branch of SDC.

kozlov-alexey commented 2 years ago

@dlee992, the second error is a current limitation of SDC, which is built on Numba, a JIT compiler with static typing: iterating over heterogeneous collections with normal Python syntax is not supported. The reason is simple: in your example the DF could have columns of different types, so the variable filtered_sum would need a different type on different iterations of the loop. Specifically for this case, Numba provides the literal_unroll feature, which allows the code to be compiled with minimal changes, e.g.

from numba import literal_unroll

def foo2_error(df):
    lst  = ('col1', 'col2', 'col3')      # a tuple of column names instead of list
    results = []
    for cho_col in literal_unroll(lst):  # literal_unroll used
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()
        results.append(filtered_sum)     # collect each sum; assumes all sums share one type (e.g. float64)

    return results

The above should work. As for the first problem, I think this deviation in precision is expected to some degree: parallelizing the sum in SDC means the values are added in a different order. As far as I can see, when operating on a sorted sequence with parallel=False, summation via an explicit loop gives exactly the same result for the compiled and pure Python versions:

# on sorted data with sum via explicit loop:
arr_result:        2500270.1616518456   # numba jitted explicit loop, parallel=False
arr_result_ref:  2500270.1616518456   # python explicit loop 

Thank you for pointing this out, though; we will dig deeper to see whether something can be improved.
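The effect of summation order can be reproduced without SDC or Numba. The sketch below is plain Python and only an illustration: `chunked_sum` is a hypothetical helper that mimics a reduction over partial sums, not SDC's actual implementation. It also computes the relative difference between the two 10_000_000-row results reported earlier in this thread.

```python
import math
import random

def chunked_sum(values, n_chunks):
    """Sum each chunk separately, then add the partial sums (reduction-style)."""
    size = math.ceil(len(values) / n_chunks)
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)

# Floating-point addition is not associative, so grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                    # False

random.seed(0)
values = [random.random() for _ in range(100_000)]

sequential = 0.0
for v in values:                        # strict left-to-right accumulation
    sequential += v
chunked = chunked_sum(values, 8)        # same values, different grouping
print(sequential, chunked, math.fsum(values))

# Relative difference between the two results quoted for 10_000_000 rows.
py_result, jit_result = 2501705.7589422013, 2501705.7589421924
rel_diff = abs(py_result - jit_result) / abs(py_result)
print(rel_diff)                         # on the order of 1e-15
```

A relative difference of that size is within a small multiple of double-precision machine epsilon (about 2.2e-16), which is consistent with rounding caused by reordering rather than with a logic error.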