@dlee992 Hi, thank you for the report! This is a bug caused by an incorrect layout definition for the SeriesType. Unfortunately, strided arrays/series are not covered well by our tests, but we will fix it shortly. You can work around it by creating the DF from a dictionary built from the column names and the transposed array (so that the native array layout of each column is contiguous), i.e.
df = pd.DataFrame(dict(zip(features, np.random.rand(5, num_columns).transpose())))
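For completeness, a minimal self-contained sketch of this workaround (the concrete values of features and num_columns are placeholders I chose, not taken from the original report):

import numpy as np
import pandas as pd

num_columns = 3
features = ['col1', 'col2', 'col3']

# Building the DataFrame from a dict of 1-D slices lets pandas store each
# column in its own contiguous buffer, avoiding the strided-layout problem.
data = np.random.rand(5, num_columns).transpose()
df = pd.DataFrame(dict(zip(features, data)))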
@kozlov-alexey, thanks, the workaround makes sense. I tested a bit more, as below:
def foo1(df):
    choose_col = 'col1'
    filter_series = df[choose_col].apply(lambda x: 0 if x < 0.5 else 1)
    filtered_sum = (df[target_col] * filter_series).sum()
    return filtered_sum

def foo2(df):
    lst = ['col1', 'col2', 'col3']
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()
foo1 can be compiled and executed with @njit (but the filtered_sum has accuracy drift compared with the pure-Python result, and the drift becomes larger for DataFrames with more rows; I am not sure whether this drift is expected because of fastmath or some numerical optimizations, or is just a bug), while foo2 can't be compiled. Why does this happen?
# foo1 without and with njit, df has 10 rows
- [4.207398879779037, 10.0]
? ^
+ [4.207398879779036, 10.0]
? ^
# foo1 without and with njit, df has 10_000_000 rows
- [2501705.7589422013, 10000000.0]
? ^^^
+ [2501705.7589421924, 10000000.0]
? ++ ^
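(For context, the -/+/? lines above are difflib-style comparison markers, with ^ pointing at the digits that differ. A minimal sketch of how such a comparison could be produced is below; the use of difflib.ndiff and the compare helper are my assumptions, not taken from the original test.)

import difflib
from numba import njit

def compare(df):
    # Hypothetical helper: run foo1 (defined above, with target_col set elsewhere)
    # with and without @njit, then diff the printed results.
    res_py = foo1(df)            # pure-Python result
    res_jit = njit(foo1)(df)     # SDC/Numba-compiled result
    print('\n'.join(difflib.ndiff([str(res_py)], [str(res_jit)])))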
# foo2 compilation error
File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 482, in _compile_for_args
error_rewrite(e, 'typing')
File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 423, in error_rewrite
raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot request literal type.
File "test.py", line 57:
def _modified_pipeline(df):
<source elided>
for cho_col in lst:
filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
^
During: typing of intrinsic-call at /Users/.../sdc/tests/tests_ant/test_ant_9.py (57)
Tested on the latest master branch of SDC.
@dlee992, the second error is a current limitation of SDC (which is mostly based on Numba, a JIT compiler with static typing): iterating over heterogeneous collections using normal Python syntax is generally not supported. The reason is simple: in your example the DF could have columns of different types, so the variable filtered_sum would need to have a different type on different iterations of the loop. Specifically for this, Numba provides the literal_unroll feature, which allows the code to be compiled with minimal changes, e.g.
from numba import literal_unroll

@njit
def foo2_error(df):
    lst = ('col1', 'col2', 'col3')  # a tuple of column names instead of a list
    results = []
    for cho_col in literal_unroll(lst):  # literal_unroll used
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()
        results.append(filtered_sum)
    return results
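The key change is replacing the list with a tuple of string constants: literal_unroll unrolls the loop at compile time, so each df[cho_col] access is typed with a concrete, literal column name, and each unrolled iteration can be compiled for a column of a different type.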
The above should work. Regarding the first problem, I think this deviation in precision is somewhat expected, as parallelizing the sum in SDC means the values are added in a different order. As far as I can see, when operating on a sorted sequence with parallel=False, summation via an explicit loop gives exactly the same result for the compiled and pure-Python versions:
# on sorted data with sum via explicit loop:
arr_result: 2500270.1616518456 # numba jitted explicit loop, parallel=False
arr_result_ref: 2500270.1616518456 # python explicit loop
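To illustrate why summation order matters, here is a small standalone sketch (plain NumPy, not SDC code): adding the same values strictly left-to-right versus in chunks, as a parallel reduction would, typically differs in the last few digits because floating-point addition is not associative.

import numpy as np

vals = np.random.rand(1_000_000)

seq = 0.0
for v in vals:                      # strict left-to-right accumulation
    seq += v

# Chunked summation, roughly mimicking a parallel reduction:
chunk_sums = [vals[i:i + 100_000].sum() for i in range(0, vals.size, 100_000)]
par = float(np.sum(chunk_sums))

print(seq, par, seq - par)          # usually differs in the last few digits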
But thank you for pointing this out; we will dig deeper to see if something can be improved.
Reporting a bug
As you can see, Python and SDC produce different outputs for the same inputs.
Python 3.7.9 & numba 0.52.0 & sdc 0.38.0 & pandas 1.2.0