IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

Why does @sdc_overload_method use func_text instead of a normal Python func definition? #999

Closed: dlee992 closed this issue 2 years ago

dlee992 commented 2 years ago

Now I am digging into the SDC source code and trying to implement a limited version of pandas.DataFrame.apply() for myself.

While doing this, I found your implementation of pandas.DataFrame.head(). Why did you choose to implement the overloaded head method using func_text? Is there a particular reason? Would a normal Python function definition not work for SDC?

[screenshot: SDC's func_text-based codegen implementation of DataFrame.head]
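
For reference, here is a minimal plain-Python sketch of the func_text/exec pattern shown in the screenshot. The names (gen_df_head_impl) and the plain df['name'].values access are only illustrative; the real SDC overload works on typed column data, not ordinary pandas indexing.

```python
import pandas

def gen_df_head_impl(column_names):
    # Build the implementation source line by line: one slicing line per column,
    # because the set of columns is only known from the DataFrame type at
    # compile time and cannot be iterated over generically inside nopython code.
    func_lines = ['def _df_head_impl(df, n=5):']
    for i, name in enumerate(column_names):
        func_lines.append(f"    data_{i} = df['{name}'].values[:n]")
    result_items = ', '.join(f"'{name}': data_{i}" for i, name in enumerate(column_names))
    func_lines.append(f'    return pandas.DataFrame({{{result_items}}})')
    func_text = '\n'.join(func_lines)

    # exec turns the generated source into a real function object, which an
    # overload handler would then return for compilation.
    loc_vars = {}
    exec(func_text, {'pandas': pandas}, loc_vars)
    return loc_vars['_df_head_impl']

# Plain-Python usage of the generated function:
head_impl = gen_df_head_impl(['A', 'B'])
print(head_impl(pandas.DataFrame({'A': range(10), 'B': range(10)}), n=3))
```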

dlee992 commented 2 years ago

I think this kind of 'strange' implementation is related to SDC's currently limited support for pandas.DataFrame(), which can only accept {'col_1': series_1, 'col_2': series_2, ...} as raw data. Is that understanding correct? If so, is there really no other, more conventional and equivalent way to achieve the same goal?

Another important issue is that the number of lines of Python code produced by the codegen function is proportional to the number of DataFrame columns. I suspect this hurts compilation performance significantly: if one column takes about a second to compile, a thousand columns could take much longer. I have not verified this yet; the small sketch below only makes the linear growth of the generated source concrete.
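
A standalone illustration of the scaling concern (the helper names inside the generated string are placeholders, not the real SDC codegen):

```python
def fake_codegen(n_cols):
    # One slicing line per column, plus a header and a return line, so the
    # generated source grows linearly with the number of columns and all of it
    # must then be parsed and type-inferred at compile time.
    lines = ['def impl(df, n=5):']
    lines += [f'    data_{i} = get_dataframe_data(df, {i})[:n]' for i in range(n_cols)]
    lines.append('    return init_dataframe(' + ', '.join(f'data_{i}' for i in range(n_cols)) + ')')
    return '\n'.join(lines)

for n_cols in (1, 100, 1000):
    print(n_cols, 'columns ->', len(fake_codegen(n_cols).splitlines()), 'generated lines')
```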

dlee992 commented 2 years ago

I want to contribute a PR for pandas.DataFrame.apply(), a limited version of the original pandas method. I found that the rolling.apply() implementation could be a good reference for me.

kozlov-alexey commented 2 years ago

I think this kind of 'strange' implementation is related to SDC's currently limited support for pandas.DataFrame(), which can only accept {'col_1': series_1, 'col_2': series_2, ...} as raw data. Is that understanding correct?

@dlee992 Hi, yes, this is related. In general, series types can differ, so to infer the resulting DataFrame type a const mapping from column names to series data is needed (and, at least when this was written, Numba had no support for dicts with const literal names and heterogeneously typed values). You are right that this impacts compilation times, of course, but our tests showed that with recent improvements (namely #936) the DF constructor compiles quite fast (several minutes for a DF of ~500 columns).
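
To illustrate the dict limitation: Numba's runtime typed.Dict requires a single value type, so a mapping from names to differently typed series cannot be expressed that way. This is a small Numba-only illustration, not SDC code (compile-time literal-key dicts are a separate feature that was missing at the time).

```python
import numpy as np
from numba.core import types
from numba.typed import Dict

# With a single value type the typed dict works fine: every column would have
# to be, e.g., a float64 array.
d = Dict.empty(key_type=types.unicode_type, value_type=types.float64[:])
d['col_1'] = np.arange(3, dtype=np.float64)

# A real DataFrame, however, maps names to differently typed series (say, one
# float64 column and one int64 column). The assignment below is rejected,
# because typed.Dict values must all share one Numba type:
# d['col_2'] = np.arange(3, dtype=np.int64)
```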

I want to contribute a PR for pandas.DataFrame.apply()

Any PRs are very welcome! There is an alternative to using exec for building the resulting DF data. You can refer to the example below, where df.drop is refactored via functions in sdc.functions.tuple_utils: https://gist.github.com/kozlov-alexey/f29e8d2703789491e8e24e41de16536b
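
The general idea is that Numba handles heterogeneous tuples natively, so per-column data can be processed without exec. A simplified sketch of that underlying capability (not the actual sdc.functions.tuple_utils API), assuming the column data is carried as a tuple of arrays:

```python
import numpy as np
from numba import njit

@njit
def head_of_columns(col_data, n):
    # Heterogeneous tuples are first-class in nopython mode: each element keeps
    # its own dtype, and constant indices are resolved at compile time, so no
    # generated source is needed for a fixed, known number of columns.
    return (col_data[0][:n], col_data[1][:n])

columns = (np.arange(5, dtype=np.int64), np.linspace(0.0, 1.0, 5))
print(head_of_columns(columns, 3))
```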

dlee992 commented 2 years ago

@kozlov-alexey hi, thanks very much!

I am reading your df.drop example; it is a very helpful hint.

I will try to mock up one for df.apply and, along the way, learn the underlying data structure of the SDC DataFrame.