Closed dlee992 closed 2 years ago
I think this kind of 'strange' implementation is related to SDC's currently limited support for `pandas.DataFrame()`, which only accepts `{'col_1': series_1, 'col_2': series_2, ...}` as raw data. Is that right? If so, isn't there a more conventional, equivalent way to achieve this?
Another important issue is that the number of lines of Python code generated by the codegen function is proportional to the number of DataFrame columns. I suspect this affects compilation performance a lot. For example, if one column takes 1 second to compile, 1k columns could take much longer. I have not verified this yet.
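To make the scaling concern concrete, here is a minimal pure-Python sketch of column-proportional code generation. It is not SDC's actual codegen: `gen_ctor_source`, `count_lines`, and `time_compile` are hypothetical names, and the timing only covers CPython's `compile()`, not Numba compilation, so it says nothing definitive about SDC's real compile times.

```python
import time


def gen_ctor_source(n_cols):
    """Build source text for a function that takes n_cols column arguments
    and returns them as a tuple, emitting one generated line per column.
    Hypothetical stand-in for a codegen function; requires n_cols >= 1."""
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    args = ", ".join(f"col_{i}" for i in range(n_cols))
    lines = [f"def ctor({args}):"]
    for i in range(n_cols):
        # One line of generated code per column.
        lines.append(f"    data_{i} = col_{i}")
    ret = ", ".join(f"data_{i}" for i in range(n_cols))
    lines.append(f"    return ({ret},)")
    return "\n".join(lines)


def count_lines(n_cols):
    """Number of source lines generated for n_cols columns."""
    return len(gen_ctor_source(n_cols).splitlines())


def time_compile(n_cols):
    """Seconds CPython takes to compile the generated source (not Numba)."""
    src = gen_ctor_source(n_cols)
    start = time.perf_counter()
    compile(src, "<generated>", "exec")
    return time.perf_counter() - start
```

In this sketch the generated source always has `n_cols + 2` lines, so comparing `time_compile(10)` with `time_compile(1000)` gives a rough feel for how the front-end cost grows with the column count.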
I want to contribute a PR for `pandas.DataFrame.apply()`, a limited version of the original pandas one. I found that `rolling.apply()` can be a good reference for me.
> I think this kind of 'strange' implementation is related with current SDC limited support of `pandas.dataframe()`, which can only accept `{'col_1':series_1, 'col_2':series2, ...}` as raw data. Is this thought right?
@dlee992 Hi, yes, this is related. In general, series types can differ, so inferring the resulting DataFrame type requires a constant mapping from column names to series data (and, at least when this was written, Numba had no support for dicts with constant literal keys and heterogeneously typed values). You are right that this affects compilation times, but our tests showed that with recent improvements (namely #936) the DF constructor compiles quite fast (several minutes for a DF of ~500 columns).
> I want to contribute a PR about `pandas.dataframe.apply()`
Any PRs are very welcome! There's an alternative to using `exec` for building the resulting DF data. You can refer to the example below, where `df.drop` is refactored using functions from `sdc.functions.tuple_utils`:
https://gist.github.com/kozlov-alexey/f29e8d2703789491e8e24e41de16536b
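The gist is not reproduced here. As a rough illustration of the general idea only (operating on tuples of column names and column data instead of generating source text), here is a pure-Python sketch. `drop_columns` is a hypothetical helper and does not use or mirror the real `sdc.functions.tuple_utils` API; inside Numba, the column names would additionally have to be compile-time constants.

```python
def drop_columns(names, data, to_drop):
    """Return (names, data) without the columns listed in to_drop.

    names   -- tuple of column names
    data    -- tuple of column data, same order as names; columns may
               have different types
    to_drop -- iterable of column names to remove
    """
    to_drop = set(to_drop)
    missing = to_drop - set(names)
    if missing:
        raise KeyError(f"columns not found: {sorted(missing)}")
    # Indices of the columns that survive the drop, in original order.
    kept = tuple(i for i, name in enumerate(names) if name not in to_drop)
    return tuple(names[i] for i in kept), tuple(data[i] for i in kept)
```

Called with names `("a", "b", "c")` and three differently typed columns, dropping `"b"` keeps columns `"a"` and `"c"` in order, without any `exec` step.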
@kozlov-alexey Hi, thank you very much!
I am reading your `df.drop` example; it is a very good hint.
I will try to write a similar one for `df.apply` and, along the way, learn the underlying data structure of the SDC DataFrame.
I am now digging into the SDC source code and trying to implement a limited version of `pandas.DataFrame.apply()` myself. While doing this, I found your implementation of `pandas.DataFrame.head()`. Why did you choose to implement the `head` overload using `func_text`? Is there a particular reason? Would a normal Python function definition not work in SDC?
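For readers unfamiliar with the pattern in question, here is a minimal pure-Python sketch of the `func_text` approach: the function body is assembled as a string with one statement per column and then materialized with `exec`. This is not SDC's actual `head` implementation; `make_head` and the tuple-of-columns layout are assumptions made for illustration. One plausible motivation, given the earlier remark that series types can differ, is that unrolled per-column statements each deal with a single concrete column type, but the thread does not confirm this.

```python
def make_head(n_cols):
    """Generate a head(data, n) function for a fixed number of columns.

    data is assumed to be a tuple of n_cols sliceable columns
    (hypothetical layout). Requires n_cols >= 1.
    """
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    func_text = ["def head(data, n=5):"]
    for i in range(n_cols):
        # One unrolled statement per column.
        func_text.append(f"    col_{i} = data[{i}][:n]")
    ret = ", ".join(f"col_{i}" for i in range(n_cols))
    func_text.append(f"    return ({ret},)")

    namespace = {}
    exec("\n".join(func_text), namespace)
    return namespace["head"]
```

For example, `make_head(2)` returns a function that slices the first `n` items of each of two columns and returns them as a tuple.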