IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

add dataframe apply #1003

Closed dlee992 closed 2 years ago

dlee992 commented 2 years ago

I implemented a limited version of pandas.DataFrame.apply. It supports axis=1 and requires every element of the output dataframe to be types.float64. It accelerates the apply API by 5-10x on a single core and by 20-30x on eight cores.
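For reference, the call shape described above looks roughly like this in plain pandas. This is only a sketch of the uncompiled behavior, not the SDC implementation: the function `row_stats`, the column names, and the data are made up for illustration, and the trailing tuple in `args` plays the role of args[-1].

```python
import pandas as pd

# Hypothetical row function: the last positional argument is a tuple of
# output column labels, mirroring the args[-1] convention in this PR.
def row_stats(row, scale, out_cols):
    total = row["a"] + row["b"]
    # Returning a Series makes apply(axis=1) build a DataFrame whose
    # columns are taken from the Series index.
    return pd.Series([total * scale, total / 2.0], index=out_cols)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# axis=1 applies the function to each row; every output value is a float.
result = df.apply(row_stats, axis=1, args=(10.0, ("scaled_sum", "mean")))
print(result)
print(result.dtypes)
```

In plain pandas the output column names come from the index of the returned Series, so the tuple only needs to reach the function through `args`.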

My implementation has some limitations (some from sdc, some from numba, some of my own):

  1. args[-1] must be a tuple whose length equals the number of columns in the output dataframe. I would also like to use the values in args[-1] as the output dataframe column names, but I have not found a way to implement this; instead I have to generate a list named col_names to use in the DataFrameType init.
  2. Right now, my implementation does not support the kwargs accepted by the original pandas.DataFrame.apply.
  3. Right now, I assume axis=1, raw=False, result_type=None; this could be enhanced later.
  4. Related to limitation 1: I would like to move the DataFrameType init code into the impl body so that it can use the value of args[-1], but get_structure_maps cannot be moved into impl because it is not jitted. Any ideas about this?
  5. Right now, I assume func returns a Series; this could be enhanced later too, e.g. to allow a list or np.ndarray.
  6. Last but not least, I would like users of the compiled apply to be able to provide the type of each column of the output dataframe, e.g. all types.float64, all types.string, or even mixed (the first column types.float64 and the other columns types.string). How could this be implemented? Any suggestions? For now, I have only implemented the version in which all output types are types.float64.
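To make limitation 6 more concrete, here is one possible shape for a user-supplied output schema, written in plain pandas. It is a sketch under assumptions, not a proposal for the jitted code: `apply_with_schema`, `describe`, and the schema dict are hypothetical names that do not exist in sdc or pandas, and nothing here addresses how numba would type the result at compile time.

```python
import pandas as pd

def apply_with_schema(df, func, schema, *args):
    """Hypothetical helper: row-wise apply whose output column names and
    dtypes are taken from `schema` (a dict of name -> dtype)."""
    names = tuple(schema)
    # Accept a list, tuple, np.ndarray, or Series from func by converting
    # each result to a plain list of values.
    rows = [list(func(row, *args)) for _, row in df.iterrows()]
    for values in rows:
        if len(values) != len(names):
            raise ValueError("func must return one value per schema column")
    # Build the output column by column, then cast to the declared dtypes.
    data = {name: [values[i] for values in rows] for i, name in enumerate(names)}
    return pd.DataFrame(data, index=df.index).astype(schema)

# Hypothetical row function returning one float and one string per row.
def describe(row, threshold):
    total = row["a"] + row["b"]
    return [total, "high" if total > threshold else "low"]

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
out = apply_with_schema(df, describe, {"total": "float64", "label": str}, 6.0)
print(out)
```

The schema dict carries both the column names (limitation 1) and the per-column types (limitation 6) in a single argument, which is one way the two pieces of information could be passed together.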

Any comments are welcome! Thanks! @kozlov-alexey @shssf