IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

Refactor df.drop to improve compile times #945

Closed kozlov-alexey closed 3 years ago

kozlov-alexey commented 3 years ago

Motivation: the old implementation of df.drop() produced very large LLVM IR for DFs with hundreds of columns, because it extracted all the columns from the original DF and then packed them back into an internal structure (a tuple of lists of arrays). The new implementation makes a copy of the internal DF structure and just pops the dropped columns from the affected lists, which greatly reduces IR size and compilation time.
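To make the idea concrete, here is a minimal plain-Python sketch of the "copy and pop" approach. It is not the SDC implementation and does not involve Numba: the function name `drop_columns`, the `column_loc` mapping, and the exact layout of `data` are assumptions made for illustration, based only on the "tuple of lists of arrays" description above.

```python
import numpy as np


def drop_columns(data, column_loc, to_drop):
    """Illustrative model only (hypothetical names and layout).

    data       -- tuple of lists of arrays, one list per block of columns
    column_loc -- dict: column name -> (list index, position in that list)
    to_drop    -- set of column names to remove
    """
    # Shallow-copy every inner list; the arrays themselves are shared.
    new_data = tuple(list(block) for block in data)

    # Pop higher positions first so earlier pops do not shift later ones.
    for name in sorted(to_drop, key=lambda n: column_loc[n][1], reverse=True):
        block_idx, pos = column_loc[name]
        new_data[block_idx].pop(pos)

    # Recompute positions of the surviving columns.
    new_loc = {}
    for name, (block_idx, pos) in column_loc.items():
        if name in to_drop:
            continue
        shift = sum(
            1
            for d in to_drop
            if column_loc[d][0] == block_idx and column_loc[d][1] < pos
        )
        new_loc[name] = (block_idx, pos - shift)
    return new_data, new_loc


# Two int columns share the first list, one float column is in the second.
a, b, c = np.arange(3), np.arange(3) * 10, np.ones(3)
data = ([a, b], [c])
column_loc = {"A": (0, 0), "C": (1, 0), "B": (0, 1)}

new_data, new_loc = drop_columns(data, column_loc, {"A"})
```

Only the lists touched by the dropped columns change, and no column is extracted or repacked individually. In the compiled setting this is what would keep the generated code from growing with the total number of columns.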

Some numbers:

| n_columns | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|
| LLVM IR size on master, MB | 0.46104 | 0.79895 | 1.475782 | 2.833022 | 5.595263 | 11.13747 | 22.23876 |
| LLVM IR size with PR #945, MB | 0.478319 | 0.491075 | 0.516738 | 0.568007 | 0.671354 | 0.880129 | 1.297598 |
| IR size ratio (master / PR) | 0.963874 | 1.626943 | 2.855959 | 4.987649 | 8.334293 | 12.65437 | 17.13841 |
| Compilation time on master, s | 1.009 | 1.188911 | 2.137561 | 4.167516 | 9.775089 | 23.66213 | 68.41024 |
| Compilation time with PR #945, s | 0.908008 | 0.751528 | 0.992976 | 1.34102 | 2.412094 | 4.490081 | 8.600799 |
| Compilation time ratio (master / PR) | 1.111223 | 1.581993 | 2.152681 | 3.107722 | 4.052533 | 5.269869 | 7.95394 |
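For readers who want to check the ratio rows, the short script below recomputes them from the measured values reported above. The variable names are chosen here for illustration; the numbers are copied from the table and nothing is measured anew.

```python
n_columns = [8, 16, 32, 64, 128, 256, 512]

# Values copied from the measurements reported above.
ir_master = [0.46104, 0.79895, 1.475782, 2.833022, 5.595263, 11.13747, 22.23876]
ir_pr = [0.478319, 0.491075, 0.516738, 0.568007, 0.671354, 0.880129, 1.297598]
time_master = [1.009, 1.188911, 2.137561, 4.167516, 9.775089, 23.66213, 68.41024]
time_pr = [0.908008, 0.751528, 0.992976, 1.34102, 2.412094, 4.490081, 8.600799]

# Ratio "without / with" the PR: values above 1 mean the PR is an improvement.
ir_ratio = [m / p for m, p in zip(ir_master, ir_pr)]
time_ratio = [m / p for m, p in zip(time_master, time_pr)]

for n, ir, t in zip(n_columns, ir_ratio, time_ratio):
    print(f"{n:>4} columns: IR size x{ir:.2f}, compile time x{t:.2f}")
```

Both ratios grow with the number of columns. At 8 columns the IR size ratio is slightly below 1, so the IR produced with the PR is marginally larger there.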