IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

Remove column names from DataFrame model #944

Closed kozlov-alexey closed 3 years ago

kozlov-alexey commented 3 years ago

Motivation: before this change, column names were passed to the DF ctor as arguments of LiteralString types (each name having its own type), which seems to contribute to the linear growth of LLVM IR size with the number of columns and hence to DF ctor compile time. Since this information is already stored in the DF type itself and can be captured during typing in any DF method, this PR proposes removing the column names from the DF model struct as redundant.
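The idea can be illustrated with a minimal pure-Python sketch. It does not use Numba or SDC, and every name in it (`DataFrameTypeSketch`, `model_members_with_names`, `model_members_without_names`, `resolve_column`) is hypothetical; the member layouts are assumptions made for illustration, not SDC's actual model definition. It only shows that once the names live in the type, a method's typing step can resolve a column without any per-name member in the struct.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataFrameTypeSketch:
    """Hypothetical stand-in for a compile-time DataFrame type (not SDC's class)."""
    columns: Tuple[str, ...]        # column names are part of the type
    column_dtypes: Tuple[str, ...]  # one dtype name per column


def model_members_with_names(df_type):
    # Assumed "before" layout: one extra struct member per column name,
    # each with its own literal type.
    members = [("data", df_type.column_dtypes), ("index", "index")]
    members += [(f"name_{i}", f"Literal[str]({name!r})")
                for i, name in enumerate(df_type.columns)]
    return members


def model_members_without_names(df_type):
    # Assumed "after" layout: the names are dropped from the struct.
    return [("data", df_type.column_dtypes), ("index", "index")]


def resolve_column(df_type, name):
    # What a method's typing step could do: read the position from the type.
    return df_type.columns.index(name)


df_type = DataFrameTypeSketch(columns=("A", "B", "C"),
                              column_dtypes=("int64", "float64", "int64"))
extra_members = (len(model_members_with_names(df_type))
                 - len(model_members_without_names(df_type)))
col_position = resolve_column(df_type, "B")
```

In this toy layout the number of extra members equals the number of columns, which is the kind of per-column cost the change aims to remove.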

kozlov-alexey commented 3 years ago

Some numbers on reducing IR/compilation time:

| n_columns | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLVM IR size, B (master) | 379255 | 723548 | 1596652 | 4081771 | 12016803 | 39670140 | 142028921 |
| LLVM IR size, B (with PR #944) | 301594 | 580848 | 1324210 | 3547908 | 10947168 | 37538424 | 137766111 |
| reduced by, % | 20.48 | 19.72 | 17.06 | 13.08 | 8.90 | 5.37 | 3.00 |
| compilation time, s (master) | 0.50796127 | 0.45004487 | 0.93897152 | 1.86501479 | 5.34555006 | 20.642166 | 128.379645 |
| compilation time, s (with PR #944) | 0.42104888 | 0.36896634 | 0.63800144 | 1.33405638 | 3.82808781 | 16.295801 | 116.300309 |
| reduced by, % | 17.11 | 18.02 | 32.05 | 28.47 | 28.39 | 21.06 | 9.41 |
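For reference, the "reduced by" rows appear to follow from the usual relative-difference formula, (before - after) / before * 100. The short script below recomputes them from the reported measurements; the helper name `reduced_by` and the variable names are illustrative only.

```python
def reduced_by(before, after):
    """Percentage reduction from `before` to `after`, rounded to 2 places."""
    return round((before - after) / before * 100, 2)


# Measurements as reported above, for n_columns = 8, 16, 32, 64, 128, 256, 512.
ir_master = [379255, 723548, 1596652, 4081771, 12016803, 39670140, 142028921]
ir_pr = [301594, 580848, 1324210, 3547908, 10947168, 37538424, 137766111]
time_master = [0.50796127, 0.45004487, 0.93897152, 1.86501479,
               5.34555006, 20.642166, 128.379645]
time_pr = [0.42104888, 0.36896634, 0.63800144, 1.33405638,
           3.82808781, 16.295801, 116.300309]

ir_reduction = [reduced_by(b, a) for b, a in zip(ir_master, ir_pr)]
time_reduction = [reduced_by(b, a) for b, a in zip(time_master, time_pr)]

print("IR size reduced by, %:", ir_reduction)
print("compile time reduced by, %:", time_reduction)
```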