IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

Implements init_dataframe as multiple codegen functions #936

Status: Closed (kozlov-alexey closed 3 years ago)

kozlov-alexey commented 3 years ago

Motivation: init_dataframe was implemented as a Numba intrinsic taking *args, which seems to generate redundant extractvalue/insertvalue LLVM instructions. As a result, the IR size grows quadratically with the number of DF columns, which increases the total compilation time of functions that create large DFs. This PR replaces the single init_dataframe with multiple functions, one per number of DF columns, which are generated at compile time and therefore avoid the use of *args.
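The idea of generating one function per column count can be sketched in plain Python. This is only an illustration of the codegen pattern, not the code from this PR: the helper name make_init_dataframe, the generated names init_dataframe_N, and the dict-based return value are all hypothetical, and the real implementation targets Numba rather than ordinary Python functions.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def make_init_dataframe(n_columns):
    """Build (once per column count) an init function with explicit parameters.

    Hypothetical helper for illustration only; not the SDC implementation.
    """
    col_args = [f"col_{i}" for i in range(n_columns)]
    params = ", ".join(col_args + ["index", "column_names"])
    # A trailing comma after every name keeps the one-column case a tuple.
    data_tuple = "(" + "".join(arg + ", " for arg in col_args) + ")"
    func_name = f"init_dataframe_{n_columns}"
    source = (
        f"def {func_name}({params}):\n"
        f"    data = {data_tuple}\n"
        f"    return dict(zip(column_names, data)), index\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source text
    return namespace[func_name]


# Usage: a three-column variant takes exactly three data arguments, no *args.
init_df_3 = make_init_dataframe(3)
frame, index = init_df_3(
    [1, 2], [3.0, 4.0], ["x", "y"],
    index=[0, 1],
    column_names=("a", "b", "c"),
)
```

Because each generated function names its parameters explicitly, nothing has to be packed into or unpacked from a variadic tuple, which is the property the PR relies on.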

| n_columns | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|
| LLVM IR size, MB (master) | 0.287622 | 0.55394 | 1.262865 | 3.383549 | 10.44003 | 35.79943 | 131.384 |
| LLVM IR size, MB (with PR #936) | 0.143275 | 0.209119 | 0.341938 | 0.608992 | 1.143528 | 2.220672 | 4.406426 |
| IR size ratio (master / PR) | 2.007482 | 2.648924 | 3.693257 | 5.555986 | 9.12967 | 16.12099 | 29.81645 |
| Compilation time, s (master) | 0.521313 | 0.366884 | 0.67621 | 1.39326 | 4.603106 | 17.54948 | 126.7943 |
| Compilation time, s (with PR #936) | 0.683099 | 0.413965 | 0.450348 | 0.715598 | 1.454044 | 3.210638 | 6.943996 |
| Compilation time ratio (master / PR) | 0.763159 | 0.886268 | 1.501529 | 1.946987 | 3.165726 | 5.466041 | 18.25956 |

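One way to read the IR size figures above is to estimate the local growth exponent between consecutive measurements. Since n_columns doubles at each step, log2 of the size ratio approximates k in size ~ n^k. The sketch below uses only the IR sizes reported in this PR; the variable and function names are illustrative.

```python
import math

# IR sizes copied from the measurements reported in this PR.
n_columns = [8, 16, 32, 64, 128, 256, 512]
ir_size_master = [0.287622, 0.55394, 1.262865, 3.383549, 10.44003, 35.79943, 131.384]
ir_size_pr = [0.143275, 0.209119, 0.341938, 0.608992, 1.143528, 2.220672, 4.406426]


def growth_exponents(sizes):
    """Return log2(next / current) for each pair of consecutive measurements."""
    return [math.log2(b / a) for a, b in zip(sizes, sizes[1:])]


master_exp = growth_exponents(ir_size_master)
pr_exp = growth_exponents(ir_size_pr)

for n, m, p in zip(n_columns[1:], master_exp, pr_exp):
    print(f"{n // 2:>3} -> {n:>3} columns: master {m:.2f}, PR {p:.2f}")
```

For the last step (256 to 512 columns) the exponent on master comes out close to 2, while with the PR it stays close to 1, which is consistent with quadratic versus roughly linear growth of the IR.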
pep8speaks commented 3 years ago

Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-11-13 15:02:36 UTC

kozlov-alexey commented 3 years ago

The read_csv tests fail with:

Failed in nopython mode pipeline (step: nopython rewrites)
module 'sdc.hiframes.pd_dataframe_ext' has no attribute 'init_dataframe'

These failures are expected, because this PR depends on changes from #918, which was recently rolled back. This PR is therefore blocked until #918 is restored.

AlexanderKalistratov commented 3 years ago

@kozlov-alexey @xaleryb the Windows Python 3.6 build fails with the SVML error again:

test_series_apply_np (sdc.tests.test_series.TestSeries) ... LLVM ERROR: Symbol not found: __svml_log4_ha

kozlov-alexey commented 3 years ago

> @kozlov-alexey @xaleryb win 3.6 build fails with svml error again:
>
> test_series_apply_np (sdc.tests.test_series.TestSeries) ... LLVM ERROR: Symbol not found: __svml_log4_ha

I think something is wrong with the packages being used (note that mkl and many others are installed from public channels rather than built). Could this be the reason?