IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
https://intelpython.github.io/sdc-doc/
BSD 2-Clause "Simplified" License

Changing csv_reader_py impl to return df from objmode #918

Closed kozlov-alexey closed 3 years ago

kozlov-alexey commented 3 years ago

Motivation: returning a tuple of columns (read from the CSV file with the pyarrow CSV reader) from objmode and then calling the init_dataframe constructor to create a native DataFrame turned out to be inefficient in terms of LLVM IR size and compilation time. With this PR, csv_reader_py instead returns a pandas DataFrame from objmode and relies on DataFrame unboxing to convert it to the native representation.

Compile time of read_csv + df.count():

| Solution \ columns | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|
| Numba master + both SDC fixes (2b8b0034d74) | 8.897234 | 9.306839 | 10.54691 | 12.52175 | 17.41399 | 30.47878 | 65.63396 |
| Numba master + SDC fix #1 (964e498a9) | 9.283413 | 9.83861 | 13.30219 | 21.7165 | 53.07618 | 187.4615 | 1026.31 |
| Numba 0.50.1 + SDC master | 9.212505 | 10.238 | 14.08183 | 25.16768 | 72.9872 | 290.3359 | 2141.832 |
| Ratio (Numba 0.50.1 + SDC master to both fixes) | 1.035435 | 1.100051 | 1.335162 | 2.009917 | 4.191296 | 9.525835 | 32.63299 |
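
The ratio row can be reproduced from the timings above by dividing the "Numba 0.50.1 + SDC master" figures by the "both SDC fixes" figures, column count by column count. The following is only an illustrative sketch: the variable names and the `speedup_ratios` helper are made up for this example and are not part of SDC.

```python
# Timings copied from the measurements above (units as reported in the PR).
columns = [4, 8, 16, 32, 64, 128, 256]
both_fixes = [8.897234, 9.306839, 10.54691, 12.52175, 17.41399, 30.47878, 65.63396]
sdc_master = [9.212505, 10.238, 14.08183, 25.16768, 72.9872, 290.3359, 2141.832]


def speedup_ratios(baseline, improved):
    """Hypothetical helper: element-wise baseline / improved."""
    return [b / i for b, i in zip(baseline, improved)]


ratios = speedup_ratios(sdc_master, both_fixes)
for n_cols, ratio in zip(columns, ratios):
    print(f"{n_cols:>4} columns: {ratio:.6f}")
```

The ratio grows with the number of columns, which matches the motivation: the cost of the tuple-plus-constructor approach rises much faster with DataFrame width than the cost of the unboxing approach.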

pep8speaks commented 3 years ago

Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-09-07 22:18:36 UTC