Redesign DataFrame structure

IntelPython / sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler

https://intelpython.github.io/sdc-doc/

BSD 2-Clause "Simplified" License

Redesign DataFrame structure #817

Closed akharche closed 4 years ago

akharche commented 4 years ago

Extension for #801

Implementation of new DataFrame structure based on lists instead of tuples
Improved df.count() codegen for testing

Example:

df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

(['A', 'B', 'C', 'D'],)
([array([1, 2, 3], dtype=int64), array([4, 5, 6], dtype=int64)], [array([0.5, 0.6, 0.7])], [array(['a', 'b', 'c'], dtype=object)])

Reproduce:

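import pandas as pd
from numba import njit

# Intel SDC must be installed so that Numba can compile pandas code.
@njit
def run_df():
    df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

    # Print the internal column names and the type-grouped column data.
    print(df._columns)
    print(df._data)

    return df.count()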
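
For readers without an SDC build at hand, the layout printed in the example can be imitated in plain Python. This is only an illustrative sketch, not SDC's implementation: `group_columns_by_type` is a hypothetical helper, and ordinary Python lists stand in for the NumPy arrays that SDC stores. It groups columns that share an element type into one list each, and keeps the column names in a separate one-element tuple.

```python
def group_columns_by_type(columns):
    """Hypothetical helper: build one list of columns per element type."""
    groups = {}  # element type -> list of columns, in first-seen order
    for name, values in columns.items():
        # Assumption: every column is non-empty and homogeneous in type.
        groups.setdefault(type(values[0]), []).append(values)
    return (list(columns),), tuple(groups.values())


frame = {'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']}
names, data = group_columns_by_type(frame)
print(names)  # (['A', 'B', 'C', 'D'],)
print(data)   # ([[1, 2, 3], [4, 5, 6]], [[0.5, 0.6, 0.7]], [['a', 'b', 'c']])
```

The integer columns A and C land in the same list, which mirrors the grouping shown in the example output.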

pep8speaks commented 4 years ago

Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-04-30 18:36:58 UTC

akharche commented 4 years ago

Performance results of DataFrame with mixed types of columns

Columns	Current compile time, s	New compile time, s	New/Current
16	11.092775	9.713415	0.875652
32	28.611168	20.057998	0.701055
64	128.307209	67.369364	0.525062
128	803.671405	310.529887	0.386389
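
The last value in each row is the new compile time divided by the current one, so lower is better. The short check below recomputes that ratio from the timings listed in this comment. It is a sketch for reading the numbers, not part of the PR, and the variable names are illustrative.

```python
# (columns, current compile time in s, new compile time in s, reported ratio)
# Values are copied from the comment above.
rows = [
    (16, 11.092775, 9.713415, 0.875652),
    (32, 28.611168, 20.057998, 0.701055),
    (64, 128.307209, 67.369364, 0.525062),
    (128, 803.671405, 310.529887, 0.386389),
]

ratios = []
for n_columns, current_s, new_s, reported in rows:
    ratio = new_s / current_s   # below 1.0 means the new structure compiles faster
    speedup = current_s / new_s
    ratios.append(ratio)
    print(f"{n_columns:>4} columns: ratio {ratio:.4f}, speedup {speedup:.2f}x")
```

The ratio falls as the number of columns grows, so the relative gain in compile time is largest for the widest frames measured.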

densmirn commented 4 years ago

Performance results of DataFrame with mixed types of columns

Columns	Current compile time, s	New compile time, s	New/Current
4	11.092775	9.713415	0.875652
8	28.611168	20.057998	0.701055
16	128.307209	67.369364	0.525062
32	803.671405	310.529887	0.386389

The results look very good.

AlexanderKalistratov commented 4 years ago

@akharche examples failed