KxSystems / pyq

PyQ — Python for kdb+
http://code.kx.com/q/interfaces
Apache License 2.0
190 stars 49 forks source link

How to deal with large datasets in Pandas #82

Closed zak-b2c2 closed 4 years ago

zak-b2c2 commented 6 years ago

Questions

Not necessarily an issue but trying to find the best way for converting large KDB datasets to Pandas DataFrame in PyQ. For my users, I have built a sandbox that loads in a HDB from disk, and then have an API to interface with the data. It works well and fine, and is quick to pull the data (no CPU or MEM issues). The users are excited to use it and appreciate that it is sharing the same memory space. However most of them are non-q users and therefore use Pandas and Numpy. What is the optimal way for dealing with large HDB datasets and having them in Pandas or Numpy?

Btw, thanks for all the work on PyQ, I'm happy to share the work I've done or helping with a case study

antipisa commented 5 years ago

@sashkab and @abalkin, I have found the optimal way to convert a K table to pandas dataframe is by looping over types. It has to do with the way pandas stores their numeric and object columns.

Python 3 solution:

def k_to_df(ktbl):
    c = q.cols(ktbl)
    c1 = ktbl.meta.exec('c', where='t in "hijbxef"')
    df = pd.DataFrame(dict(q('{$[count x;flip x#y;()!()]}',c1,ktbl)))
    for col in q.except_(c,c1):
        df[col] = ktbl[col]
    return df.reindex(c, axis='columns', copy=False)
zak-b2c2 commented 5 years ago

Thanks @antipisa but it seems it is pretty slow, or maybe I'm missing something? The below solution is much faster:

def to_pandas(tbl):
    """
    Converts a kdb+ table to Pandas DataFrame
    """
    df = pd.DataFrame()
    for c in tbl.cols:
        if (str(q('meta', tbl)[c]['t']) == 's') and getattr(tbl, c).inspect(b't') >= 20:
            d = list(q('sym',numpy.asarray(tbl[c].data)))
        elif str(q('meta', tbl)[c]['t']) == 'C':
            d = [str(x) for x in tbl[c]]
        else:
            d = numpy.asarray(tbl[c])
        df[c] = d
    return df
In [3]: fills.count
Out[3]: k('109112')
In [4]: %time to_pandas(fills)
CPU times: user 288 ms, sys: 44 ms, total: 332 ms
Wall time: 330 ms

In [6]: %time k_to_df(fills);
CPU times: user 24.7 s, sys: 196 ms, total: 24.9 s
Wall time: 25.7 s
github-actions[bot] commented 4 years ago

Stale issue message