zak-b2c2 commented 6 years ago

Questions

[ ] Which operating system are you using (if Linux, please provide flavour of it, i.e RedHat, CentOS or Ubuntu), is it 32-bit, or 64-bit?
```
Ubuntu 64-bit 16.04
```
[ ] Which version of PyQ are you running? Please provide output of pyq --versions, if PyQ isn't operational, please provide Python interpreter version and PyQ version python -V; python3 -V; pip list | grep pyq:
```
PyQ 4.1.4
NumPy 1.15.0
KDB+ 3.5 (2018.02.26) l64
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609]
```
[ ] Which version of kdb+ are you using, is it 32-bit or 64-bit?
```
64-bit    
```
- [ ] If on 64-bit, is your QLIC set? Please provide output env | grep QLIC on linux/macOS, or set|grep QLIC on Windows.
```
QLIC=/kdb-ebs/q
```
[ ] Did you use virtual environment to install PyQ? If not, why?
- Yes
[ ] Where is your QHOME? Please provide output env | grep QHOME on linux/macOS, or set|grep QHOME on Windows.
```
QHOME=/kdb-ebs/q
```
[ ] Do you use Conda? If so, what version?
```
No
```

Not necessarily an issue but trying to find the best way for converting large KDB datasets to Pandas DataFrame in PyQ. For my users, I have built a sandbox that loads in a HDB from disk, and then have an API to interface with the data. It works well and fine, and is quick to pull the data (no CPU or MEM issues). The users are excited to use it and appreciate that it is sharing the same memory space. However most of them are non-q users and therefore use Pandas and Numpy. What is the optimal way for dealing with large HDB datasets and having them in Pandas or Numpy?

Btw, thanks for all the work on PyQ, I'm happy to share the work I've done or helping with a case study

antipisa commented 5 years ago

@sashkab and @abalkin, I have found the optimal way to convert a K table to pandas dataframe is by looping over types. It has to do with the way pandas stores their numeric and object columns.

Python 3 solution:

def k_to_df(ktbl):
    c = q.cols(ktbl)
    c1 = ktbl.meta.exec('c', where='t in "hijbxef"')
    df = pd.DataFrame(dict(q('{$[count x;flip x#y;()!()]}',c1,ktbl)))
    for col in q.except_(c,c1):
        df[col] = ktbl[col]
    return df.reindex(c, axis='columns', copy=False)

zak-b2c2 commented 5 years ago

Thanks @antipisa but it seems it is pretty slow, or maybe I'm missing something? The below solution is much faster:

def to_pandas(tbl):
    """
    Converts a kdb+ table to Pandas DataFrame
    """
    df = pd.DataFrame()
    for c in tbl.cols:
        if (str(q('meta', tbl)[c]['t']) == 's') and getattr(tbl, c).inspect(b't') >= 20:
            d = list(q('sym',numpy.asarray(tbl[c].data)))
        elif str(q('meta', tbl)[c]['t']) == 'C':
            d = [str(x) for x in tbl[c]]
        else:
            d = numpy.asarray(tbl[c])
        df[c] = d
    return df

In [3]: fills.count
Out[3]: k('109112')
In [4]: %time to_pandas(fills)
CPU times: user 288 ms, sys: 44 ms, total: 332 ms
Wall time: 330 ms

In [6]: %time k_to_df(fills);
CPU times: user 24.7 s, sys: 196 ms, total: 24.9 s
Wall time: 25.7 s

github-actions[bot] commented 4 years ago

Stale issue message

KxSystems / pyq

How to deal with large datasets in Pandas #82

Questions