KxSystems / pyq

PyQ — Python for kdb+
http://code.kx.com/q/interfaces
Apache License 2.0

Symbol is broken when imported into Pandas DataFrame #2

Closed sitsang closed 7 years ago

sitsang commented 7 years ago

The following two snippets produce different results:

import pyq,pandas
pyq.q("sym:`a`b`c`d")
dict(pyq.q("flip ([]sym:`sym$`a`b`c`d;upd:til 4)"))['sym']

This returns an enumerated list of k symbols, as expected:

k('`sym$`a`b`c`d')

But when imported into a DataFrame, the column becomes integers:

import pyq,pandas
pyq.q("sym:`a`b`c`d")
pandas.DataFrame(dict(pyq.q("flip ([]sym:`sym$`a`b`c`d;upd:til 4)")))['sym']
0    0
1    1
2    2
3    3

This is not right. I understand that the underlying enum is an integer, but an enumerated sym should come back as a string in pandas rather than as a number.

abalkin commented 7 years ago

This is the result of a design decision to expose the memoryview of enums as integers:

>>> from pyq import q
>>> s = q("`sym?`a`b`c`d")
>>> m = memoryview(s)
>>> m.format
'l'

As a consequence, numpy arrays constructed from enums are integer as well:

>>> import numpy as np
>>> np.array(s)
array([0, 1, 2, 3])

If you want to get an array of strings, you need to de-enumerate s before passing it to the constructor:

>>> np.array(s.value)
array(['a', 'b', 'c', 'd'],
      dtype='|S1')

or

>>> np.array(s.value, 'O')
array(['a', 'b', 'c', 'd'], dtype=object)
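
The behaviour described above can be modelled in plain Python without PyQ. This is only a rough sketch: `domain` and `indices` are hypothetical stand-ins for the q list `sym` and the raw storage of the enumerated vector, not PyQ objects. It shows why a consumer that reads the buffer sees integers, and what de-enumeration amounts to.

```python
from array import array

# Hypothetical stand-ins: `domain` plays the role of the q list `sym`,
# and `indices` plays the role of the enumerated vector's raw storage.
domain = ["a", "b", "c", "d"]
indices = array("l", [0, 1, 2, 3])

# The buffer protocol only describes the raw storage, so a consumer
# that reads the buffer sees integers and never learns about the domain.
view = memoryview(indices)
print(view.format)    # l
print(view.tolist())  # [0, 1, 2, 3]

# De-enumeration is an explicit lookup of each index in the domain.
values = [domain[i] for i in indices]
print(values)         # ['a', 'b', 'c', 'd']
```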

abalkin commented 7 years ago

Note that q itself may treat enums as ints:

>>> del q.sym
>>> s
k('`sym!0 1 2 3i')

or in plain q:

q)s:`sym?`a`b`c`d
q)delete sym from `.
`.
q)s
`sym!0 1 2 3i

sitsang commented 7 years ago

I am aware that an enum symbol is really an integer.

But it would be nice if we could specify whether the converted Python objects use de-enumerated strings or not.

Also, it is not possible to read a single element of an enum, although this works fine for integer or symbol lists:

>>> pyq.q("`a`b`c`d")[0]
'a'
>>> pyq.q("`sym$`a`b`c`d")[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyq-3.8.1-py2.7-linux-x86_64.egg/pyq/__init__.py", line 355, in __getitem__
    return _k.K.__getitem__(self, x)
NotImplementedError: not implemented

sitsang commented 7 years ago

I think it would be beneficial to have a layer that converts the underlying k objects into Python objects.

https://github.com/exxeleron/qPython did a great job of converting kdb+ objects into pandas DataFrames, handling dates, times, and enumerated symbols so that the values are human-readable.

Is it easy to port the converter from qPython?

abalkin commented 7 years ago

Implementing __getitem__ for enums has long been on my todo list, but unfortunately Kx does not provide a public C API to do that efficiently. At some point I learned the private API for that, but I've learned the hard way that relying on private APIs in q is dangerous.

On the other hand, if we are not after a super-fast C implementation, we can add something like

>>> from pyq import *
>>> s = q("`sym?`a`b`c")
>>> s([0]).value[0]
'a'

as a case in __getitem__.

I'll add a feature request to our internal tracker.
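
As a rough illustration of the slow-path idea above, here is a pure-Python model. `EnumVector` is a hypothetical class invented for this sketch and is not part of PyQ. Its `__getitem__` selects a one-element vector, de-enumerates it, and returns the only item, mirroring `s([0]).value[0]`.

```python
class EnumVector:
    """Hypothetical model of an enumerated symbol vector (not a PyQ class)."""

    def __init__(self, domain, indices):
        self.domain = list(domain)
        self.indices = list(indices)

    @property
    def value(self):
        # De-enumerate the whole vector by looking up every index.
        return [self.domain[i] for i in self.indices]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Slicing keeps the result enumerated over the same domain.
            return EnumVector(self.domain, self.indices[i])
        # Slow path suggested in the thread: select a one-element
        # vector, de-enumerate it, and return its only item.
        return EnumVector(self.domain, [self.indices[i]]).value[0]


s = EnumVector("abc", [0, 1, 2])
print(s[0])         # a
print(s[-1])        # c
print(s[1:].value)  # ['b', 'c']
```

The detour through a one-element vector is wasteful in this toy model, but it reflects the constraint in the thread: without a public C API for single-element access, the lookup has to go through an operation that q already supports.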

abalkin commented 7 years ago

> Is it easy to port the converter from qPython?

What we can do is to add an __array__ method to K objects that will return a record array when the K object is a table.
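
To make the idea concrete without depending on PyQ or numpy, here is a minimal pure-Python sketch of turning a column-oriented table into rows while de-enumerating symbol columns. The function `table_to_records` and its `columns` and `domains` parameters are hypothetical names for this illustration, and plain tuples stand in for numpy records.

```python
def table_to_records(columns, domains):
    """Turn a column dictionary into a list of row tuples.

    `columns` maps column name -> list of values.
    `domains` maps the name of an enumerated column -> its domain list.
    Both parameters are hypothetical and are not PyQ's API.
    """
    names = list(columns)
    resolved = []
    for name in names:
        col = columns[name]
        if name in domains:
            # Enumerated column: replace each index with its symbol.
            col = [domains[name][i] for i in col]
        resolved.append(col)
    # Transpose the columns into rows.
    return names, list(zip(*resolved))


names, rows = table_to_records(
    {"sym": [0, 1, 2, 3], "upd": [0, 1, 2, 3]},
    {"sym": ["a", "b", "c", "d"]},
)
print(names)  # ['sym', 'upd']
print(rows)   # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
```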

abalkin commented 7 years ago

@sitsang - We have recently released PyQ 4.0, which includes several enhancements related to table to record array conversions. See "What's new in PyQ 4.0". For example, your original code now works as follows:

>>> t = q('([]sym:`sym?`a`b`c`d;upd:til 4)')
>>> pandas.DataFrame(dict(t.flip))
  sym  upd
0   a    0
1   b    1
2   c    2
3   d    3