KxSystems / pyq

PyQ — Python for kdb+
http://code.kx.com/q/interfaces
Apache License 2.0

Null value interpreted differently when imported to DataFrame from dictionary #4

Closed sitsang closed 7 years ago

sitsang commented 7 years ago

I found that we get different results when constructing a pandas DataFrame from a dictionary and from a list:

>>> pandas.DataFrame(dict(pyq.q("flip ([]empty:3#0Ni)")))
        empty
0 -2147483648
1 -2147483648
2 -2147483648

A null integer should be mapped to None. This works as expected when the column is first converted to a list:

>>> pandas.DataFrame(list(dict(pyq.q("flip ([]empty:3#0Ni)"))['empty']))
      0
0  None
1  None
2  None

abalkin commented 7 years ago

This is not entirely unexpected, because the pandas conversion goes through a numpy array, where the null int shows up as its raw value:

>>> x = q('3#0Ni')
>>> np.array(x)
array([-2147483648, -2147483648, -2147483648])

I don't know how pandas deals with missing values, but pyq has some support for numpy.ma:

>>> np.ma.array(x)
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 999999)
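
One way to carry such a mask into pandas can be sketched without pyq. This is only an illustration under stated assumptions: the input array stands in for `np.array(q('1 0N 3i'))`, the null int is assumed to arrive as the smallest 32-bit integer (as in the output above), and `NULL_INT` and `int_column_with_nulls` are hypothetical names, not part of pyq.

```python
import numpy as np
import pandas as pd

# Assumption: q's null int (0Ni) reaches numpy as the smallest 32-bit integer.
NULL_INT = np.iinfo(np.int32).min  # -2147483648


def int_column_with_nulls(raw):
    """Hypothetical helper: return a float64 array with the sentinel replaced by NaN."""
    raw = np.asarray(raw, dtype=np.int32)
    masked = np.ma.masked_equal(raw, NULL_INT)  # mask every sentinel value
    # float64 represents every int32 exactly, so only the nulls change.
    return masked.astype(np.float64).filled(np.nan)


# Stand-in for np.array(q('1 0N 3i')); no q process is needed to run this.
raw = np.array([1, NULL_INT, 3], dtype=np.int32)
df = pd.DataFrame({"empty": int_column_with_nulls(raw)})
print(df)
```

The cost of this approach is that the column becomes float64, so it trades the integer dtype for a proper missing-value marker.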

Nulls (0N) are also treated specially when K vectors are converted to lists:

>>> list(x)
[None, None, None]
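
What pandas then does with such a list depends on its contents. The sketch below uses plain Python lists as stand-ins for `list(q(...))`, so it runs without pyq. The last step uses the nullable `"Int64"` dtype, which is only available in newer pandas releases and is an addition for illustration, not something discussed in this thread.

```python
import pandas as pd

# Stand-in for list(q('1 0N 3i')): the null has already become None.
mixed = [1, None, 3]
df_mixed = pd.DataFrame({"empty": mixed})
# pandas infers float64 here and stores the None as NaN.
print(df_mixed["empty"].dtype)

# Stand-in for list(q('3#0Ni')): every element is None.
all_null = [None, None, None]
df_null = pd.DataFrame({"empty": all_null})
# With nothing numeric to infer from, the column stays object and keeps None.
print(df_null["empty"].dtype)

# Newer pandas versions offer a nullable integer dtype that keeps integers intact.
nullable = pd.Series(mixed, dtype="Int64")
print(nullable)
```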

I don't think there is much we can do to improve this. Support for missing values is flaky in numpy, and I am not sure pandas improves much in this area. We can probably document this behavior better, but that is true of most pyq features: the documentation can certainly be improved.

sitsang commented 7 years ago

I see. Is it mandatory to create a pandas DataFrame through numpy?

Shouldn't it be possible to generate the DataFrame through the list function rather than the array function?

abalkin commented 7 years ago

> Is it mandatory to create a pandas DataFrame through numpy?

Since pandas DataFrame keeps its data in a numpy.ndarray, going through numpy is the most direct way to convert from pyq to pandas. Note that in many cases pyq to numpy conversion can be achieved without any copying. See https://pyq.enlnt.com/slides/#/5.
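
What "without any copying" means can be illustrated with a generic buffer. This is a sketch under an assumption: a standard-library `array.array` stands in for the memory behind a K vector, and it does not show how pyq itself exposes that memory.

```python
import array

import numpy as np

# Hypothetical stand-in for the memory that backs a K int vector.
buf = array.array("i", [0, 1, 2, 3, 4])

# np.frombuffer wraps the existing memory instead of copying it.
view = np.frombuffer(buf, dtype=np.intc)

# A change made through the original buffer is visible through the numpy view.
buf[0] = 99
print(view[0])

# np.array makes an independent copy, so later changes do not reach it.
copied = np.array(view)
buf[1] = -1
print(view[1], copied[1])
```

Because the view shares memory with the source, no per-element conversion takes place, which is why this route is cheaper than building a Python list.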

> Shouldn't it be possible to generate dataframe through the list function rather than the array function?

It is possible, but it will be much slower, by roughly a factor of ten in this test:

In [12]: x = q.til(10000)

In [13]: %timeit pandas.DataFrame({'a': np.asarray(x)})
10000 loops, best of 3: 193 µs per loop

In [14]: %timeit pandas.DataFrame({'a': list(x)})
100 loops, best of 3: 2.13 ms per loop
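
The same comparison can be repeated outside IPython with the standard `timeit` module. In this sketch a plain numpy array stands in for `q.til(10000)`, so it measures only the list-versus-array cost on the pandas side, not pyq itself. The function names are illustrative, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Stand-in for q.til(10000); a real K vector is not needed for the comparison.
x = np.arange(10000, dtype=np.int64)


def via_array():
    """Build the frame directly from the array."""
    return pd.DataFrame({"a": np.asarray(x)})


def via_list():
    """Build the frame from a Python list, one element at a time."""
    return pd.DataFrame({"a": list(x)})


runs = 50
t_array = timeit.timeit(via_array, number=runs)
t_list = timeit.timeit(via_list, number=runs)

print(f"array: {t_array / runs * 1e6:.0f} us per call")
print(f"list:  {t_list / runs * 1e6:.0f} us per call")
```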