KxSystems / pyq

PyQ — Python for kdb+
http://code.kx.com/q/interfaces
Apache License 2.0

Pandas incompatibility #64

Closed antipisa closed 6 years ago

antipisa commented 6 years ago

OSX 64-bit, PyQ version 4.1.2, kdb+ 32-bit. Used virtualenv, QHOME = /home/user/q/. Conda version 4.3.30.

I noticed a bug in the pandas indexing engine when pyq is imported. There seems to be a conflict in how categorical Interval indexes are handled.

import pandas as pd
import numpy as np

t = pd.DataFrame(dict(sym=np.arange(2), y=1., z=-1.))
t.loc[:, 'x'] = pd.Series([pd.Interval(-1., 0.0, closed='right'), pd.Interval(0.0, 1, closed='right')])
t.set_index('x', inplace=True)
t.index = pd.Categorical(t.index)
t.loc[t.index.categories[0], :]

This returns

sym    0.0
y      1.0
z     -1.0
Name: (-1.0, 0.0], dtype: float64

in a Python 2.7 environment. However, in a pyq IPython environment, adding a pyq import statement

import pandas as pd
import numpy as np
from pyq import q

t = pd.DataFrame(dict(sym=np.arange(2), y=1., z=-1.))
t.loc[:, 'x'] = pd.Series([pd.Interval(-1., 0.0, closed='right'), pd.Interval(0.0, 1, closed='right')])
t.set_index('x', inplace=True)
t.index = pd.Categorical(t.index)
t.loc[t.index.categories[0], :]

results in a type error

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
TypeError: 'slice(0,2,None)' is an invalid key

So I suspect that the keys in a pandas categorical index are coerced into integer slices, which then cannot be used for index slicing.

I am using pandas 0.22, PyQ 4.1.2, numpy 1.14.2, IPython 5.3.0, kdb+ 3.5 64bit.

abalkin commented 6 years ago

This is a curious case. I was able to trace this to the following difference in how numpy behaves under pyq and in plain python:

$ pyq -c "import numpy as np;print(np.nextafter(0.0, 1.0).hex())"
0x0.0p+0
$ python -c "import numpy as np;print(np.nextafter(0.0, 1.0).hex())"
0x0.0000000000001p-1022

I suspect kdb+ changes the state of the floating point unit in ways that numpy does not expect.

abalkin commented 6 years ago

We can demonstrate the issue without numpy:

$ pyq -c "print(float.fromhex('0x0.0000000000001p-1022'))"
0.0
$ python -c "print(float.fromhex('0x0.0000000000001p-1022'))"
5e-324

abalkin commented 6 years ago

... or even simpler:

$ pyq -c "print(2.0**(-1024))"
0.0
$ python -c "print(2.0**(-1024))"
5.562684646268003e-309
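
The comparison above can also be turned into a runtime check. The following is a minimal sketch, not part of PyQ or numpy; the function name `subnormals_flushed` is hypothetical. It assumes that halving the smallest normal double yields a nonzero subnormal under default IEEE 754 behavior, and zero when subnormals are being flushed.

```python
import sys


def subnormals_flushed():
    """Return True if subnormal results appear to be flushed to zero.

    Hypothetical helper for illustration only.
    """
    smallest_normal = sys.float_info.min  # 2.0**-1022 for IEEE 754 doubles
    # Under default IEEE 754 behavior this division gives the subnormal
    # 2.0**-1023, which is nonzero. If subnormals are flushed, it compares
    # equal to zero instead.
    return smallest_normal / 2.0 == 0.0


if __name__ == "__main__":
    print("subnormals flushed:", subnormals_flushed())
```

Calling this once in plain python and once after `from pyq import q` should show whether the floating point state differs between the two environments.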

abalkin commented 6 years ago

I am going to close this issue as "wontfix". You can work around this issue by installing the "daz" package:

$ pip install daz

and adding the following code near the start of your script:

import daz
daz.unset_ftz()
daz.unset_daz()

I would also recommend that you report this issue to the pandas team. They should not rely on the behavior of subnormal floats when dealing with categorical data. Bit-casting floats to integers when computing categorical labels should do the trick. See the code in pandas/core/indexes/interval.py.
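
To illustrate what bit-casting could look like, here is a rough sketch using only the standard library. It is not pandas code, and the names `float_bits` and `ordered_key` are hypothetical. The idea is that a double's 64-bit pattern, reinterpreted as an integer, gives adjacent integers for adjacent representable floats, so a "next label" can be computed with integer arithmetic instead of producing a subnormal float. Infinities and NaN are not handled here.

```python
import struct


def float_bits(x):
    """Reinterpret the 8 bytes of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def ordered_key(x):
    """Map a finite double to an integer that sorts the same way as the float.

    Non-negative floats keep their bit pattern. Negative floats have the
    sign bit set, so the magnitude bits are negated to preserve ordering.
    Both 0.0 and -0.0 map to 0.
    """
    bits = float_bits(x)
    if bits >= 0:
        return bits
    return -(bits & 0x7FFFFFFFFFFFFFFF)


if __name__ == "__main__":
    # The integer key of the next representable value after x is simply
    # ordered_key(x) + 1, with no floating point arithmetic involved.
    print(ordered_key(0.0), ordered_key(0.0) + 1)
```

With keys like these, the right endpoint 0.0 of one interval and the smallest value above it differ by exactly one, regardless of how the floating point unit treats subnormals.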

Note that for the purposes of reporting this to pandas, you can reproduce your issue in stand-alone Python by adding the following code near the start of your script:

import daz
daz.set_ftz()
daz.set_daz()