KxSystems / pyq

PyQ — Python for kdb+
http://code.kx.com/q/interfaces
Apache License 2.0

String columns as arrays #88

Closed antipisa closed 4 years ago

antipisa commented 5 years ago
pyq.versions()
PyQ 4.1.3
NumPy 1.14.3
KDB+ 3.5 (2018.04.25) l64
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:17:00)
[GCC 7.2.0]

Columns of string type are difficult to work with, since nested lists are converted into either multi-dimensional numpy arrays or numpy object arrays of K lists, depending on whether the inner lengths match.

Nested lists of equal length are converted to multi-dimensional np arrays:

np.asarray(K([[1,2],[3,4]]))
array([[1, 2],
       [3, 4]])

But nested lists of uneven lengths are converted to an np array of K lists:

np.asarray(K([[0,1,2],[3,4]]))
array([k('0 1 2'), k('3 4')], dtype=object)

This is especially problematic when handling string columns. Converting the column to symbol type first is unrealistic due to encoding:

np.asarray(q("string `abc`defg`hijkl"))
array([k('"abc"'), k('"defg"'), k('"hijkl"')], dtype=object)

I am using the following hack for now, but it is inefficient because it converts every string in a Python-level loop:

np.asarray([bytes(x) for x in q("string `abc`defg`hijkl")])
array([b'abc', b'defg', b'hijkl'], dtype='|S5')
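
One way to avoid the per-item `bytes(x)` call might be to move the data as a single flat character buffer plus a vector of lengths, and let NumPy do the splitting. The sketch below is pure NumPy and does not touch PyQ; `pack_strings` is a hypothetical helper name, and how the flat buffer and lengths would be obtained from kdb+ (for example with `raze` and `count each` on the q side) is an assumption that is not tested here.

```python
import numpy as np


def pack_strings(flat, lengths):
    """Build a fixed-width bytes array from a flat buffer and per-item lengths.

    Hypothetical helper: `flat` is all strings concatenated, `lengths`
    gives the length of each string in order.
    """
    lengths = np.asarray(lengths, dtype=np.intp)
    n = lengths.size
    # Width of the widest string; at least 1 so the 'S' dtype is valid.
    width = max(1, int(lengths.max())) if n else 1
    # Zero-filled byte matrix: one row per string, NUL-padded on the right.
    out = np.zeros((n, width), dtype=np.uint8)
    # Mask of the valid character positions in each row (row-major order).
    mask = np.arange(width) < lengths[:, None]
    out[mask] = np.frombuffer(flat, dtype=np.uint8)
    # Reinterpret each row of bytes as one fixed-width string.
    return out.view(f"S{width}").ravel()


packed = pack_strings(b"abcdefghijkl", [3, 4, 5])
```

Because the result uses a fixed-width `S` dtype, shorter strings are NUL-padded and trailing NUL bytes are dropped on access, the same as in the list-comprehension version.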

Ideally, we would get the same behavior as plain Python lists and bytes passed to numpy:

np.array([[0,1,2],[3,4]])
array([list([0, 1, 2]), list([3, 4])], dtype=object)
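
For the ragged case, a minimal sketch of building such an object array explicitly is shown here. It assumes the inner items iterate like ordinary Python sequences (which the `bytes(x) for x in ...` hack suggests is true for K lists); `to_object_array` is a hypothetical name, and plain lists and tuples stand in for K objects. Note that NumPy 1.24 and later raise a `ValueError` for `np.array([[0,1,2],[3,4]])` unless `dtype=object` is given, so an explicit construction is also more portable.

```python
import numpy as np


def to_object_array(rows):
    """Return a 1-D object array whose elements are Python lists.

    Hypothetical helper: pre-allocating with dtype=object and assigning
    element by element keeps the result 1-D even when all rows happen
    to have the same length.
    """
    rows = list(rows)
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = list(row)
    return out


ragged = to_object_array([(0, 1, 2), (3, 4)])
```

This still loops in Python, so it addresses the shape of the result rather than the efficiency concern.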

np.array([b"abc",b"defg",b"hijkl"])
array([b'abc', b'defg', b'hijkl'], dtype='|S5')

github-actions[bot] commented 4 years ago

Stale issue message