antipisa opened 3 weeks ago
If it is possible for you to switch, PyKX will offer much improved performance here: https://code.kx.com/pykx/2.5/pykx-under-q/intro.html
PyKX enabled personal license: https://kx.com/kdb-insights-personal-edition-license-download/
Install:
pip install pykx
python -c "import pykx;pykx.install_into_QHOME()"
q)\l pykx.q
q)N:10000000
q)dat:([]date:2001.01.01; sym:`A; qty:N?100f)
q)\t .pykx.print .pykx.topd[dat]
date sym qty
0 2001-01-01 A 39.275238
1 2001-01-01 A 51.709112
2 2001-01-01 A 51.597965
3 2001-01-01 A 40.666419
4 2001-01-01 A 17.808386
... ... .. ...
9999995 2001-01-01 A 56.748184
9999996 2001-01-01 A 38.299587
9999997 2001-01-01 A 28.465881
9999998 2001-01-01 A 54.208987
9999999 2001-01-01 A 45.436839
[10000000 rows x 3 columns]
143
@rianoc-kx We have a lot of existing systems in embedpy and cannot just switch to using pykx though I understand kx has been pushing for this. Is there any way to replicate the .pykx.topd logic within embedpy?
Hi @antipisa,
Unfortunately there isn't a straightforward way of doing this. The approach PyKX takes to do this as efficiently as possible is quite low-level, and a significant amount of rework would be needed for it to work with embedPy. Conceptually, embedPy has no understanding of kdb+ table objects; its approach to handling data is ultimately to assume a 1-to-1 mapping between numpy and q vectors at all times. We would need to introduce that concept, and I'd estimate the work at roughly 1-2 months.
The closest equivalent to topd as it stands today is the ML Toolkit functionality for doing this. There may be some optimisations to be had, but as Rian points out, PyKX is where most of our Python/q integration efforts are concentrated today.
Conor
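To illustrate the 1-to-1 vector mapping described above, here is a plain pandas/numpy sketch (not embedPy code): a q table is conceptually a dictionary of column vectors, so a table-aware conversion amounts to mapping each q column to a numpy array and assembling the DataFrame column-wise rather than row-wise.

```python
import numpy as np
import pandas as pd

# Illustrative sketch only -- not embedPy API. A q table is a flipped
# column dictionary, so a table-aware conversion maps each column to a
# numpy vector and builds the DataFrame column-wise (no row-wise work).
q_columns = {
    "sym": np.array(["A", "A", "B"], dtype=object),  # q symbol column
    "qty": np.array([39.27, 51.70, 51.59]),          # q float column
}
df = pd.DataFrame(q_columns)
```

The efficiency of any such conversion comes down to whether each column can be handed over as a single typed vector, which is exactly the 1-to-1 numpy/q mapping embedPy assumes.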
Ok, but as with pyq, there must be a way to handle columns more efficiently by type. For instance, this works in pyq:
import numpy as np
import pandas as pd
from pyq import q

def tab2df(Kobj):
    # split columns by q type char: b/x/h/i/j/e/f map 1-to-1 to numpy
    c = q.cols(Kobj)
    c1 = Kobj.meta.exec("c", where='t in "bxhijef"')
    df = pd.DataFrame(dict(q("{$[count x; flip x#y; ()!()]}", c1, Kobj)))
    for col in q.except_(c, c1):
        try:
            df[col] = Kobj[col]
        except AttributeError:
            # fall back to string conversion for types numpy can't map directly
            df[col] = np.array(Kobj[col], dtype='str')
    return df
@cmccarthy1 @rianoc-kx , you cannot keep releasing versions of kdb that integrate with python like embedpy and pyq, and then decommission them by forcing people to pay extra to use pykx. That's a total racket. Those who pay for kdb core licenses should be allowed to stay on older versions of pyq or embedpy -- or you should make pykx free. Creating new spinoff versions of pyq -- which incidentally was free -- and then withholding answers to the most basic questions is reprehensible.
Hi @antipisa
Apologies for the delay in responding; we were looking into the issue to assess whether a solution like the one you have suggested would be reasonable to integrate.
I understand your frustration and appreciate the feedback. We'll look to integrate something akin to the above in the ML Toolkit rather than embedPy itself; as I mentioned previously, the vast majority of embedPy is very low-level, and adding a dependency on Pandas to embedPy is something I would prefer to avoid. It will likely be next week before we have a handle on what, if anything, we can provide as a performance update.
Conor
@antipisa just for clarity with regards to your setup, what version of Pandas are you using?
This is a WIP but is ~10x faster:
tab2df2:{
  c:cols x;
  c1:i.findCols[x;"bxhijef"];
  df:i.pandasDF[{$[count y;y!x y;()!()]}[x;c1]];
  cls:c except c1;
  if[0=count cls;:df];
  updTab:@[flip 0!x;i.findCols[x;"c"];enlist each];
  timeCols:i.findCols[x;"pmdznuvt"];
  timeTab:?[updTab;();0b;timeCols!timeCols];
  timeTab:@[timeTab;timeCols;{("p"$@[4#+["d"$0];-16+type x]x)-"p"$1970.01m}];
  df:{x[`:assign][z pykw .p.import[`numpy;`:array][y z;"datetime64[ns]"]]}/[df;count[timeCols]#enlist timeTab;timeCols];
  {x[=;z;enlist $[11h=type dat:y z;string dat;dat]]}[df;updTab]each cls except timeCols;
  df:df[`:reindex][`columns pykw c];
  $[count keyTab:keys x;
    df[`:set_index]keyTab;
    df
    ]
  }
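The time-column handling above hinges on the epoch difference between the two systems: kdb+ temporal types count from 2000.01.01, while numpy's datetime64[ns] counts from the Unix epoch (1970-01-01), so the conversion reduces to a single vectorised offset. A minimal numpy sketch of that arithmetic (illustrative only, not the Toolkit's implementation):

```python
import numpy as np

# kdb+ timestamps are nanoseconds since 2000.01.01; datetime64[ns] counts
# from 1970-01-01, so conversion is one vectorised offset addition.
KDB_EPOCH = np.datetime64("2000-01-01", "ns")

def kdb_timestamp_to_datetime64(nanos: np.ndarray) -> np.ndarray:
    # nanos: int64 nanoseconds since the kdb+ epoch
    return KDB_EPOCH + nanos.astype("timedelta64[ns]")

stamps = kdb_timestamp_to_datetime64(
    np.array([0, 86_400_000_000_000], dtype=np.int64)  # 2000.01.01, +1 day
)
```

Because the offset is applied to the whole vector at once, this stays O(n) with no per-row Python overhead, which is where the ~10x speedup in the q version above comes from.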
I'll need to test it but should be able to integrate into the ML Toolkit as a replacement for current behaviour
q)N:10000000
q)dat:([]date:2001.01.01; sym:`A; qty:N?100f)
q)\t .ml.tab2df2[dat]
744
q)\t .ml.tab2df[dat]
9394
q).ml.df2tab[.ml.tab2df2 dat]~.ml.df2tab .ml.tab2df[dat]
1b
@cmccarthy1 is there an easier/simpler raw conversion function that does not handle the timezone offsets for pandas and just handles the standard kdb datatypes like float/timestamp/date/symbol? This would allow you to bypass pandas completely and just use numpy arrays. I am on pandas 2.2.1.
There isn't actually an issue with the timezones here. The timezone problem arises in the reverse case, namely going from dataframes to kdb+ tables. The implementation that's opened as an MR to the ML repository is a 1-to-1 match with the original behaviour as far as I can see from testing. The comment on the timezones in the unedited message was a misunderstanding of the original issue.
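For the reverse (dataframe to kdb+) direction mentioned above, the wrinkle is that pandas timestamps may be timezone-aware while kdb+ timestamps are naive. A hedged pandas sketch of the normalisation step such a conversion would need (the helper name normalise_tz is mine, not from any library):

```python
import pandas as pd

# Hypothetical helper: normalise a pandas datetime column to naive UTC
# before handing it back to kdb+, which has no timezone concept.
def normalise_tz(s: pd.Series) -> pd.Series:
    if s.dt.tz is not None:
        # convert to UTC, then drop the tz info to get a naive timestamp
        return s.dt.tz_convert("UTC").dt.tz_localize(None)
    return s

s = pd.Series(pd.to_datetime(["2001-01-01 05:00"])).dt.tz_localize("US/Eastern")
out = normalise_tz(s)
```

Normalising to UTC first makes the round trip deterministic; dropping the tz info without converting would silently shift wall-clock times.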
Hi @antipisa,
This has now been reviewed internally and merged to the ML Toolkit repo here.
Please let us know if you find any issues with the implementation there,
Conor
@cmccarthy1 thank you so much!!
What is the correct way to convert a large generic kdb+ table to a pandas dataframe? Could you share something similar to pyq's conversion function?
For example, the function .ml.df2tab offered by Kx/ml is very slow: