KxSystems / embedPy

Allows the kdb+ interpreter to call Python functions
https://code.kx.com/q/interfaces
Apache License 2.0

Fastest way to convert large kdb tables to pandas dataframes #119

Open antipisa opened 3 weeks ago

antipisa commented 3 weeks ago

What is the correct way to convert a large generic kdb+ table to a pandas DataFrame? Could you share something similar to pyq's conversion function:

import numpy
import pandas as pd
from pyq import q  # pyq exposes the embedded q interpreter

def to_pandas(tbl):
    """
    Converts a kdb+ table to Pandas DataFrame
    """
    df = pd.DataFrame()
    for c in tbl.cols:
        if (str(q('meta', tbl)[c]['t']) == 's') and getattr(tbl, c).inspect(b't') >= 20:
            d = list(q('sym',numpy.asarray(tbl[c].data)))
        elif str(q('meta', tbl)[c]['t']) == 'C':
            d = [str(x) for x in tbl[c]]
        else:
            d = numpy.asarray(tbl[c])
        df[c] = d
    return df

For example, the function .ml.tab2df offered by Kx/ml is very slow:

i.pandasDF:.p.import[`pandas]`:DataFrame

// @kind function
// @category utilities
// @desc Convert q table to Pandas dataframe
// @param tab {table} A q table
// @return {<} a Pandas dataframe
.ml.tab2df:{[tab]
  updTab:@[flip 0!tab;i.findCols[tab;"c"];enlist each];
  transformTab:@[updTab;i.findCols[tab]"pmdznuvt";i.q2npDate];
  pandasDF:i.pandasDF[transformTab][@;cols tab];
  $[count keyTab:keys tab;
    pandasDF[`:set_index]keyTab;
    pandasDF
    ]
  }

N:10000000
dat:([]date:2001.01.01; sym:`A; qty:N?100f)
\t .ml.tab2df dat
55440
rianoc-kx commented 2 weeks ago

If it is possible for you to use it instead, PyKX will offer much improved performance here: https://code.kx.com/pykx/2.5/pykx-under-q/intro.html

A PyKX-enabled personal license is available here: https://kx.com/kdb-insights-personal-edition-license-download/

Install:

pip install pykx
python -c "import pykx;pykx.install_into_QHOME()"
q)\l pykx.q
q)N:10000000
q)dat:([]date:2001.01.01; sym:`A; qty:N?100f)
q)\t .pykx.print .pykx.topd[dat]
                date sym        qty
  0       2001-01-01   A  39.275238
  1       2001-01-01   A  51.709112
  2       2001-01-01   A  51.597965
  3       2001-01-01   A  40.666419
  4       2001-01-01   A  17.808386
  ...            ...  ..        ...
  9999995 2001-01-01   A  56.748184
  9999996 2001-01-01   A  38.299587
  9999997 2001-01-01   A  28.465881
  9999998 2001-01-01   A  54.208987
  9999999 2001-01-01   A  45.436839

  [10000000 rows x 3 columns]
143
antipisa commented 2 weeks ago

@rianoc-kx We have a lot of existing systems built on embedPy and cannot just switch to PyKX, though I understand KX has been pushing for this. Is there any way to replicate the .pykx.topd logic within embedPy?

cmccarthy1 commented 2 weeks ago

Hi @antipisa,

Unfortunately there isn't a straightforward way of doing this. The approach PyKX takes to do this as efficiently as possible is quite low-level, and a significant amount of rework would be needed for it to work with embedPy. Conceptually, embedPy has no understanding of kdb+ table objects; its approach to handling data is ultimately to assume a 1-to-1 mapping between numpy and q vectors at all times. We would need to introduce that concept, and I'd estimate the work at roughly 1-2 months.
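
As a minimal illustration of that vector-level mapping (the variable names here are arbitrary, and the expected results are noted in comments):

np:.p.import`numpy
v:10?100f                / a simple q float vector
arr:np[`:asarray]v       / crosses into Python as a 1-D numpy array (float64)
.p.print arr[`:dtype]    / expected output: float64
v~arr`                   / trailing ` converts back to a q vector; 1b - the round trip is lossless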

As it stands, the closest equivalent of topd is the ML Toolkit functionality for doing this. There may be some optimisations to be had, but as Rian points out, PyKX is where most of our Python/q integration effort is concentrated today.

Conor

antipisa commented 2 weeks ago

Ok, but as with pyq there must be a way to handle columns more efficiently by type. For instance this works in pyq:


import numpy as np
import pandas as pd
from pyq import q

def tab2df(Kobj):
    c = q.cols(Kobj)
    c1 = Kobj.meta.exec("c", where='t in "bxhijef"')
    df = pd.DataFrame(dict(q("{$[count x; flip x#y; ()!()]}", c1, Kobj)))
    for col in q.except_(c,c1):
        try:
            df[col] = Kobj[col]
        except AttributeError:
            df[col] = np.array(Kobj[col], dtype='str')
    return df
antipisa commented 4 days ago

@cmccarthy1 @rianoc-kx , you cannot keep releasing versions of kdb that integrate with python like embedpy and pyq, and then decommission them by forcing people to pay extra to use pykx. That's a total racket. Those who pay for kdb core licenses should be allowed to stay on older versions of pyq or embedpy -- or you should make pykx free. Creating new spinoff versions of pyq -- which incidentally was free -- and then withholding answers to the most basic questions is reprehensible.

cmccarthy1 commented 4 days ago

Hi @antipisa

Apologies for the delay in responding; we were looking into the issue to weigh whether a solution like the one you have suggested would be reasonable to integrate.

I understand your frustration and appreciate the feedback. We'll look to integrate something akin to the above in the ML Toolkit rather than embedPy itself; as I mentioned previously, the vast majority of embedPy is very low-level, and adding a dependency on Pandas to embedPy is something I would prefer us to avoid. It will likely be next week before we have a handle on what, if anything, we can provide as a performance update.

Conor

cmccarthy1 commented 3 days ago

@antipisa just for clarity regarding your setup, what version of Pandas are you using?

cmccarthy1 commented 3 days ago

This is a WIP, but it is ~10x faster:

/ defined in the .ml namespace (alongside the i.* utilities), hence the .ml.tab2df2 calls below
tab2df2:{
  c:cols x;
  / columns of simple types that map 1-to-1 onto numpy arrays
  c1:i.findCols[x;"bxhijef"];
  / build the DataFrame directly from a dictionary of those columns
  df:i.pandasDF[{$[count y;y!x y;()!()]}[x;c1]];
  / remaining columns need per-type handling
  cls:c except c1;
  if[0=count cls;:df];
  / char-vector (string) columns: enlist each row, as in .ml.tab2df above
  updTab:@[flip 0!x;i.findCols[x;"c"];enlist each];
  / temporal columns: cast to timestamps and shift from the kdb+ epoch to the Unix epoch
  timeCols:i.findCols[x;"pmdznuvt"];
  timeTab:?[updTab;();0b;timeCols!timeCols];
  timeTab:@[timeTab;timeCols;{("p"$@[4#+["d"$0];-16+type x]x)-"p"$1970.01m}];
  / attach each temporal column as a numpy datetime64[ns] array via DataFrame.assign
  df:{x[`:assign][z pykw .p.import[`numpy;`:array][y z;"datetime64[ns]"]]}/[df;count[timeCols]#enlist timeTab;timeCols];
  / set the remaining non-temporal columns one by one, converting symbols to strings
  {x[=;z;enlist $[11h=type dat:y z;string dat;dat]]}[df;updTab]each cls except timeCols;
  / restore the original column order; use the key columns as the index if the table is keyed
  df:df[`:reindex][`columns pykw c];
  $[count keyTab:keys x;
    df[`:set_index]keyTab;
    df
    ]
 }

I'll need to test it, but it should be possible to integrate it into the ML Toolkit as a replacement for the current behaviour.

q)N:10000000
q)dat:([]date:2001.01.01; sym:`A; qty:N?100f)
q)\t .ml.tab2df2[dat]
744
q)\t .ml.tab2df[dat]
9394
q).ml.df2tab[.ml.tab2df2 dat]~.ml.df2tab .ml.tab2df[dat]
1b
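
For reference, the temporal step in the sketch above casts each column to timestamps and shifts it from the kdb+ epoch (2000.01.01) to the Unix epoch expected by numpy's datetime64[ns]; the @[4#+["d"$0];-16+type x] part appears to anchor the pure time/duration types on 2000.01.01 ("d"$0) before that cast. A standalone check of the epoch shift on an illustrative date vector:

q)d:2001.01.01 2001.01.02       / kdb+ dates count from 2000.01.01
q)("p"$d)-"p"$1970.01m          / cast to timestamp, then shift to the Unix epoch
11323D00:00:00.000000000 11324D00:00:00.000000000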
antipisa commented 3 days ago

@cmccarthy1 is there an easier/simpler raw conversion function that skips the timezone-offset handling needed for pandas and just covers the standard kdb+ datatypes like float/timestamp/date/symbol? That would allow you to bypass pandas completely and just use numpy arrays. I am on pandas 2.2.1.
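
For what it's worth, a raw, pandas-free conversion along those lines could be sketched in embedPy as below. This is a hypothetical sketch, not an existing helper: tab2np is an invented name, and temporal columns are deliberately left to the epoch-shift treatment shown earlier rather than relying on embedPy's default temporal conversion.

/ hypothetical sketch: per-column numpy conversion, no pandas involved
np:.p.import`numpy
tab2np:{[t]
  t:0!t;
  / symbols cross as lists of strings; other simple vectors use the direct q/numpy mapping;
  / temporal columns would still want the epoch-shift handling from tab2df2 above
  (cols t)!{np[`:asarray]$[11h=type c:x y;string c;c]}[t]each cols t
  }

The result is a q dictionary mapping column names to embedPy handles on numpy arrays, which can be passed on to further Python code without ever building a DataFrame.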

cmccarthy1 commented 3 days ago

There isn't actually an issue with the timezones here; the timezone problem arises in the reverse case, namely going from DataFrames to kdb+ tables. The implementation opened as an MR against the ML repository is a 1-to-1 match with the original behaviour as far as I can see from testing it. The comment on timezones in the unedited message was a misunderstanding of the original issue.

cmccarthy1 commented 22 hours ago

Hi @antipisa,

This has now been reviewed internally and merged to the ML Toolkit repo here.

Please let us know if you find any issues with the implementation there,

Conor

antipisa commented 13 hours ago

@cmccarthy1 thank you so much!!