abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
349 stars 23 forks source link

Redesign graph queries #72

Closed abstractqqq closed 3 months ago

abstractqqq commented 7 months ago
  1. A stand alone module for Graphs in the future for persistence

Graphs queries are very expensive and it is even more expensive when we need to reconstruct the graphs every time we run the queries provided by .graph right now. (This is the case for kdtrees right now) This is not a graph library, but can provide high performance Graph code for Data science purposes. I think it is appropriate to create a non polars-plugin for graphs that can return series, which means it works with eager dataframes.

  1. Simpler API (more priority)

List can represent edge list but I think having a node column and a "is_connected_to" column can also perfectly capture what we need from graphs. This also eliminates the need to implicitly indices that starts with 0 and will make the construction of graphs easier in the backend. User can simply do an explode to convert edge lists into this form as well.

rtbs-dev commented 4 months ago

I've done a lot of work on this in the past, and was actually looking into making a graph ext. for polars anyway. Actually, I think the way to go here is to represent the incidence flags in polars series/structs, such that a dataframe would represent different incidence structures/weights on the same point/line set. (see the wiki page here). This lets us get directed, simple, and hyper-graphs all in one datastructure (unlike petgraphs, iirc).

The trick is to then do our graph algorithms using the sparse matrix representations (COO), coming from the categorical incidence flags (the categoricals give us the indices/coords in the sparse matrix via get_physical.

I don't know if Rust has a native way to do sparse tensors that is well-used or agreed upon, but even just relying on the scipy sparse graph routines would give you a huge (!) performance boost vs. the traditional approaches, since iirc it already uses a lot of the old fortran linear algebra core libs (BLAS, LAPACK, etc).

IMO the path forward from there is to slowly rely more on a GraphBLAS backend, which iirc there are rust bindings for, if they might be a tad outdated.... everyone's doing stuff in python nowadays.

Say I want to treat my document-term matrix as a bipartite/hypergraph, and I want the biadjacency:

patt = (
    r'(?:[\w\d]+\b)'
    r'|(?:\b\w[\/\&]\w)\b'
    r'|(?:\b\w[\w\'\d]+)\b'
)

with pl.StringCache():
    tf = (df
     .with_columns(
         pl.col('OriginalShorttext')
         .str.to_lowercase()
         .str.extract_all(patt)
         .cast(pl.List(pl.Categorical))
         .alias('tokens')
     )
    )

gets us a typical dataframe with a new tokenized List(Categorical), such that every token has a consistent ID. Then, say I'm using the sparse library:

coords=(
    tf
    .select(
        'ID',
        pl.col('tokens')
        .list.eval(pl.element().to_physical())
    )
    .explode('tokens')
    .to_numpy().T
)
X = sparse.COO(coords=coords, data=1)

X is my biadjacency now. If the edges were weighted it's pretty straight-fwd to pass weights as data. Plus, a weighted graph via, say, the bipartite projection on tokens is just X.T@X, which gives the degree and adjacency matrix combo A+D, all still sparse (though, admittedly denser than we'd like, but backboning the hairball is it's own can of worms, and literally what my research is about, motivating my using polars for parallelization haha)

So, if we had a .sparse or .g accessor that represented the incidence matrices in-memory as sparse tensors, we open up all the graph capabilities of scipy sparse or GraphBLAS (which is a magnificent backend, highly recommend taking a look).

I've been working with some of our students/researchers to try this out in pandas, but a new polars module would be fantastic! Was just talking to @mkdjr about this yesterday.

abstractqqq commented 4 months ago

Thank you for sharing the ideas and the interest. I agree sparse matrices is the way to go and the library backend, Faer, which I am using, actually has great support for sparse matrices.

I used to have a eigenvector centrality function that passes through faer to get some help with sparse matrix multiplication and it worked out super well..

But yes the immediate future for the graph portion of this package is unclear. I am recently looking to rewrite a kdtree for data backed by ndarray and this is proving to be harder than I thought... Sigh..

Best of luck to your endeavors in graphs.

abstractqqq commented 4 months ago

Graph blas looks super interesting too!

abstractqqq commented 3 months ago

Graph related work will not be planned in near future.