abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
349 stars 23 forks

Support for graph series/algorithms #55

Closed furlat closed 1 month ago

furlat commented 8 months ago

In some situations I have columns containing lists of edges, or even rows of an adjacency-matrix-like representation of my data. It would be lovely to apply graph-based operations, like collecting all the rows on the shortest path from one row (node) to another.

Happy to elaborate if this does not sound stupid

abstractqqq commented 8 months ago

Hi @furlat ,

It is something I find interesting, although dataframes may not be the optimal data structure to do this efficiently. That said, I think some graph support should be added. Could you please provide a minimal example of a dataframe and the operation you want to perform on it? Thanks

furlat commented 8 months ago

Hi thanks a lot for the fast reply!!

Something like this? It is totally made up and ignores the various df contexts, just showing expressions:

Simulated Polars DataFrame for Graph Representation

import polars as pl

# Sample input data
data = {
    'node_id': [1, 2, 3, 4, 5],
    'edges': [[2, 3], [1, 4], [1, 5], [2, 5], [3, 4]]
}

GraphDataFrame = pl.DataFrame(data)
print(GraphDataFrame.head(5))

shape: (5, 2)
┌─────────┬───────────┐
│ node_id ┆ edges     │
│ ---     ┆ ---       │
│ i64     ┆ list[i64] │
╞═════════╪═══════════╡
│ 1       ┆ [2, 3]    │
│ 2       ┆ [1, 4]    │
│ 3       ┆ [1, 5]    │
│ 4       ┆ [2, 5]    │
│ 5       ┆ [3, 4]    │
└─────────┴───────────┘

# Degree of each node
degree_series = GraphDataFrame.graph.node_metrics.degree()
print(degree_series.head(5))

# Output
shape: (5, 1)
┌────────┐
│ degree │
│ ---    │
│ i64    │
╞════════╡
│ 2      │
│ 2      │
│ 2      │
│ 2      │
│ 2      │
└────────┘
# Eigenvector centrality of each node
centrality_series = GraphDataFrame.graph.node_metrics.eigenvector_centrality()
print(centrality_series.head(5))

# Output
shape: (5, 1)
┌────────────────────┐
│ eigenvector_centr  │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.447214           │
│ 0.447214           │
│ 0.447214           │
│ 0.447214           │
│ 0.447214           │
└────────────────────┘
# Shortest path distances from all nodes to node_id 3, with path represented as list of node IDs
distances = GraphDataFrame.graph.distances.short_path(3, algorithm='dijkstra', create_path_column=True)
print(distances.head(5))

# Output
shape: (5, 3)
┌─────────┬──────────┬────────────┐
│ node_id ┆ distance ┆ path       │
│ ---     ┆ ---      ┆ ---        │
│ i64     ┆ i64      ┆ list[i64]  │
╞═════════╪══════════╪════════════╡
│ 1       ┆ 1        ┆ [1, 3]     │
│ 2       ┆ 2        ┆ [2, 1, 3]  │
│ 3       ┆ 0        ┆ [3]        │
│ 4       ┆ 2        ┆ [4, 2, 3]  │
│ 5       ┆ 1        ┆ [5, 3]     │
└─────────┴──────────┴────────────┘
# Shortest path distances from all nodes to node_ids 3 and 4
targets = [3, 4]
distances = GraphDataFrame.graph.distances.short_path(targets, algorithm='dijkstra', create_path_column=True)
print(distances.head(5))

# Output
shape: (5, 5)
┌─────────┬─────────────────┬───────────────┬─────────────────┬───────────────┐
│ node_id ┆ distance_to_3   ┆ path_to_3     ┆ distance_to_4   ┆ path_to_4     │
│ ---     ┆ ---             ┆ ---           ┆ ---             ┆ ---           │
│ i64     ┆ i64             ┆ list[i64]     ┆ i64             ┆ list[i64]     │
╞═════════╪═════════════════╪═══════════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1               ┆ [1, 3]        ┆ 2               ┆ [1, 2, 4]     │
│ 2       ┆ 2               ┆ [2, 1, 3]     ┆ 1               ┆ [2, 4]        │
│ 3       ┆ 0               ┆ [3]           ┆ 2               ┆ [3, 5, 4]     │
│ 4       ┆ 2               ┆ [4, 2, 3]     ┆ 0               ┆ [4]           │
│ 5       ┆ 1               ┆ [5, 3]        ┆ 1               ┆ [5, 4]        │
└─────────┴─────────────────┴───────────────┴─────────────────┴───────────────┘
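(For sanity-checking distances like the ones above on this unweighted sample graph, a plain-Python BFS over the adjacency lists is enough; this is just a reference check, not a proposed API.)

```python
from collections import deque

# Sample adjacency lists from the frame above (an undirected 5-cycle).
adj = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}

def bfs_distances(adj: dict[int, list[int]], target: int) -> dict[int, int]:
    """Hop distance from every reachable node to `target` via unweighted BFS."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```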
# Grouping based on node clusters and aggregating
clustered_grouping = GraphDataFrame.graph_group_by.agglomerative_clustering(num_clusters=2, depth=1).agg([
    pl.col("cluster"),
    pl.col("edges").alias("connected_nodes")
])
print(clustered_grouping.head(5))

# Output
shape: (2, 2)
┌─────────┬──────────────────────────┐
│ cluster ┆ connected_nodes          │
│ ---     ┆ ---                      │
│ i64     ┆ list[list[i64]]          │
╞═════════╪══════════════════════════╡
│ 0       ┆ [[2, 3], [1, 4]]         │
│ 1       ┆ [[1, 5], [2, 5], [3, 4]] │
└─────────┴──────────────────────────┘
furlat commented 8 months ago

I forgot to mention why I am posting this here: this sort of graph often comes out of nearest-neighbor filtering of some vector space, and I see it as just the natural next step of processing after obtaining the list of neighbors from your example

df.with_columns(
    pl.col("id").num.knn_ptwise(
        pl.col("val1"), pl.col("val2"), 
        k = 3, dist = "haversine", parallel = True
    ).alias("nearest neighbor ids")
)

shape: (5, 6)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────────────────┐
│ id  ┆ val1     ┆ val2     ┆ val3     ┆ val4     ┆ nearest neighbor ids │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ list[u64]            │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════════════════╡
│ 0   ┆ 0.804226 ┆ 0.937055 ┆ 0.401005 ┆ 0.119566 ┆ [0, 3, … 0]          │
│ 1   ┆ 0.526691 ┆ 0.562369 ┆ 0.061444 ┆ 0.520291 ┆ [1, 4, … 4]          │
│ 2   ┆ 0.225055 ┆ 0.080344 ┆ 0.425962 ┆ 0.924262 ┆ [2, 1, … 1]          │
│ 3   ┆ 0.697264 ┆ 0.112253 ┆ 0.666238 ┆ 0.45823  ┆ [3, 1, … 0]          │
│ 4   ┆ 0.227807 ┆ 0.734995 ┆ 0.225657 ┆ 0.668077 ┆ [4, 4, … 0]          │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────────────────┘

I think being able to natively parallelize graph operations and combine them with the rest of polars' filtering over node attributes would be great (an incredibly underspecified situation for graph frameworks). Also, in all honesty, I am being a bit biased, but I would love to move my entire non-GPU stack to polars to stay away from native Python CPU processing. I always end up spending so much time managing things that would be trivial queries in polars, in whatever data structure the package of the day wants.

abstractqqq commented 8 months ago

@furlat

I believe I have a working eigenvector centrality. See the branch here: https://github.com/abstractqqq/polars_ds_extension/tree/graph and take a look in graph.py.

It is pretty slow right now, taking 2.8ms for 1000 rows, each with 10-150 neighbors. I also added query_radius_ptwise to help generate these edge lists with unequal sizes. You can find it in examples.ipynb.

Would you be interested in testing it? (That would require you to build and compile the package locally.) The reason I am asking is that I am mostly learning these topics as I go, and it would be great if someone could help me with testing and debugging.
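(For anyone testing: a tiny power-iteration reference in plain numpy, independent of the branch code, is handy for cross-checking the centrality numbers on small graphs. This is just a sketch, not the branch implementation.)

```python
import numpy as np

def eigenvector_centrality(A: np.ndarray, iters: int = 200, tol: float = 1e-10) -> np.ndarray:
    """Power iteration on the adjacency matrix; returns the unit-norm
    dominant eigenvector (valid for connected, undirected graphs)."""
    x = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            return y
        x = y
    return x

# The 5-cycle from the example frame: every node should score the same.
A = np.zeros((5, 5))
for i, nbrs in enumerate([[2, 3], [1, 4], [1, 5], [2, 5], [3, 4]]):
    for j in nbrs:
        A[i, j - 1] = 1.0  # node ids in the example are 1-based
```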

furlat commented 8 months ago

Whaooo! Definitely yes. Unfortunately I can't help for the next 3 days, but from next week I can dedicate quite some time to testing and helping on the Python side!

Also, an interesting GitHub repo: https://github.com/Splines/fast-louvain . fast-louvain is quite a formidable algorithm and extends easily to temporal or multiplexed graphs. Again, it does not seem fully complete, but it also has a nice manual on the graph-theory part: https://splines.github.io/fast-louvain/modularity/formula.html
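(The modularity Q that Louvain optimizes, as laid out in that manual, is compact enough to sketch directly; here is a naive O(n²) reference over a dense adjacency matrix, just for illustration.)

```python
import numpy as np

def modularity(A: np.ndarray, communities: list[int]) -> float:
    """Newman modularity: Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)      # node degrees
    two_m = A.sum()        # 2m: total edge weight, each edge counted both ways
    q = 0.0
    for i, ci in enumerate(communities):
        for j, cj in enumerate(communities):
            if ci == cj:
                q += A[i, j] - k[i] * k[j] / two_m
    return q / two_m

# The 5-cycle from the earlier example; a single community scores exactly 0.
A = np.zeros((5, 5))
for i, nbrs in enumerate([[2, 3], [1, 4], [1, 5], [2, 5], [3, 4]]):
    for j in nbrs:
        A[i, j - 1] = 1.0
```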

abstractqqq commented 7 months ago

Some graph stuff is there. But on second thought, I really think a separate data structure is ideal for graphs. As long as the graph structure can return arrays/things that can easily be turned into Series, I think it can work with dataframes. Having graph queries work with lazy frames would be a dream. Building a graph takes a lot of effort, graphs cannot really be reused when they are expressions, and graph algorithms are typically slower than the rest of the expressions...

So after some experimentation with graph expressions, I think the API you proposed above makes more sense, although it has to be altered to some extent. I will try to make more progress in v0.3.x

furlat commented 7 months ago

Hi! Unfortunately I have a big presentation tomorrow and got a bit delayed on my testing. I will start playing around with the package again from the weekend and will be able to give you some proper feedback.

I have some ideas regarding your points above, but I am not sure they are correct, so I will first take my time to play around with the graph algos (and benchmark vs dumping to numpy) and update you asap!

abstractqqq commented 7 months ago

> Hi! Unfortunately I have a big presentation tomorrow and got a bit delayed on my testing I will start playing around with the package again from the week-end and will be able to give you some proper feedback.
>
> I have some ideas regarding your points above but I am not sure they are correct, so I will first take my time to play around with the graph algos (and benchmark vs dumping to numpy) and update you asap!

No worries and no rush. I have decided to implement most graph algos as expressions first, just to make them convenient to use, despite them being slow (and possibly they cannot be any faster).

FYI, I am planning to release v0.3.1 soon. Graph is refactored, and the petgraph crate is used instead of pathfinding.

abstractqqq commented 1 month ago

Won't fix. Graph-related features will no longer be supported.