AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations
MIT License

load graph from a Pandas Dataframe #43

Closed mkarmona closed 5 months ago

mkarmona commented 1 year ago

Big data engineering processes using Apache Spark produce triple sets. To avoid tedious IO serialisation and coalescing to/from CSV files, PySpark provides the toPandas() method. This method collects the partitioned and distributed dataset into the local memory of the driver node and makes it accessible as a Pandas data frame. Thus, having a graph constructor that works straight from already produced data frames would be really convenient.

pedges = edges.toPandas()
pnodes = nodes.toPandas() 

g = (Graph.from_pd(directed=True, 
                    node_path=pnodes,
                    nodes_column_number=0,
                    node_list_node_types_column_number=1,
                    edge_path=pedges,
                    sources_column_number=0,
                    destinations_column_number=2,
                    edge_list_edge_types_column_number=4,
                    weights_column_number=11)
       .remove_components(top_k_components=1)
    )
LucaCappelletti94 commented 1 year ago

Are the columns numeric, or do you expect them to contain strings?

mkarmona commented 1 year ago

All numeric

0   -4247916474242508806    111669168462    122270047432111156  7558004278719340179 4   0   508 22987   0   68  3   3
0   -4247916474242508806    120259094189    -4247916474242508806    2024546716798971474 2   0   508 841 0   9   2   2
1   3321359613095714626 34359742828 3321359613095714626 1021329052062355964 6   0   161 15561   1   10  1   1
1   3321359613095714626 94489307989 4459994629667120731 2024546716798971474 1   0   161 100 1   0   1   1
LucaCappelletti94 commented 1 year ago

Could you also provide an example of your node list?

mkarmona commented 1 year ago

sure! node id and node class id

42949702932 -4247916474242508806
333 -4247916474242508806
120259105872    122270047432111156
34359758249 8082227106116270368
34359751343 -4247916474242508806
103079232639    4459994629667120731
4201    -4247916474242508806
85899365480 3321359613095714626
4772    122270047432111156
LucaCappelletti94 commented 1 year ago

Why are there negative values?

mkarmona commented 1 year ago

The node IDs come from a function that generates unique numbers at scale for a long list of nodes. The cardinality of the classes is small, so for those I use a numeric hash function, xxhash64.
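As a rough illustration of why such IDs can come out negative: a 64-bit hash stored in a signed 64-bit integer reads as negative whenever its top bit is set. The sketch below is only an assumption about the mechanism, not the actual pipeline. `to_signed64` and `class_id` are hypothetical helper names, and BLAKE2b with an 8-byte digest stands in for xxhash64, which is not in the Python standard library.

```python
import hashlib


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed (two's complement) one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set (>= 2**63) wrap around to negative numbers.
    return value - 2**64 if value >= 2**63 else value


def class_id(label: str) -> int:
    """Hypothetical helper: derive a signed 64-bit id from a class label.

    BLAKE2b with an 8-byte digest is a stand-in for xxhash64 here.
    """
    digest = hashlib.blake2b(label.encode("utf-8"), digest_size=8).digest()
    return to_signed64(int.from_bytes(digest, "big"))
```

Roughly half of all 64-bit hash values have the top bit set, so negative class IDs are expected rather than a sign of corrupted data.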

zommiommy commented 1 year ago

I've implemented from_pd and this is an example of the usage:

import pandas as pd
from grape import Graph

nodes_df = pd.DataFrame(
    [("a", "user"), ("b", "user"), ("c", "product")],
    columns=["name", "type"],
)

edges_df = pd.DataFrame(
    [("a", "b", 1.0, "knows"), ("b", "c", 2.0, "bought")],
    columns=["subject", "object", "weight", "predicate"],
)

graph = Graph.from_pd(
    edges_df,
    nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    edge_weight_column="weight",
    edge_type_column="predicate",
    directed=True,
    name="graph",
)

Would this be ok? We are still debugging it, but if it's ok we can publish a new version soon.

mkarmona commented 1 year ago

@LucaCappelletti94 thanks for your prompt reply and implementation! Can I assume column types are automatically extracted from the Pandas dataframe? If so, then this parametrised interface works indeed. Happy to check on my side as soon as the version is out.

LucaCappelletti94 commented 1 year ago

Hi @mkarmona - What do you mean by column types? If you are referring to the data type of the node IDs, you seem to be using an i64. That cannot be used as the data type for a dense numeric range, as we compile ensmallen to use a u32. A sparse range that includes negative values forces us to cast these node IDs to strings. For numeric node IDs to be used as such, they would need to form a dense non-negative range, from 0 to the number of nodes.
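To make the "dense range" requirement concrete, here is a minimal sketch of remapping sparse IDs to 0..n-1 before building a graph. It is plain Python for illustration only and does not show what ensmallen does internally. `densify` is a hypothetical helper, and the sample IDs are taken from the node list posted above.

```python
def densify(ids):
    """Map arbitrary hashable ids to 0..n-1 in order of first appearance."""
    mapping = {}
    for node_id in ids:
        if node_id not in mapping:
            mapping[node_id] = len(mapping)
    return mapping


# Sparse 64-bit ids, including one repeated id.
sparse_ids = [42949702932, 333, 120259105872, 34359758249, 333]

mapping = densify(sparse_ids)
dense_ids = [mapping[node_id] for node_id in sparse_ids]

# The raw ids overflow a u32, but the dense ids fit as long as there are
# at most 2**32 distinct nodes.
U32_MAX = 2**32 - 1
raw_fits = max(sparse_ids) <= U32_MAX
dense_fits = len(mapping) - 1 <= U32_MAX
```

Keeping `mapping` around allows translating results back to the original IDs afterwards.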

zommiommy commented 1 year ago

Yeah, in the current implementation everything is treated as a string regardless of the type
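For anyone who wants to see the identifiers the way a string-based loader would, a small sketch of casting numeric columns to strings with pandas follows. The column names `name` and `type` mirror the example above, the two rows come from the node list posted earlier, and whether an explicit cast is needed before calling from_pd is an assumption that has not been checked here.

```python
import pandas as pd

# Two rows in the shape of the node list above: node id and node class id.
nodes_df = pd.DataFrame(
    [(42949702932, -4247916474242508806), (333, -4247916474242508806)],
    columns=["name", "type"],
)

# Cast every column to str, so each identifier becomes a plain string
# (the minus sign is kept as part of the text).
nodes_str = nodes_df.astype(str)

first_type = nodes_str["type"].iloc[0]
second_name = nodes_str["name"].iloc[1]
```

Since the values end up as strings either way, negative or sparse IDs are no obstacle on this path.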

hoktay commented 11 months ago

Just curious if this made it to a release yet ;)

LucaCappelletti94 commented 11 months ago

On Linux and macOS yes, but not on Windows.

mkarmona commented 9 months ago

@LucaCappelletti94 @zommiommy thanks a lot for implementing this feature. I can confirm it works for me. This issue is done so please close it as you wish.