AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations
MIT License

Question concerning node type loading from csv #33

Closed. 13bmartens closed this issue 1 year ago

13bmartens commented 1 year ago

Hi Team, thank you for the awesome library!

I am trying to import a very basic dataset for a POC and am struggling with the from_csv method.

I want to construct a graph using my own data:

import polars as pl

pl.DataFrame({
    'source': ['A', 'A', 'A', 'A', 'A', 'F', 'F', 'F', 'A', 'F'],
    'destination': ['B', 'C', 'D', 'E' , 'F', 'G', 'H', 'I', 'J', 'J'],
}
).write_csv('edges.csv')

pl.DataFrame({
    'node_type': ['link', 'sat', 'sat', 'sat' , 'sat', 'link', 'sat', 'sat', 'sat', 'sat'],
}
).write_csv('node_types.csv')

pl.DataFrame({
    'node_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']}
).write_csv('node_names.csv')

This results in three CSV files, each with a header.

I am then constructing the graph using the following snippet:

from grape import Graph

graph = Graph.from_csv(
    #Edges
    edge_path="edges.csv",
    sources_column="source",
    destinations_column="destination",
    edge_list_header=True,
    #Nodes
    node_path = "node_names.csv",
    nodes_column = "node_name",
    node_list_header  = True,
    #Node Types
    node_type_path = "node_types.csv",
    node_types_column = "node_type",
    node_type_list_header = True,
    skip_node_types_if_unavailable = False,
    directed = False
)

When I run graph.get_node_type_names() I get the error:

ValueError                                Traceback (most recent call last)
Cell In [6], line 1
----> 1 graph.get_unique_node_type_ids()

ValueError: The current graph instance does not have node types.

Anything I am doing wrong?

Thanks for the time!

LucaCappelletti94 commented 1 year ago

Hi @13bmartens! The nodes file should have a column that assigns the node types to each node, something like:

node_a,node type of node a
node_b,first node type of node b|second node type of node b

Note the possibility of providing multiple node types using the | in this example.
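To make the layout concrete, here is a minimal sketch in plain Python (it does not call GRAPE) of how a header-less nodes file with `|`-separated node types can be read. The function name `parse_node_types` and the `separator` argument are illustrative assumptions, not part of the library.

```python
import csv
import io

# Hypothetical nodes CSV without a header: node name, then node types joined by "|".
NODES_CSV = (
    "node_a,node type of node a\n"
    "node_b,first node type of node b|second node type of node b\n"
)


def parse_node_types(text, separator="|"):
    """Map each node name to the list of its node type names."""
    node_types = {}
    for node_name, raw_types in csv.reader(io.StringIO(text)):
        # A node may carry several types, separated by the given character.
        node_types[node_name] = raw_types.split(separator)
    return node_types


parsed = parse_node_types(NODES_CSV)
for name, types in parsed.items():
    print(name, types)
```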

The node types file should only contain the unique node types, and its primary use is that you can specify the node types numerically in the nodes CSV, so in the aforementioned example you could write:

node_a,0
node_b,1|2

and the associated node types file would look like this:

node type of node a
first node type of node b
second node type of node b
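The relationship between the two layouts can be sketched in plain Python as follows. This is only an illustration of the mapping, assuming IDs are assigned in order of first appearance; the helper name `to_numeric_node_types` is hypothetical and nothing here calls GRAPE itself.

```python
def to_numeric_node_types(rows, separator="|"):
    """Convert (node_name, [type names]) rows to numeric type IDs.

    Returns the rewritten rows plus the list of unique node type names,
    where the position of each name in the list is its numeric ID.
    """
    type_ids = {}  # node type name -> numeric ID, in first-seen order
    numeric_rows = []
    for node_name, types in rows:
        ids = [type_ids.setdefault(t, len(type_ids)) for t in types]
        numeric_rows.append((node_name, separator.join(str(i) for i in ids)))
    return numeric_rows, list(type_ids)


rows = [
    ("node_a", ["node type of node a"]),
    ("node_b", ["first node type of node b", "second node type of node b"]),
]
numeric_rows, unique_types = to_numeric_node_types(rows)

# Lines of the nodes CSV with numeric node type IDs.
for node_name, ids in numeric_rows:
    print(f"{node_name},{ids}")

# Lines of the node types file, one unique node type per line.
for node_type in unique_types:
    print(node_type)
```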

Using numeric node type IDs with the associated node types file is preferable but not necessary. It makes the CSVs smaller and more compressible (so you can move the data a bit more easily), and loading is much faster because we can make more assumptions about the data being loaded. Moreover, a smaller file means less data to read, which reduces the IO bottleneck.

I hope this answers your question.

LucaCappelletti94 commented 1 year ago

I have added a check to the CSV reader that raises an error when a node type (or edge type) file path is provided but no parameter binding the node types to the node list is provided. In a future version, if you parametrize the loader in this incomplete way, you will receive an explanation of how to correct the parametrization.
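As a rough illustration of the kind of check described, and not the library's actual code, a validation along these lines could look like the sketch below. The function name `check_node_type_parameters` and the error message are hypothetical, and the sketch simplifies by considering only `node_list_node_types_column` as the binding parameter.

```python
def check_node_type_parameters(node_type_path=None, node_list_node_types_column=None):
    """Raise if a node types file is given but nothing binds types to nodes.

    Hypothetical, simplified check: the real loader may accept other
    parameters that bind node types to the node list.
    """
    if node_type_path is not None and node_list_node_types_column is None:
        raise ValueError(
            "A node types file was provided, but the node list has no "
            "node types column: set `node_list_node_types_column` so that "
            "each node is assigned its node types."
        )


# The incomplete parametrization from the original question triggers the error.
try:
    check_node_type_parameters(node_type_path="node_types.csv")
except ValueError as error:
    print(error)

# The corrected parametrization passes silently.
check_node_type_parameters(
    node_type_path="node_types.csv",
    node_list_node_types_column="node_type",
)
```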

13bmartens commented 1 year ago

Thank you for the quick reply @LucaCappelletti94!

I got the example working using your input:

import polars as pl
from grape import Graph

pl.DataFrame({
    'source': ['A', 'A', 'A', 'A', 'A', 'F', 'F', 'F', 'A', 'F'],
    'destination': ['B', 'C', 'D', 'E' , 'F', 'G', 'H', 'I', 'J', 'J'],
}
).write_csv('edges.csv')

pl.DataFrame({
    'node_type': ['link', 'sat'],
}
).write_csv('node_types.csv')

pl.DataFrame({
    'node_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'node_type': ['link', 'sat', 'sat', 'sat' , 'sat', 'link', 'sat', 'sat', 'sat', 'sat'],
}
).write_csv('node_names.csv')

graph = Graph.from_csv(
    #Edges
    edge_path="edges.csv",
    sources_column="source",
    destinations_column="destination",
    edge_list_header=True,
    #Nodes
    node_path = "node_names.csv",
    nodes_column = "node_name",
    node_list_header  = True,
    node_list_node_types_column = "node_type",
    #Node Types
    node_type_path = "node_types.csv",
    node_types_column = "node_type",
    node_type_list_header = True,
    directed = False
)

LucaCappelletti94 commented 1 year ago

Happy to hear that! I will close the issue, then. Feel free to re-open it if you run into related problems again. I am also available on the GRAPE Discord server.