AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations
MIT License
545 stars 39 forks

GRAPE on Heterogenous graphs #42

Open mvisani opened 1 year ago

mvisani commented 1 year ago

Hi !

I am a beginner in GNNs and saw your repo. It seems that it could work for my problem, but I just need to be sure. My goal is to try to predict the chemical composition of organisms across the tree of life. I have a CSV file that is similar to this example:

| molecules | species | papers | mol_pathway | mol_sub_pathway | species_domain | species_family |
|---|---|---|---|---|---|---|
| H20 | Homo sapiens | 14 | Terpenoids | Monoterpenoids | Eukaryotes | Hominidae |

So each row holds a unique molecule-species pair (I'm thinking that would be the edge between 2 nodes of different types, hence the heterogeneous graph), the number of papers that have actually found the molecule in that species (edge weight?), and then some information about the molecule and the species.

In this database there are 2 things we know: how species are related (classic phylogenetic tree) and how molecules are related (the group-subgroup structure seen above).

One fair assumption is that closely related species may share a similar set of molecules and molecules related in their synthesis may share a similar distribution across species. What I would like to have as a result is a matrix of s (species) by m (molecules) of probabilities that tell me if the edge between that molecule and that species could exist.
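As a rough illustration of the layout described above, here is a minimal sketch that splits such a table into a typed node list and a weighted edge list. It assumes a tab-separated file with the column names from the example. The helper name `split_into_lists` is made up for this sketch and is not part of GRAPE.

```python
import csv
import io

# Toy stand-in for the CSV described above (tab-separated by assumption).
RAW = (
    "molecules\tspecies\tpapers\tmol_pathway\tmol_sub_pathway\t"
    "species_domain\tspecies_family\n"
    "H20\tHomo sapiens\t14\tTerpenoids\tMonoterpenoids\tEukaryotes\tHominidae\n"
)


def split_into_lists(raw):
    """Return (nodes, edges).

    nodes maps each node name to its node type;
    edges is a list of (molecule, species, weight) tuples,
    with the paper count used as the edge weight.
    """
    nodes = {}
    edges = []
    for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
        nodes[row["molecules"]] = "molecule"
        nodes[row["species"]] = "species"
        edges.append((row["molecules"], row["species"], float(row["papers"])))
    return nodes, edges


nodes, edges = split_into_lists(RAW)
```

The sketch assumes that no molecule shares its name with a species. If that can happen, the names would need a prefix per type.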

My questions are :

Sorry if those are very rookie questions, and thanks in advance for the reply! :)

LucaCappelletti94 commented 1 year ago

So for starters:

  1. Graphs with multiple edge types and node types and multiple edges between any two given nodes are supported.
  2. [Here you can find a tutorial on how to load graphs from CSVs](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb), though admittedly it does not contain an example with heterogeneous edges - I will add it as soon as I have time. The docstring of the `from_csv` method may be already enough, though.
  3. That is really something that depends on the embedding method you intend to use and what those features represent. If you already have features for each node, you may not need an embedding in the first place: why do you want an embedding?

Is the task modelled as an edge prediction task? We can do a call next week on the discord channel, just ping me and we can plan it.

Luca

mvisani commented 1 year ago

> So for starters:
>
> 1. Graphs with multiple edge types and node types and multiple edges between any two given nodes are supported.
>
> 2. [Here you can find a tutorial on how to load graphs from CSVs](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb), though admittedly it does not contain an example with heterogeneous edges - I will add it as soon as I have time. The docstring of the `from_csv` method may be already enough, though.
>
> 3. That is really something that depends on the embedding method you intend to use and what those features represent. If you already have features for each node, you may not need an embedding in the first place: why do you want an embedding?
>
> Is the task modelled as an edge prediction task? We can do a call next week on the discord channel, just ping me and we can plan it.
>
> Luca

Hey ! Thanks for the reply !

Yes, the task I am trying to achieve is an edge prediction task between a node of type molecule and one of type species (undirected). So far I was able to create a graph with the library. However, if I try to use an edge prediction task of your library and specify `node_features=[embedding_mol, embedding_species]`, I get this error:

ValueError: The provided node features have 48599 rows but the provided graph Lotus has 71410 nodes. Maybe these features refer to another version of the graph or another graph entirely?

Which makes sense, since I have 48599 nodes of type "molecule" and the rest are of type "species". I'm thinking the 2 dataframes should be merged, but I am wondering if there is another solution, since the number of features is not the same for species and molecules.

So to answer your question, maybe I don't need embedding but then how do I add features to each node ?
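For illustration only, here is one way the "merge" idea could look: a single feature matrix with one row per node, in graph node order, where each node type owns its own block of columns and the other blocks stay at zero. The thread does not record which features were eventually used, so the zero-padded layout and the helper name `stack_node_features` are assumptions of this sketch. Whether such padding suits a given model is a separate question.

```python
def stack_node_features(node_names, node_types, features_by_type):
    """Build one feature row per node, following the order of node_names.

    features_by_type maps a node type to {node name: feature vector}.
    Each type gets its own block of columns; a node's row is zero
    everywhere outside the block of its own type.
    """
    # Width of each type's block, read from any one of its vectors.
    widths = {
        node_type: len(next(iter(vectors.values())))
        for node_type, vectors in features_by_type.items()
    }
    # Column offset where each type's block starts.
    offsets, total = {}, 0
    for node_type in sorted(widths):
        offsets[node_type] = total
        total += widths[node_type]

    matrix = []
    for name, node_type in zip(node_names, node_types):
        row = [0.0] * total
        vector = features_by_type[node_type][name]
        start = offsets[node_type]
        row[start:start + len(vector)] = vector
        matrix.append(row)
    return matrix


# Toy data: 3 molecule features, 2 species features.
node_names = ["H20", "Homo sapiens"]
node_types = ["molecule", "species"]
features = {
    "molecule": {"H20": [0.1, 0.2, 0.3]},
    "species": {"Homo sapiens": [1.0, 2.0]},
}
matrix = stack_node_features(node_names, node_types, features)
```

The result has as many rows as there are nodes, which is the condition the error message checks for.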

I'd be glad if we could have a call to have a better understanding of all this.

Thanks again!

Marco

LucaCappelletti94 commented 1 year ago

I am not sure how you expect a model to ingest such features - could you please describe how you would expect the model of your choosing to work?

LucaCappelletti94 commented 1 year ago

I am on discord if you'd like to have a call.

Filco306 commented 1 year ago

I have a similar issue; I am trying to load an undirected graph with edge types and weights; my csv that I generate looks like this:

head,relation,tail,weight
113091,14,412357,0.7917595
560244,14,1164306,0.7917595
388246,14,1121544,0.7917595
1102500,14,1142585,0.7917595
590896,14,661190,0.7917595
422681,14,501152,0.7917595
754343,14,1105352,0.7917595
639287,14,859151,0.7917595
270949,14,995611,0.7917595

I use the following snippet, adjusted from one of your tutorials:

import grape

graph_ = grape.Graph.from_csv(
        # Edge-related parameters

        ## The path to the edge list CSV
        edge_path="companykg.csv",
        ## Use the comma as the separator between values
        edge_list_separator=",",
        ## The first row holds the column names
        edge_list_header=True,
        ## The source nodes are in the first column
        sources_column_number=0,
        ## The destination nodes are in the third column
        destinations_column_number=2,
        ## Both source and destination columns use numeric node IDs instead of node names
        edge_list_numeric_node_ids=True,
        ## The weights are in the fourth column
        weights_column_number=3,
        ## The path passed as the edge type file (here the same file as the edge list)
        edge_type_path="companykg.csv",
        ## The column with the edge types in the edge type file
        edge_types_column="relation",
        # Node-related parameters (unused here)
        ## The path to the node list tsv
        # node_path=None,
        ## Set the tab as the separator between values
        # node_list_separator="\t",
        ## The first row should be used as the column names
        # node_list_header=True,
        ## The column with the node names is the one with name "node_name".
        # nodes_column="node_name",

        # Graph-related parameters
        ## The graph is undirected
        directed=False,
        ## The name of the graph is CompanyKG
        name="CompanyKG",
        ## Display a progress bar (this might be in the terminal and not in the notebook)
        verbose=True,
    )

However, the edge types do not seem to get loaded:

>>> graph_.get_undirected_edge_type_ids()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [27], in <cell line: 1>()
----> 1 graph_.get_undirected_edge_type_ids()

ValueError: The current graph instance does not have edge types.

Am I doing something wrong?

Thank you! :)

LucaCappelletti94 commented 1 year ago

Hi @Filco306, the two issues are very different in nature. In the former one, @mvisani was providing features that did not map to the nodes of the graph. I helped him design other features that did. Your issue is that you are loading an edge-type file, but you are not loading the edge types column from the edge path. Possibly I should raise an error when encountering these configurations, although they are not wrong per se - with the parametrization you are using, you are setting the vocabulary of edge types of the graph, but you are not loading the column of the edge types from the edge list. Most likely edge_type_path="companykg.csv", is not what you want.

LucaCappelletti94 commented 1 year ago

Weird, I see that I have already added extensive errors for this type of parametrization - which version of Ensmallen are you using? Which OS?

Filco306 commented 1 year ago

Interesting!

Version:

>>> grape.print_version()
{'GRAPE Version': '0.1.29', 'Python version': '3.10.6', 'Platform': 'Linux-5.4.0-150-generic-x86_64-with-glibc2.31', 'Threads number': 48, 'PyTorch version': '1.13.0', 'PyKEEN version': '1.9.0'}

Ensmallen version when I do pip freeze is ensmallen==0.8.36. I use Ubuntu 20.04.

Should I update the ensmallen package? Or how should I solve it? :)

Thank you for your quick reply!

LucaCappelletti94 commented 1 year ago

The latest version of Ensmallen is 0.8.65, could you upgrade please?

Filco306 commented 1 year ago

Ah. Upgraded now, and now I get

ValueError: The path to the edge type file (not the edge list!) was provided and is `Some("companykg.csv")`, but you did not provide either `edge_list_edge_types_column` or `edge_list_edge_types_column_number` so to specify which column in the edge list should be loaded. Do note that the file provided to the edge type path should contain the UNIQUE edge types, and not the edge type for each edge. The edge type file is primarily used to ensure all edge types in the edge list are known before starting to process the edge list itself, which allows for additional assumptions and therefore significantly faster processing.

LucaCappelletti94 commented 1 year ago

Ok, which is what I was expecting you to get before. I hope the error I wrote contains sufficient information for you to correct the parametrization, if it remains unclear please do let me know so I can improve the error for the next version of Ensmallen.
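The error message above says the edge type file should list the UNIQUE edge types rather than one type per edge. As a minimal sketch of what that list would contain, the snippet below collects the distinct values of the `relation` column from a two-row excerpt of the edge list shown earlier. The helper name `unique_edge_types` is made up for illustration.

```python
import csv
import io

# Two-row excerpt of the edge list shown earlier in the thread.
EDGE_LIST = (
    "head,relation,tail,weight\n"
    "113091,14,412357,0.7917595\n"
    "560244,14,1164306,0.7917595\n"
)


def unique_edge_types(raw, column="relation"):
    """Return the distinct values of the edge type column, in first-seen order."""
    seen = {}
    for row in csv.DictReader(io.StringIO(raw)):
        seen.setdefault(row[column], None)
    return list(seen)


edge_types = unique_edge_types(EDGE_LIST)
```

Written one per line, these values are what a separate edge type file would hold. Passing the full edge list as that file, as in the original call, does not match this expectation.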

Filco306 commented 1 year ago

Ah, I had to change one parameter, and now it works! However, loading the excellent analysis takes waaaaaay longer; I suppose it takes longer given certain analyses are run that weren't run before?

Thank you for a great package either way! I love it!
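The thread does not say which parameter was changed. One reading consistent with the error message is to drop the edge type file and name the edge type column of the edge list through `edge_list_edge_types_column`, the parameter quoted in the error. The sketch below keeps the arguments in a plain dictionary so they can be inspected without GRAPE installed. It has not been checked against a specific Ensmallen version.

```python
# Keyword arguments for grape.Graph.from_csv, copied from the snippet earlier
# in this thread. Assumed change: `edge_type_path` and `edge_types_column`
# are dropped, and the edge type column of the edge list is named instead.
from_csv_kwargs = dict(
    edge_path="companykg.csv",
    edge_list_separator=",",
    edge_list_header=True,
    sources_column_number=0,
    destinations_column_number=2,
    edge_list_numeric_node_ids=True,
    weights_column_number=3,
    # Parameter name taken from the error message quoted in the thread.
    edge_list_edge_types_column="relation",
    directed=False,
    name="CompanyKG",
    verbose=True,
)


def load_graph(kwargs=from_csv_kwargs):
    """Load the graph; grape is imported lazily so the dict works without it."""
    import grape

    return grape.Graph.from_csv(**kwargs)
```

With GRAPE installed, `graph_ = load_graph()` would perform the actual load.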

LucaCappelletti94 commented 1 year ago

As you have now included the edge types, it will incorporate them in the analysis. Most likely, the slowest new step is the isomorphic edge types detection step, which is similar in nature to the same thing I do for the nodes. At some point, I will make a faster version.

Filco306 commented 1 year ago

Okay! Is there a way to turn off certain analyses such as that one prior to starting the loading of the analysis and leaving in the rest?

LucaCappelletti94 commented 1 year ago

Hi @Filco306 - I am not sure I have understood your question, could you kindly expand upon it?

Filco306 commented 1 year ago

Yes absolutely! Is there a way to turn off/skip specifically the isomorphic edge detection step for the analysis summary, so that step is skipped and it finishes faster? That way, I could also test whether it is that step that is the issue.

LucaCappelletti94 commented 1 year ago

Currently, it runs the complete set of analyses for the whole graph as you have loaded it. In the default analysis that you get when you display the graph object, you cannot provide any parameter, and I tried to make it generally decently fast.

If you are now experiencing long runtimes after having loaded the edge types, it is likely that it is the isomorphic edge types analysis that is slow, meaning it is finding many near-isomorphic candidates. Note that this is not the analysis on isomorphic edges, which is another thing entirely.

If you do not load the edge types, does the analysis complete much faster?

Filco306 commented 1 year ago

Sorry for my late reply. The analysis still runs fairly slow with the update. I will return with a timing on both!
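A minimal sketch of how the two timings could be collected with the standard library. It assumes the report is computed when the graph object is displayed, as described above, so wrapping `repr` is an assumption rather than documented behaviour. The names `graph_with_edge_types` and `graph_without_edge_types` in the comment are hypothetical.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the snippet runs on its own. With real graphs this
# would be something like:
#   _, with_types = time_call(repr, graph_with_edge_types)
#   _, without_types = time_call(repr, graph_without_edge_types)
result, elapsed = time_call(sum, range(1000))
```

Comparing the two elapsed values would show whether loading the edge types accounts for the slowdown.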