Graph-Learning-Benchmarks / gli

🗂 Graph Learning Indexer: a contributor-friendly and metadata-rich platform for graph learning benchmarks. Dataloading, Benchmarking, Tagging, and more!
https://graph-learning-benchmarks.github.io/gli/
MIT License

[FEATURE REQUEST] GRAPE integration #407

Open LucaCappelletti94 opened 1 year ago

LucaCappelletti94 commented 1 year ago

Hi! I am one of the creators of the 🍇 grape 🍇 library. We have 80K+ graphs in the automatic retrieval portion of the library, most from real-world biomedical papers. We also provide efficient methods to run graph holdouts and evaluate edge predictions, node-label predictions and edge-label predictions.

I'd be happy to help integrate these into your work and boost the possible datasets your users may want to use.

Best, Luca

jiaqima commented 1 year ago

Hi Luca, thank you so much for your interest. It would be fantastic to have GRAPE integrated into GLI. We are more than happy to provide any help to make that happen. I'll send you an email to follow up on more details.

LucaCappelletti94 commented 1 year ago

As mentioned earlier in the email, here are a few examples of things you can trivially do with 🍇 once you install it with pip install grape:

Retrievable graphs

One feature of GRAPE is quickly retrieving graphs and knowledge graphs. This allows users to easily access and use pre-existing KGs in their analyses rather than manually retrieving or building them from scratch. These graphs may come from various sources and can be used for multiple purposes, including machine learning, data visualization, and network analysis.

You can get the list of all retrievable graphs by running the following:

from grape.datasets import get_all_available_graphs_dataframe

df = get_all_available_graphs_dataframe()

This returns a pandas DataFrame listing all the retrievable graphs.

How to retrieve a generic graph

Here follows how you can retrieve any graph programmatically, given the repository, graph and version:

from grape.datasets import get_dataset

graph_class = get_dataset(
    graph_name="KGCOVID19",
    repository="kghub",
    version="20221102"
)
graph_instance = graph_class()

How to load a graph from CSVs

To load a custom graph from CSVs, please refer to the extended tutorial, as there are a lot of parameters.

How to retrieve a specific graph

Suppose you want to retrieve the same graph I retrieved above with get_dataset, but more cleanly. You can import its class directly:

from grape.datasets.kghub import KGCOVID19

kgcovid = KGCOVID19()

Graph holdouts

Graph holdouts are techniques for partitioning a graph into training and test sets to evaluate machine learning algorithms or other methods for analyzing graphs. In GRAPE, several types of graph holdouts are available, including connected holdouts, random holdouts, and k-folds. All of these methods can also be applied to specific edge types. These holdouts are implemented in Rust to provide the fastest possible user experience and use techniques such as copy-on-write (COW) to avoid duplicating shared memory structures and allow for efficient holdouts on large graphs.

Some features are only available in specific holdouts, but if there is interest, I may spend time implementing them across different methods.

Connected holdouts

Connected holdouts involve dividing the graph into a training graph and a test graph, where the training graph has the same connected components as the whole graph. This is necessary when a graph makes a closed-world assumption, that is, it assumes that there are no edges between connected components.

train, test = graph.connected_holdout(
    train_size= 0.8,
    random_state = 45,
    edge_types = ["an edge type of interest", "and another one"],
    include_all_edge_types = True, # To avoid biases in multigraphs, but it depends on the task
    minimum_node_degree = 5, # sometimes evaluating predictions on low-degree nodes is not interesting 
    maximum_node_degree = 100, # sometimes evaluating predictions on high-degree nodes is not interesting
    verbose = True, # whether to show a loading bar
)
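To illustrate the idea independently of GRAPE, here is a minimal pure-Python sketch of a connected holdout. It is not GRAPE's Rust implementation: the function names connected_holdout and count_components are hypothetical, and the approach (keep a random spanning forest in the training set, draw test edges only from the remaining ones) is one simple way to preserve the components, assumed here for illustration.

```python
import random


def connected_holdout(edges, train_size=0.8, random_state=45):
    """Split undirected edges so the training graph keeps the original components."""
    rng = random.Random(random_state)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Edges of a random spanning forest must stay in the training set.
    forest, rest = [], []
    for src, dst in shuffled:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
            forest.append((src, dst))
        else:
            rest.append((src, dst))

    # Only non-forest edges are eligible for the test set.
    n_test = min(len(rest), round(len(shuffled) * (1 - train_size)))
    test = rest[:n_test]
    train = forest + rest[n_test:]
    return train, test


def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adjacency = {node: set() for node in nodes}
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    seen, components = set(), 0
    for node in nodes:
        if node in seen:
            continue
        components += 1
        stack = [node]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(adjacency[current] - seen)
    return components


# Toy graph: a triangle (0, 1, 2) and a complete graph on (3, 4, 5, 6).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
toy_train, toy_test = connected_holdout(toy_edges)
```

Note that the requested train_size is only reached when enough non-forest edges exist; on a tree, every edge is needed for connectivity and the test set would be empty.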

Random holdout

Random holdouts involve randomly selecting a portion of the graph for the test set. This may create multiple components in the training graph. Whether this is a plus or a problem depends on how the graph was modelled and the evaluation's assumptions.

train, test = graph.random_holdout(
    train_size= 0.8,
    random_state = 45,
    edge_types = ["an edge type of interest", "and another one"],
    include_all_edge_types = True, # To avoid biases in multigraphs, but it depends on the task
    min_number_overlaps = 2, # Number of multigraph edges necessary to select this edge for the test set.
    verbose = True, # whether to show a loading bar
)
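The effect on connectivity can be seen on a toy example. The following is a library-independent sketch, not GRAPE code: random_holdout and count_components are hypothetical helpers, and the path graph is chosen because every one of its edges is a bridge, so any held-out edge splits the training graph.

```python
import random


def random_holdout(edges, train_size=0.8, random_state=45):
    """Shuffle the edges and cut them into a training and a test list."""
    rng = random.Random(random_state)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_size)
    return shuffled[:n_train], shuffled[n_train:]


def count_components(n_nodes, edges):
    """Count connected components over nodes 0..n_nodes-1 with union-find."""
    parent = list(range(n_nodes))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    components = n_nodes
    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
            components -= 1
    return components


# Path graph 0-1-2-...-10: a single component where every edge is a bridge.
path_edges = [(i, i + 1) for i in range(10)]
path_train, path_test = random_holdout(path_edges)
components_before = count_components(11, path_edges)
components_after = count_components(11, path_train)
```

Here the two held-out edges leave the training graph with three components instead of one, which is exactly the situation a connected holdout avoids.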

K-folds

K-folds involve dividing the graph into k equal-sized partitions, using one partition as the test set and the remaining k-1 partitions as the training set, and repeating the process k times.

train, test = graph.get_edge_prediction_kfold(
    k = 10, # Number of folds
    k_index = 3, # Index of the current fold
    edge_types = ["an edge type of interest", "and another one"],
    random_state = 42,
    verbose = True
)
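To make the roles of k and k_index concrete, here is a small pure-Python sketch of a k-fold split over an edge list. It does not call GRAPE; edge_kfold is a hypothetical helper, and the zero-based k_index is an assumption made for this sketch.

```python
import random


def edge_kfold(edges, k=10, k_index=3, random_state=42):
    """Return (train, test) for fold `k_index` of a `k`-fold split of `edges`."""
    if not 0 <= k_index < k:
        raise ValueError("k_index must be in [0, k)")
    rng = random.Random(random_state)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    # Fold i takes every k-th edge starting at position i,
    # so fold sizes differ by at most one edge.
    test = shuffled[k_index::k]
    train = [edge for position, edge in enumerate(shuffled) if position % k != k_index]
    return train, test


# The same random_state yields the same shuffle, so the k test sets partition the edges.
kfold_edges = [(i, i + 1) for i in range(25)]
folds = [edge_kfold(kfold_edges, k=5, k_index=i) for i in range(5)]
```

Because every call reuses the same seed, looping k_index over all folds tests each edge exactly once.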

Node-label and edge-label holdouts

In addition, GRAPE also includes methods for creating node-label and edge-label holdouts.

train, test = graph.get_node_label_holdout_graphs(
    train_size=0.8,
    use_stratification=True,
    random_state=45678,
)
train, test = graph.get_edge_label_holdout_graphs(
    train_size=0.8,
    use_stratification=True,
    random_state=45678,
)
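As a rough illustration of what stratification means here, the sketch below splits node ids so that each label keeps about the same train/test proportion. It is a simplified stand-in rather than GRAPE's behaviour: the GRAPE methods return graphs, whereas this hypothetical stratified_label_holdout helper returns plain lists of node ids, and the labels are made up.

```python
import random
from collections import defaultdict


def stratified_label_holdout(labels, train_size=0.8, random_state=45678):
    """Split node ids so each label keeps roughly `train_size` of its nodes in training."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for node, label in labels.items():
        by_label[label].append(node)

    train, test = [], []
    # Sorting the labels keeps the result reproducible for a given seed.
    for label in sorted(by_label):
        nodes = by_label[label]
        rng.shuffle(nodes)
        cut = round(len(nodes) * train_size)
        train.extend(nodes[:cut])
        test.extend(nodes[cut:])
    return train, test


# Made-up, imbalanced labels: 80 "gene" nodes and 20 "disease" nodes.
node_labels = {node: ("gene" if node < 80 else "disease") for node in range(100)}
label_train, label_test = stratified_label_holdout(node_labels)
```

Without stratification, a rare label such as "disease" could end up almost entirely in one of the two sets.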

Analogously, there are corresponding k-fold methods.

Models

GRAPE includes a variety of machine learning models for analyzing graphs and embeddings, as well as tools for creating custom models. These models include both those developed specifically for use in GRAPE and those from other libraries that have been integrated into the GRAPE interface. Some examples of models available in GRAPE include node embedding models such as DeepWalk and node2vec, edge embedding models such as struc2vec and HNE, and graph classification models such as GCN and GraphSAGE. Additionally, GRAPE provides an interface that allows users to wrap and use models from other libraries within the GRAPE environment. This allows users to easily incorporate a wide range of models into their analyses, and provides flexibility and customization options.

You can get the list of available models as such:

from grape import get_models_dataframe

models = get_models_dataframe()

This returns a table of the available models.

I hope this is a comprehensive first introduction, though you can find more information in the tutorials.