a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Loading 300,000 graphs that don't fit into memory #193

Closed johnnytam100 closed 2 years ago

johnnytam100 commented 2 years ago

Hi @a-r-j, and thanks for your help as always! I am trying to load ~300,000 graphein protein graphs with pickle.load and then call model.fit() from karateclub, like this:

import glob
import pickle

import networkx as nx
from karateclub import FeatherGraph

# Collect the pickled graph files in sorted order
filepath_list = sorted(glob.glob('./*p'))

graph_list = []

for graph_path in filepath_list:

    # Load graph

    print("Loading...", graph_path)

    with open(graph_path, 'rb') as f:  # open in binary read mode
        g_load = pickle.load(f)

    # Convert graph index to integer (required by karateclub)
    g_load_reindex = nx.convert_node_labels_to_integers(g_load)

    graph_list.append(g_load_reindex)

# Fit
model = FeatherGraph()
print("Fitting model...")
model.fit(graph_list)

However, the full list of graphs doesn't fit into memory. Do you know a smarter way to feed such a large number of protein graphs to machine learning models? I would be grateful for any hints. Thank you!

linminhtoo commented 2 years ago

What does model.fit() do under the hood?

If it is running SGD (stochastic gradient descent) on batches of graphs, you don't need to load all the graphs into memory at once; you only need to load one batch at a time. However, this will require you to modify the code of model.fit().
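
The batch-by-batch loading described above could look roughly like the following. This is only a minimal sketch: `iter_batches` and `batch_size` are hypothetical names, not part of karateclub or graphein, and it assumes each file holds one pickled graph, as in the original post.

```python
import pickle
from itertools import islice
from typing import Any, Iterable, Iterator, List


def iter_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[Any]]:
    """Lazily yield lists of unpickled objects, at most `batch_size` per list.

    Only one batch is held in memory at a time (hypothetical helper).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    path_iter = iter(paths)
    while True:
        # Take the next `batch_size` paths without materialising the rest.
        chunk = list(islice(path_iter, batch_size))
        if not chunk:
            return
        batch = []
        for path in chunk:
            with open(path, "rb") as f:
                batch.append(pickle.load(f))
        yield batch
```

A training loop would then iterate with `for batch in iter_batches(filepath_list, 1000): ...`, letting each batch be garbage-collected before the next one is loaded.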

a-r-j commented 2 years ago

Hi @johnnytam100, I had a quick look at the FeatherGraph model. From a cursory glance, it doesn't appear to have any learnable parameters, so I think you can simply load batches of your graphs into memory and compute the embeddings batch by batch. You can probably parallelize this too for greater speed.

You may find some inspiration in the ProteinGraphDataset class, which does this (but only for the PyTorch ecosystem):

https://github.com/a-r-j/graphein/blob/e80b7d4ff47c30cab4c7a06e91c9897507f9fb7a/graphein/ml/datasets/torch_geometric_dataset.py#L478
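
Outside the PyTorch ecosystem, the batch-wise embedding suggested above might be sketched as follows. It rests on the assumption, taken from the cursory reading above and not verified here, that FeatherGraph embeds each graph independently of the others, so fitting per batch gives the same rows as fitting on everything at once. `embed_in_batches` and `feather_embed` are hypothetical helper names; only `FeatherGraph().fit()`, `get_embedding()` and `nx.convert_node_labels_to_integers` come from the libraries themselves.

```python
import pickle
from typing import Any, Callable, List, Sequence


def feather_embed(graphs: Sequence[Any]):
    """Embed one batch with karateclub's FeatherGraph (imported lazily)."""
    import networkx as nx
    from karateclub import FeatherGraph

    # karateclub expects integer node labels, as in the original post.
    graphs = [nx.convert_node_labels_to_integers(g) for g in graphs]
    model = FeatherGraph()
    model.fit(graphs)
    return model.get_embedding()  # one row per graph in the batch


def embed_in_batches(
    paths: Sequence[str],
    embed_batch: Callable[[List[Any]], Any],
    batch_size: int = 1000,
) -> List[Any]:
    """Load `batch_size` pickled graphs at a time and collect their embeddings."""
    rows: List[Any] = []
    for start in range(0, len(paths), batch_size):
        graphs = []
        for path in paths[start:start + batch_size]:
            with open(path, "rb") as f:
                graphs.append(pickle.load(f))
        # Only the embedding rows are kept; the graphs are dropped after each batch.
        rows.extend(embed_batch(graphs))
    return rows
```

With the file list from the original post, this would be called as `embeddings = embed_in_batches(filepath_list, feather_embed)`. Because the batches are independent under the assumption above, they could also be handed to worker processes, for example with `concurrent.futures.ProcessPoolExecutor`.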

johnnytam100 commented 2 years ago

Thank you so much for the advice! 🙇🏻‍♂️