dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Instruction about preparing custom dataset for sampling #644

Closed. tengwei12315 closed this issue 3 years ago.

tengwei12315 commented 5 years ago

❓ Questions and Help

Hi, in the MXNet sampling code example, can I load my own dataset, as in the SSE example? If so, what should I do? Any answer will be appreciated.

zheng-da commented 5 years ago

I'm not sure what you mean. Our README gives quite a few examples: https://github.com/dmlc/dgl/blob/master/examples/mxnet/sampling/README.md

tengwei12315 commented 5 years ago

Thank you for your answer. I want to run my own dataset with MXNet's sampling models. What should I do?

tengwei12315 commented 5 years ago

There is another question: the NodeFlow and sampling tutorial shows a broken image. Is there a problem with this model?

VoVAllen commented 5 years ago

There's no problem with the model. The broken image comes from the doc building stage, and we will try to fix it soon.

zheng-da commented 5 years ago

Sorry, what is broken? The example code in the tutorial should work as long as you can load your data into a DGLGraph. Are you having trouble loading your dataset?

VoVAllen commented 5 years ago

I think @tengwei12315 is referring to the broken tutorial image at https://docs.dgl.ai/tutorials/models/index.html#training-on-giant-graphs.

tengwei12315 commented 5 years ago

> Sorry, what is broken? The example code in the tutorial should work as long as you can load your data into a DGLGraph. Are you having trouble loading your dataset?

How can I turn my dataset into a DGLGraph? Where can I find a tutorial or example?

zheng-da commented 5 years ago

You can convert your data into a NetworkX graph or a scipy sparse matrix and pass it to DGLGraph to construct one.
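
For reference, here is a minimal sketch of such a conversion using the utilities in current DGL releases (dgl.from_networkx and dgl.from_scipy); the toy graphs below are made up purely for illustration:

import dgl
import networkx as nx
import scipy.sparse as sp

# From a NetworkX graph; each undirected edge becomes two directed edges in DGL.
nx_g = nx.path_graph(4)
g1 = dgl.from_networkx(nx_g)

# From a scipy sparse adjacency matrix (rows are source nodes, columns are destinations).
adj = sp.coo_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 3])), shape=(4, 4))
g2 = dgl.from_scipy(adj)

print(g1.num_nodes(), g1.num_edges())  # 4 6
print(g2.num_nodes(), g2.num_edges())  # 4 3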

tengwei12315 commented 5 years ago

Hello, if my dataset has features, how should I feed them in? And if my dataset has no features, like Karate Club, how should I set the input features? Do you have any sample code? Thank you!

zheng-da commented 5 years ago

Unfortunately, I don't think we have example code for this. If your graph data has no node features, you can one-hot encode the node IDs and look up embeddings in an embedding matrix, just like many NLP tasks do. Similarly, if your dataset has categorical node attributes, you can also use one-hot encoding.
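
As a rough illustration of that idea (this is not code from DGL itself), one can index a learnable embedding matrix with the node IDs and train the embeddings together with the model:

import torch
import torch.nn as nn

num_nodes = 34      # e.g. the karate club graph
embed_size = 16

# One learnable vector per node; indexing by node ID plays the role of a
# one-hot encoding multiplied into the embedding matrix.
embed = nn.Embedding(num_nodes, embed_size)
features = embed(torch.arange(num_nodes))   # shape (num_nodes, embed_size)

# `features` can be fed to a GNN as input node features, and embed.parameters()
# should be added to the optimizer so the embeddings are learned.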

jermainewang commented 5 years ago

I found this is quite a common question. We should write a tutorial on how to prepare a custom dataset. This could be done together with our plan for data formats. @VoVAllen

CapsulE07 commented 4 years ago

Waiting for instructions on building a custom dataset.

mufeili commented 4 years ago

> Waiting for instructions on building a custom dataset.

What kind of scenario are you dealing with? If you simply have a graph with node features and you want to do node classification with sampling-based training, you can refer to this example. Basically, you only need a graph object and a tensor of node features.
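
As a small illustration of that last sentence (the graph and feature values here are placeholders), node features are typically attached to the graph through g.ndata:

import dgl
import torch

g = dgl.graph(([0, 1, 2], [1, 2, 3]))    # a toy 4-node graph
feat = torch.randn(g.num_nodes(), 32)    # your node features, shape (N, D)
g.ndata['feat'] = feat                   # retrieved later as g.ndata['feat']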

bhavaygg commented 3 years ago

@mufeili I am facing something similar: I have a number of NetworkX graphs with node and edge attributes, and I want to use them for graph classification. The current examples in the documentation only show how to download and use pre-existing datasets, not how to start from one's own list of NetworkX graphs.

mufeili commented 3 years ago

> @mufeili I am facing something similar: I have a number of NetworkX graphs with node and edge attributes, and I want to use them for graph classification. The current examples in the documentation only show how to download and use pre-existing datasets, not how to start from one's own list of NetworkX graphs.

You can create a list of DGLGraphs for your dataset from the NetworkX graphs with

# Assume networkx_graphs is a list of NetworkX graphs.
self.graphs = [dgl.from_networkx(nx_g) for nx_g in networkx_graphs]

For more details on loading node/edge attributes, see from_networkx.
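
For instance, assuming each NetworkX graph stores a numeric node attribute named 'feat' and an edge attribute named 'weight' (both names are placeholders), the conversion could look like:

import dgl
import networkx as nx
import torch

nx_g = nx.DiGraph()
nx_g.add_node(0, feat=torch.tensor([1.0, 0.0]))
nx_g.add_node(1, feat=torch.tensor([0.0, 1.0]))
nx_g.add_edge(0, 1, weight=torch.tensor(0.5))

# node_attrs/edge_attrs tell DGL which NetworkX attributes to copy into
# g.ndata and g.edata respectively.
g = dgl.from_networkx(nx_g, node_attrs=['feat'], edge_attrs=['weight'])
print(g.ndata['feat'])    # tensor of shape (2, 2)
print(g.edata['weight'])  # tensor of shape (1,)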

bhavaygg commented 3 years ago

@mufeili I tried to follow this guide to make a graph classifier. I have a list of torch data objects which I feed into the dataloader with dataloader = DataLoader(graphs, batch_size=1024, collate_fn=collate, drop_last=False, shuffle=True). Whether the graphs here are DGLGraphs or torch data objects, the dataloader shows num_samples = 0. Apart from using the dataloader, I don't know how to feed the data. This requires the data object to have feature, label, and mask attributes, which I am not sure how to assign.

mufeili commented 3 years ago

Have you checked the user guide on graph classification? How did you define collate?

bhavaygg commented 3 years ago

I fixed the error about the number of samples being 0 and other basic issues. I made my functions similar to the tutorial.

def collate(samples):
        graphs, labels = map(list, zip(*samples))
        batched_graph = dgl.batch(graphs)
        batched_labels = torch.tensor(labels)
        return batched_graph, batched_labels

dataloader = DataLoader(train_dataset,batch_size=1024,collate_fn=collate,drop_last=False,shuffle=True)

And the training loop

for epoch in range(20):
    for batched_graph, labels in dataloader:

But this raises AttributeError: 'MultiDiGraph' object has no attribute 'is_block'.

mufeili commented 3 years ago

> I fixed the error about the number of samples being 0 and other basic issues. I made my functions similar to the tutorial.
>
> def collate(samples):
>         graphs, labels = map(list, zip(*samples))
>         batched_graph = dgl.batch(graphs)
>         batched_labels = torch.tensor(labels)
>         return batched_graph, batched_labels
>
> dataloader = DataLoader(train_dataset,batch_size=1024,collate_fn=collate,drop_last=False,shuffle=True)
>
> And the training loop
>
> for epoch in range(20):
>     for batched_graph, labels in dataloader:
>
> But this raises AttributeError: 'MultiDiGraph' object has no attribute 'is_block'.

You need to convert the NetworkX graphs into DGLGraphs first. MultiDiGraph is a class for directed multigraphs in NetworkX.

bhavaygg commented 3 years ago

@mufeili Also, my graphs do not have any features associated with them; there are only the nodes and edges. For a graph like Graph(num_nodes=410, num_edges=1500, ndata_schemes={} edata_schemes={}), graphs[0].ndata gives {}. What changes should I make to get around this?

feats = batched_graph.ndata['attr'].float()
logits = model(batched_graph, feats)

I've tried passing empty and filled tensors and lists, but both give errors.

mufeili commented 3 years ago

> @mufeili Also, my graphs do not have any features associated with them; there are only the nodes and edges. For a graph like Graph(num_nodes=410, num_edges=1500, ndata_schemes={} edata_schemes={}), graphs[0].ndata gives {}. What changes should I make to get around this?
>
> feats = batched_graph.ndata['attr'].float()
> logits = model(batched_graph, feats)
>
> I've tried passing empty and filled tensors and lists, but both give errors.

1. If the NetworkX graphs have node features (attributes in NetworkX), you can load them by specifying node_attrs when using dgl.from_networkx.
2. If your graphs do not have node features, you can use features like node degrees or simply learn node embeddings from scratch; see the sketch after this list.
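
For the degree option, a minimal sketch could look like the following; the key 'attr' is chosen only to match the batched_graph.ndata['attr'] access in the training snippet above:

import dgl

g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))             # toy featureless graph
# Use each node's in-degree as a 1-dimensional node feature.
g.ndata['attr'] = g.in_degrees().float().unsqueeze(-1)  # shape (N, 1)
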
bhavaygg commented 3 years ago

@mufeili I was trying to talk about graph attributes, not node attributes.

mufeili commented 3 years ago

> @mufeili I was trying to talk about graph attributes, not node attributes.

For graph attributes, you can treat them as additional labels and process them in the same way as graph labels.
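
One possible way to do that, sketched under the assumption that each datapoint is a (graph, label, graph_attr) triple, is to batch the graph attributes in the collate function alongside the labels:

import dgl
import torch

def collate(samples):
    # Each sample is assumed to be a (graph, label, graph_attr) triple.
    graphs, labels, graph_attrs = map(list, zip(*samples))
    batched_graph = dgl.batch(graphs)
    batched_labels = torch.stack(labels)       # one label tensor per graph
    batched_attrs = torch.stack(graph_attrs)   # one attribute tensor per graph
    return batched_graph, batched_labels, batched_attrs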

bhavaygg commented 3 years ago

@mufeili How can I implement k-fold validation on DGL graphs?

from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=3,shuffle=True, random_state=1337)
for train, test in kfold.split(data, labels):
  train_data = list(zip(data[train], labels[train]))
  test_data = list(zip(data[test], labels[test]))

Here data is the list of DGL graphs, and it throws ValueError: only one element tensors can be converted to Python scalars.

mufeili commented 3 years ago

> @mufeili How can I implement k-fold validation on DGL graphs?
>
> from sklearn.model_selection import StratifiedKFold
> kfold = StratifiedKFold(n_splits=3,shuffle=True, random_state=1337)
> for train, test in kfold.split(data, labels):
>   train_data = list(zip(data[train], labels[train]))
>   test_data = list(zip(data[test], labels[test]))
>
> Here data is the list of DGL graphs, and it throws ValueError: only one element tensors can be converted to Python scalars.

What are data and labels? Can you provide a toy example? I guess you need to implement k-fold cross-validation manually rather than use StratifiedKFold from scikit-learn.

bhavaygg commented 3 years ago

data[0] = DGLGraph(num_nodes=57211, num_edges=136670,
         ndata_schemes={}
         edata_schemes={'norm': Scheme(shape=(), dtype=torch.float32), 'rel_type': Scheme(shape=(17,), dtype=torch.float64)})
labels[0]=torch.tensor([0,1])
mufeili commented 3 years ago

> data[0] = DGLGraph(num_nodes=57211, num_edges=136670,
>          ndata_schemes={}
>          edata_schemes={'norm': Scheme(shape=(), dtype=torch.float32), 'rel_type': Scheme(shape=(17,), dtype=torch.float64)})
> labels[0]=torch.tensor([0,1])

How many graphs do you have? What is the shape of labels? Are the labels for node classification?

bhavaygg commented 3 years ago

@mufeili There are 551 graphs. labels is a list of tensors for graph classification, so its length is 551.

mufeili commented 3 years ago

Assuming we follow the standard practice for developing a custom PyTorch dataset, this needs to be something like

class Dataset:
    def __init__(self):
        ...

    def __getitem__(self, idx):
        """
        Returns
        --------
        DGLGraph
            The i-th graph.
        labels
            The labels for the i-th datapoint.
        """

    def __len__(self):
        """
        Returns
        --------
        int
            The size of the dataset.
        """

You can then implement k-fold splitting as follows:

import random

class Subset(object):
    """Subset of a dataset at specified indices
    Code adapted from PyTorch.

    Parameters
    ----------
    dataset
        dataset[i] should return the ith datapoint
    indices : list
        List of datapoint indices to construct the subset
    """
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, item):
        """Get the datapoint indexed by item

        Returns
        -------
        tuple
            datapoint
        """
        return self.dataset[self.indices[item]]

    def __len__(self):
        """Get subset size

        Returns
        -------
        int
            Number of datapoints in the subset
        """
        return len(self.indices)

def k_fold_split(dataset, k, shuffle=True):
    """
    Parameters
    -----------
    dataset
        An instance for the Dataset class defined above.
    k: int
        The number of folds.
    shuffle: bool
        Whether to shuffle the dataset before performing a k-fold split.

    Returns
    --------
    list of length k
        Each element is a tuple (train_set, val_set) corresponding to a fold.
    """
    assert k >= 2, 'Expect the number of folds to be no smaller than 2, got {:d}'.format(k)
    all_folds = []
    indices = list(range(len(dataset)))
    if shuffle:
        random.shuffle(indices)
    frac_per_part = 1. / k
    data_size = len(dataset)
    for i in range(k):
        # Slice indices must be integers, so round the fold boundaries.
        val_start = int(data_size * i * frac_per_part)
        val_end = int(data_size * (i + 1) * frac_per_part)
        val_indices = indices[val_start:val_end]
        val_subset = Subset(dataset, val_indices)
        train_indices = indices[:val_start] + indices[val_end:]
        train_subset = Subset(dataset, train_indices)
        all_folds.append((train_subset, val_subset))
    return all_folds
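
A possible way to use the helper above, reusing the collate function from earlier in this thread and assuming dataset follows the Dataset sketch:

from torch.utils.data import DataLoader

folds = k_fold_split(dataset, k=5)
for fold_id, (train_set, val_set) in enumerate(folds):
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=collate)
    val_loader = DataLoader(val_set, batch_size=32, shuffle=False, collate_fn=collate)
    # Train on train_loader and evaluate on val_loader for this fold.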