THUMNLab / AutoGL

An autoML framework & toolkit for machine learning on graphs.
http://mn.cs.tsinghua.edu.cn/AutoGL/
Apache License 2.0

Issues when loading personal data #47

Closed SimonCrouzet closed 3 years ago

SimonCrouzet commented 3 years ago

Hello, and thanks for this great autoML framework!

I'm running into some issues when I try to use my own data with AutoGL. I have a list of PyTorch Geometric Data objects and tried to initialise a dataset following your documentation, but it doesn't behave the same way as the prebuilt datasets with random_splits_mask_class():

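from torch_geometric.data import Data, InMemoryDataset
from autogl.datasets import build_dataset_from_name
from autogl.solver import AutoNodeClassifier

# Import my own dataset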
data_list = [Data(...), ..., Data(...)]
class MyDataset(InMemoryDataset):
    def __init__(self, datalist) -> None:
        super().__init__()
        self.data, self.slices = self.collate(datalist)
myData = MyDataset(data_list)

# Use a prebuilt dataset
cora_dataset = build_dataset_from_name('cora')

# Trying to use AutoNodeClassifier
solver = AutoNodeClassifier(
    feature_module='deepgl',
    graph_models=['gcn', 'gat'],
    hpo_module='anneal',
    ensemble_module='voting',
    device=device
)
solver.fit(myData, train_split=0.8, val_split=0.2, time_limit=3600)

With myData, I get the error AssertionError: the total number of samples from every class used for training and validation is larger than the total samples in class 0, which I believe comes from a difference in dataset structure:

# lines 119 and 129 of autogl/datasets/utils.py
data = dataset[0]
num_classes = data.y.max().cpu().item() + 1

print(cora_dataset)
# Cora()
print(cora_dataset[0])
# Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
print(cora_dataset.data)
# Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])

print(myData)
# MyDataset(375)
print(myData[0])
# Data(edge_index=[2, 264], idx=[1], pair="XXX_X--XXX_X", x=[39, 24], y=[1])
print(myData.data)
# Data(edge_index=[2, 123456], idx=[375], pair=[375], x=[13950, 24], y=[375])

(pair is a custom metadata field in my data)
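
The two dumps differ in where the labels live: cora_dataset[0] has one label per node (y=[2708]), while myData[0] has a single label for the whole graph (y=[1]). The sketch below shows how a per-class split check could fail on such data. It uses plain Python lists instead of tensors; the function name check_per_class_split, the label values, and the per-class counts are all hypothetical, and the check is inferred from the error message, not copied from autogl/datasets/utils.py.

```python
from collections import Counter


def check_per_class_split(labels, num_train_per_class, num_val_per_class):
    """Hypothetical stand-in for a per-class split check.

    Raises AssertionError when any class has fewer samples than
    num_train_per_class + num_val_per_class.
    """
    # Mirrors `data.y.max() + 1` from the snippet above.
    num_classes = max(labels) + 1
    counts = Counter(labels)
    needed = num_train_per_class + num_val_per_class
    for cls in range(num_classes):
        assert counts[cls] >= needed, (
            f"class {cls} has {counts[cls]} samples, "
            f"but {needed} are needed for training and validation"
        )
    return num_classes


# Node-level labels: many labelled nodes inside one graph (made-up values).
node_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(check_per_class_split(node_labels, 1, 1))

# Graph-level label: the first graph exposes only one label (made-up value).
graph_label = [1]
try:
    check_per_class_split(graph_label, 1, 1)
except AssertionError as err:
    print("AssertionError:", err)
```

With a single label in dataset[0], no class can supply both a training and a validation sample, so a check of this kind fails whatever split sizes are requested.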

Am I missing something, or is there an unexpected behaviour from the suggested MyDataset?

Thanks in advance,

Frozenmad commented 3 years ago

Hi @SimonCrouzet, thanks for using AutoGL! I see that your dataset contains multiple graphs and that each graph has only one label. Are you trying to perform graph classification? If so, you should use AutoGraphClassifier instead. Currently, the node classifier is designed only for classification on a dataset containing a single graph (transductive mode). Let me know if you have other concerns :)
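
As a rough illustration of the distinction drawn above, the following sketch guesses which solver fits a dataset from three numbers: how many graphs it holds, and how many nodes and labels the first graph has. The helper pick_solver is hypothetical and not part of AutoGL; the shapes passed in are the ones printed earlier in this thread.

```python
def pick_solver(num_graphs, num_nodes_first, num_labels_first):
    """Hypothetical helper: guess the matching AutoGL solver from dataset shapes."""
    # One graph with one label per node: node classification (transductive).
    if num_graphs == 1 and num_labels_first == num_nodes_first:
        return "AutoNodeClassifier"
    # Several graphs with one label each: graph classification.
    if num_graphs > 1 and num_labels_first == 1:
        return "AutoGraphClassifier"
    return "unclear"


# Cora: 1 graph, x=[2708, 1433], y=[2708]
print(pick_solver(1, 2708, 2708))  # AutoNodeClassifier

# myData: 375 graphs, first graph x=[39, 24], y=[1]
print(pick_solver(375, 39, 1))  # AutoGraphClassifier
```

For a PyTorch Geometric dataset, the three arguments would presumably come from len(dataset), dataset[0].num_nodes, and dataset[0].y.shape[0].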

Frozenmad commented 3 years ago

Closing because there has been no response for a long time.