Train-Val-Test split does not maintain connectivity

mminici commented 6 months ago

Describe the bug: When creating a training-validation-test split for a dataset, some edges in the validation/test sets contain nodes that do not appear in any training edge.

To Reproduce

import torch

from torch_geometric_signed_directed.data import load_signed_real_data

# configuration
seed = 0
dataset_name = 'bitcoin_otc'

# Load data using torch geometric signed directed data loader
data = load_signed_real_data(dataset=dataset_name)

# Create several train, val, test splits

signed_datasets = data.link_split(prob_val=0.1,
                                  prob_test=0.1,
                                  task='sign',
                                  maintain_connect=True,
                                  seed=seed,
                                  splits=1)

# check that all nodes in validation and test have at least an edge in the training set
for split_id in signed_datasets:
    val_nodes = torch.unique(torch.flatten(signed_datasets[split_id]['val']['edges']))
    for node_id in val_nodes:
        node_id_mask = torch.logical_or(signed_datasets[split_id]['train']['edges'][:, 0] == node_id,
                                        signed_datasets[split_id]['train']['edges'][:, 1] == node_id)
        assert node_id_mask.sum().item() > 0, f'[VAL] node id: {node_id} has no incident edges in training set'
    test_nodes = torch.unique(torch.flatten(signed_datasets[split_id]['test']['edges']))
    for node_id in test_nodes:
        node_id_mask = torch.logical_or(signed_datasets[split_id]['train']['edges'][:, 0] == node_id,
                                        signed_datasets[split_id]['train']['edges'][:, 1] == node_id)

        assert node_id_mask.sum().item() > 0, f'[TEST] node id: {node_id} has no incident edges in training set'

Expected behavior: Every node in the validation and test sets (i.e., every node that is the source or target of a validation/test edge) should appear at least once in the training edges, since the maintain_connect parameter is True.

Additional context: We need to compute a feature vector for each node, so every node in the validation/test sets must appear at least once in the training set. Otherwise it is not possible to compute a feature vector for it.
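The same check can be expressed with sets, which avoids the per-node loop and lists all offending nodes at once instead of stopping at the first failure. This is a minimal sketch, not part of the library: the helper name `nodes_missing_from_train` is hypothetical, and it assumes each edge is a (source, target) pair of node ids.

```python
def nodes_missing_from_train(train_edges, eval_edges):
    """Return the sorted node ids that occur in eval_edges but in no training edge.

    Hypothetical helper for illustration; each edge is a (source, target) pair.
    """
    train_nodes = {node for edge in train_edges for node in edge}
    eval_nodes = {node for edge in eval_edges for node in edge}
    return sorted(eval_nodes - train_nodes)


# Toy example: node 4 appears only in the validation edges.
train = [(0, 1), (1, 2), (2, 3)]
val = [(0, 3), (3, 4)]
print(nodes_missing_from_train(train, val))
```

With the split from the reproduction above, the inputs would be `signed_datasets[split_id]['train']['edges'].tolist()` and `signed_datasets[split_id]['val']['edges'].tolist()`, assuming the edge tensors have shape (num_edges, 2) as the column indexing in the snippet suggests. An empty result means every validation node has at least one incident training edge.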

mminici commented 6 months ago

Another thing that is not clear: the total number of training, validation, and test edges is not equal to the number of edges before the training/validation/test split:

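for split_id in signed_datasets:
    # Count the edges of the current split (index by split_id rather than a hard-coded 0)
    num_split_edges = signed_datasets[split_id]['train']['label'].shape[0]
    num_split_edges += signed_datasets[split_id]['val']['label'].shape[0]
    num_split_edges += signed_datasets[split_id]['test']['label'].shape[0]
    # True if the split accounts for every edge of the original graph
    print(data.edge_weight.shape[0] == num_split_edges)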

SherylHYX commented 6 months ago

Hello @mminici, thank you for your comments. The first issue is now fixed with my newest pull request. For the second concern, this is expected as we would remove those edges that do not belong to any of the classes of interest for training, testing, or validation. Hope the above helps!
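To see whether any edges could fall outside the classes of interest, one option is to count the original edge weights by sign. This is only a diagnostic sketch: the helper `count_edges_by_sign` is hypothetical, and treating zero-weight edges as the ones without a class is an assumption, not a description of how the library filters edges.

```python
def count_edges_by_sign(weights):
    """Count edge weights by sign (hypothetical diagnostic helper).

    Assumption: a zero weight is treated as belonging to neither the
    positive nor the negative class.
    """
    counts = {'positive': 0, 'negative': 0, 'zero': 0}
    for w in weights:
        if w > 0:
            counts['positive'] += 1
        elif w < 0:
            counts['negative'] += 1
        else:
            counts['zero'] += 1
    return counts


# Toy example with made-up weights.
print(count_edges_by_sign([3, -1, 0, 5, -2]))
```

For the dataset in this issue the input would be `data.edge_weight.tolist()`. If the 'zero' count is 0, every edge has a sign, and the train, validation, and test counts would be expected to add up to the original number of edges.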

mminici commented 6 months ago

Hello @SherylHYX, it is my pleasure to contribute to your great open-source project. I don't understand your answer. The split task is "sign", so the classes of interest are {positive, negative}. Why would any edges be removed?

SherylHYX commented 6 months ago

For link sign prediction specifically, I get "True" as the output of your code snippet. Could you upgrade the installation to our latest version to see if there is still an issue here @mminici ? Thank you!

mminici commented 6 months ago

With the 0.23.0 version of the library, I can say my issue is resolved.

SherylHYX / pytorch_geometric_signed_directed

Train-Val-Test split does not maintain connectivity #57