benedekrozemberczki / pytorch_geometric_temporal

PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models (CIKM 2021)
MIT License

[question] link prediction - how to formulate the dataloader in a non-regression way #62

Closed geoHeil closed 3 years ago

geoHeil commented 3 years ago

How can I perform spatio-temporal link prediction on an unweighted graph with features on the edges?

My graph is a DynamicGraphTemporalSignal, but I do not want to regress on some signal; rather, I want to perform link prediction as outlined in https://pytorch-geometric-temporal.readthedocs.io/en/latest/_modules/torch_geometric_temporal/nn/recurrent/gc_lstm.html?highlight=link

But from looking at the example code at https://github.com/benedekrozemberczki/pytorch_geometric_temporal/blob/4fb610326811906eb4f95152a80f5df3f62fb459/test/recurrent_test.py#L305, it is still unclear to me how to create a suitable data loader.

So far I have looked into all the available ones from https://github.com/benedekrozemberczki/pytorch_geometric_temporal/tree/master/torch_geometric_temporal/dataset and think that the Twitter tennis data loader https://github.com/benedekrozemberczki/pytorch_geometric_temporal/blob/master/torch_geometric_temporal/dataset/twitter_tennis.py is the most suitable one.

However, even this one seems to be stuck in formulating a regression problem.

NOTICE: even though the edges carry features such as location, time, and hobby, I am happy to predict a new suitable link from a purely topological perspective; features such as time, location, or hobby would not need to match the ground truth data when evaluating a match on the test dataset.

Please find the code to produce the dummy data below. It also contains the transformations I have come up with so far from looking at existing data loaders.

But for me, it remains unclear how to structure the data loader for a link prediction problem and not for a regression.

  import pandas as pd
  import numpy as np

  n_people = 20
  links = 80
  seed = 47
  time_steps = 10
  hobby_categories = 3

  np.random.seed(seed)  # use the seed defined above

  # random dummy edge list: pairs of people with a time step and edge features
  df_dummy = pd.DataFrame({'person_1': np.random.randint(0, n_people-1, links),
                           'person_2': np.random.randint(0, n_people-1, links),
                           'time': np.random.randint(0, time_steps-1, links),
                           'lat': np.random.uniform(10, 19, links),
                           'lng': np.random.uniform(45, 49, links),
                           'hobby': np.random.randint(0, hobby_categories-1, links)})
  display(df_dummy.head())

  # collect edge endpoints and edge features into list-valued columns
  df_dummy['edge_index'] = df_dummy[['person_1', 'person_2']].values.tolist()
  df_dummy['edge_features'] = df_dummy[['lat', 'lng', 'hobby']].values.tolist()
  df_dummy = df_dummy[['time', 'edge_index', 'edge_features']]
  display(df_dummy.head())

  # one row per time step, holding the lists of edges and their features
  df_dummy = df_dummy.groupby(['time']).agg(lambda x: list(x))
  display(df_dummy.head())
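
For reference, here is a rough sketch of how the grouped frame could be wrapped into a DynamicGraphTemporalSignal; the node features and targets below are only placeholders (identity rows and zeros), since the target formulation is exactly the open question, and I am not sure this is the intended pattern:

import numpy as np
from torch_geometric_temporal.signal import DynamicGraphTemporalSignal

edge_indices, edge_weights, features, targets = [], [], [], []
for t in sorted(df_dummy.index):
    edges = np.array(df_dummy.loc[t, 'edge_index']).T   # shape [2, num_edges_t]
    edge_indices.append(edges)
    edge_weights.append(np.ones(edges.shape[1]))         # unweighted graph
    features.append(np.eye(n_people))                    # placeholder node features
    targets.append(np.zeros(n_people))                   # placeholder per-node targets

signal = DynamicGraphTemporalSignal(edge_indices, edge_weights, features, targets)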

edit

I guess the target must somehow be the lagged edge index of a future time step? Perhaps even of any future time step, depending on the forecast horizon.

Perhaps something along these lines would work? In this example an unbounded prediction horizon is assumed.

current_timestep = 3
# collect the edges of all later snapshots (t+1 onwards)
future_edges_target = pd.DataFrame(df_dummy.loc[current_timestep + 1:].edge_index).explode('edge_index').edge_index  # .unique()

# as the graph is undirected, sorting the edge endpoints is fine
future_edges_target = future_edges_target.apply(sorted).transform(tuple).unique()
future_edges_target

array([(8, 8), (1, 7), (7, 11), (2, 16), (5, 18), (11, 18), (6, 11),
       (2, 5), (14, 16), (6, 6), (16, 18), (9, 17), (1, 2), (0, 6),
       (0, 5), (0, 15), (9, 18), (9, 12), (1, 17), (2, 14), (8, 13),
       (1, 18), (1, 8), (14, 17), (5, 6), (3, 6), (11, 14), (10, 17),
       (4, 14), (7, 12), (0, 18), (13, 15), (9, 15)], dtype=object)
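
One way I could imagine turning this into an actual training target (just a sketch, `target_adj` is my own made-up name) is a dense 0/1 matrix over node pairs that a link prediction head could be scored against:

# binary adjacency target built from the future edge tuples above
target_adj = np.zeros((n_people, n_people))
for src, dst in future_edges_target:
    target_adj[src, dst] = 1
    target_adj[dst, src] = 1   # undirected
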
geoHeil commented 3 years ago

Is the type of the graph (DynamicGraphTemporalSignal) even suitable?

The feature set and node labels (target) are also dynamic. https://github.com/benedekrozemberczki/pytorch_geometric_temporal/blob/a13ea7876525ed9ba7c48ec69408024346eaded3/torch_geometric_temporal/signal/dynamic_graph_temporal_signal.py#L14

This sounds to me as if only some classification/regression setup works out of the box? I hope this is wrong and that my formulation of future edges as the target, given a specific timestamp, works as well.

benedekrozemberczki commented 3 years ago

You could define a custom loss on edge pairs that you look up from the edge features, or keep some running temp variable for lagged states. I cannot see what exact problem you are facing with this: you can keep two snapshots in memory, iteratively replace them, and predict for the second one, e.g. with a queue.
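
A minimal sketch of that queue idea (assuming `signal` is any of the temporal signal iterators; the loss computation is left out on purpose):

prev = None
for snapshot in signal:
    if prev is not None:
        # the older snapshot is the model input, the newer one holds the
        # edges that should be predicted
        model_input = (prev.x, prev.edge_index, prev.edge_attr)
        positive_edges = snapshot.edge_index
        # ... run the model on model_input and score positive_edges here ...
    prev = snapshot   # replace the older snapshot (queue of length two)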

geoHeil commented 3 years ago

Some publications refer here to a regularized L2 distance.

My problem is that, being very new to deep learning and in particular geometric deep learning, everything is still a bit confusing, especially when starting out with something that is not covered in the starter examples and involves so many new APIs and layers of APIs.

For the loss function something simple along these lines should be fine to start out with:

import pandas as pd
dummy_loss = pd.DataFrame({'src': [1, 2], 'dst': [3, 4], 'src_pred': [2, 2], 'dst_pred': [3, 4]})
# undirected - inverse matches also count
dummy_loss.loc[
    ((dummy_loss.dst == dummy_loss.src_pred) & (dummy_loss.src == dummy_loss.dst_pred))
    | ((dummy_loss.src == dummy_loss.src_pred) & (dummy_loss.dst == dummy_loss.dst_pred)),
    'is_correct'] = 1

dummy_loss.is_correct = dummy_loss.is_correct.fillna(0)
display(dummy_loss)
dummy_loss.is_correct.value_counts(normalize=True)
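
I realize the exact-match check above is more of an evaluation metric than something I can backpropagate through; for training, a sketch of a differentiable stand-in would be binary cross-entropy over per-edge scores (the logits here are made up):

import torch
import torch.nn.functional as F

scores = torch.randn(6)                           # hypothetical logits for 6 candidate edges
labels = torch.tensor([1., 1., 1., 0., 0., 0.])   # 1 = real future edge, 0 = sampled non-edge
loss = F.binary_cross_entropy_with_logits(scores, labels)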

You could define a custom loss on edge pairs that you look up from the edge features or have some running temp variable for lagged states

Sure, but it is unclear to me whether this should go into the data loader or, somewhat implicitly, into the training loop.

Your data loaders for regression always have a tight relationship between the features and targets:

        stacked_target = np.array(self._dataset["y"])
        standardized_target = (stacked_target - np.mean(stacked_target, axis=0)) /(np.std(stacked_target, axis=0)+10**-10)
        self.features = [standardized_target[i:i+self.lags,:].T for i in range(self._dataset["time_periods"]-self.lags)]
        self.targets = [standardized_target[i+self.lags,:].T for i in range(self._dataset["time_periods"]-self.lags)]

How could it work that edges are predicted, rather than being given as input (edges with weights, nodes with temporal features)? I.e. currently there seems to be a disconnect between the features (temporal node or edge features) and the target (the topological edges of a future time step).
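
What I imagine (just a sketch mirroring the lag construction above, with placeholder identity node features) is a pairing where the target list is shifted by one snapshot:

times = sorted(df_dummy.index)
# inputs at time t (placeholder identity features per node)
features = [np.eye(n_people) for _ in times[:-1]]
# target for time t: the edge index observed at time t + 1
targets = [np.array(df_dummy.loc[t, 'edge_index']).T for t in times[1:]]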

Most other approaches I have seen so far for edge prediction seem to use negative sampling and a temporal random walk/embedding, followed by binary classification. I am unsure whether this is needed with PyTorch Geometric Temporal as well, and if yes, where it fits into the framework. Should it go into the data loader?
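
To make the question concrete, here is a rough sketch of what I had in mind, with negative sampling happening in the training loop rather than in the loader. It uses GCLSTM with a simple dot-product decoder, identity node features, and the `signal` sketched further above; none of this is meant as the official way to do it:

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling
from torch_geometric_temporal.nn.recurrent import GCLSTM

model = GCLSTM(in_channels=n_people, out_channels=32, K=1)   # identity features -> in_channels = n_people
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

h, c, prev = None, None, None
for snapshot in signal:
    if prev is not None:
        h, c = model(prev.x, prev.edge_index, prev.edge_attr, h, c)
        pos = snapshot.edge_index                                    # edges appearing in the next snapshot
        neg = negative_sampling(pos, num_nodes=n_people,
                                num_neg_samples=pos.size(1))         # sampled non-edges
        # dot-product decoder: score(u, v) = <h_u, h_v>
        pos_score = (h[pos[0]] * h[pos[1]]).sum(dim=-1)
        neg_score = (h[neg[0]] * h[neg[1]]).sum(dim=-1)
        scores = torch.cat([pos_score, neg_score])
        labels = torch.cat([torch.ones(pos.size(1)), torch.zeros(neg.size(1))])
        loss = F.binary_cross_entropy_with_logits(scores, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        h, c = h.detach(), c.detach()
    prev = snapshot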

benedekrozemberczki commented 3 years ago

Added an integer-based index to all of the iterators and created a new release (0.33), which allows direct indexing of snapshots.
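
For example, with `signal` being one of the iterators:

snapshot = signal[4]   # direct access to the snapshot at time index 4 (new in 0.33)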