benedekrozemberczki / pytorch_geometric_temporal

PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models (CIKM 2021)
MIT License

What type of signal object, and which shapes to adopt for the indices/features/targets? #222

Closed LaurentBerder closed 1 year ago

LaurentBerder commented 1 year ago

Hi,

First of all, thanks a lot for all the work you've put into the library; it's really amazing.

I'm working on a project with a graph whose nodes and edges both carry features: the node features are static (their values are constant over time), while the edge features are dynamic (their values change over time). The node targets are also dynamic: I'm planning a node-level regression over a few timesteps into the future.

I'm having a hard time figuring out which kind of signal would best suit my needs. ChickenpoxDatasetLoader does not seem appropriate since everything is static there, so I probably need something more along the lines of METRLADatasetLoader, which produces a StaticGraphTemporalSignal.

So I tried taking inspiration from this example which uses METRLA. But it only seems to take node data into account and I see no mention of the edges and their potential features.

I saw an interesting rewrite of the dataloader in #84, so I tried writing my own version and came up with the following:

import warnings
from datetime import timedelta

import numpy as np
import pandas as pd
import torch_geometric_temporal
from tqdm import tqdm


class CLoader(object):
    def __init__(self, edges, features, targets, nb_node_features=3, nb_edge_features=2):
        super(CLoader, self).__init__()
        self.edges = edges
        self.features = features
        self.targets = targets
        self.nb_node_features = nb_node_features
        self.nb_edge_features = nb_edge_features
        self.nb_nodes = len(self.features.PORT.unique())
        print("Ready for 'get_dataset' method")

    def _setting_timeframe(self):
        print("Getting time bounds")
        # Look at all the dates present in the data
        self.days = pd.concat([pd.DataFrame(self.edges.DAY.unique()),
                        pd.DataFrame(self.features.DAY.unique()),
                        pd.DataFrame(self.targets.DAY.unique())])\
                .rename(columns={0: 'DAY'})\
                .drop_duplicates()\
                .sort_values('DAY')
        # Count how many days between each date (should normally be 1)
        self.days['difference'] = self.days.DAY.diff().dt.days
        # Parenthesize the comparison: `&` binds more tightly than `>`
        if len(self.days[~pd.isna(self.days.difference) & (self.days.difference > 1)].index) > 0:
            # Instantiating Warning(...) alone does nothing; emit it explicitly
            warnings.warn("Some of the timesteps are missing!")

        # Limit data to fully available data
        self.time_limits = self.days[(self.days.DAY >= self.days.DAY.min() + timedelta(days=self.days_in)) & (self.days.DAY <= self.days.DAY.max() - timedelta(days=self.days_out))]
        self.lowest_day = self.time_limits.DAY.min()
        self.highest_day = self.time_limits.DAY.max() - timedelta(days=1)
        self.nb_batch = len(self.time_limits.index)

    def _get_edges(self):
        self.edge_indices = []
        self.edge_features = []
        # Setting edge features in shape (nb_snapshot, nb_edge_features, nb_edges)
        # And     edge indices  in shape (nb_snapshot, 2, nb_edges)
        for day in tqdm(self.days[self.days.DAY.between(self.lowest_day, self.highest_day)].DAY, desc="Formatting edges"):
            batch = self.edges[self.edges.DAY == day]
            # Transpose (rather than reshape) so that row 0 holds origins and row 1 destinations
            batch_index = batch[['origin_encoded', 'destination_encoded']].values.T
            batch_feat = batch[['transit', 'distance']].values.T
            self.edge_indices.append(batch_index)
            self.edge_features.append(batch_feat)
        self.edge_indices = np.array(self.edge_indices)
        self.edge_features = np.array(self.edge_features)

        print('Got edge_indices: --> shape{s}'.format(s=self.edge_indices.shape))
        print('Got edge_features: --> shape{s}'.format(s=self.edge_features.shape))

    def _get_targets_and_features(self):
        self.node_features = []
        self.node_targets = []
        # Setting node features in shape (nb_snapshot, nb_nodes, nb_node_features, days_in)
        # And     node targets  in shape (nb_snapshot, nb_nodes, days_out)
        for day in tqdm(self.days[self.days.DAY.between(self.lowest_day, self.highest_day)].DAY, desc="Formatting nodes"):
            country_batch = self.features[self.features.DAY.between(day - timedelta(days=self.days_in), day - timedelta(days=1))]\
                                    .pivot_table(index='node', columns='DAY', values='country', dropna=False).values
            continent_batch = self.features[self.features.DAY.between(day - timedelta(days=self.days_in), day - timedelta(days=1))]\
                                    .pivot_table(index='node', columns='DAY', values='continent', dropna=False).values
            region_batch = self.features[self.features.DAY.between(day - timedelta(days=self.days_in), day - timedelta(days=1))]\
                                    .pivot_table(index='node', columns='DAY', values='region', dropna=False).values
            # Stack along axis 1 to get (nb_nodes, nb_node_features, days_in) without scrambling values
            feature_batch = np.stack([country_batch, continent_batch, region_batch], axis=1)
            target_batch = self.targets[self.targets.DAY.between(day + timedelta(days=1), day + timedelta(days=self.days_out))]\
                                    .pivot_table(index='node', columns='DAY', values='target', dropna=False).values
            self.node_features.append(feature_batch)
            self.node_targets.append(target_batch)
        self.node_features = np.array(self.node_features)
        self.node_targets = np.array(self.node_targets)

        print('Got node_features: --> shape{s}'.format(s=self.node_features.shape))
        print('Got node_targets: --> shape{s}'.format(s=self.node_targets.shape))

    def prepare_data(self):
        self._setting_timeframe()
        self._encode_variables()
        self._get_edges()
        self._get_targets_and_features()

    def get_dataset(self, days_in=30, days_out=30):
        self.days_in = days_in
        self.days_out = days_out
        self.prepare_data()
        dataset = torch_geometric_temporal.DynamicGraphTemporalSignal(edge_indices=self.edge_indices, edge_weights=self.edge_features, features=self.node_features, targets=self.node_targets)
        return dataset

loader = CLoader(edges=edge_features, features=node_features, targets=node_targets, nb_node_features=3, nb_edge_features=2)
data = loader.get_dataset(days_in=30, days_out=30)
Getting time bounds
Encoding variables
Formatting edges: 100%|██████████| 1843/1843 [02:39<00:00, 11.58it/s]
Got edge_indices: --> shape(1843, 2, 13338)
Got edge_features: --> shape(1843, 2, 13338)
Formatting nodes: 100%|██████████| 1843/1843 [03:40<00:00,  8.35it/s]
Got node_features: --> shape(1843, 715, 3, 30)
Got node_targets: --> shape(1843, 715, 30)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 90, in get_dataset
  File "python3.7/site-packages/torch_geometric_temporal/signal/dynamic_graph_temporal_signal.py", line 46, in __init__
    self._check_temporal_consistency()
  File "python3.7/site-packages/torch_geometric_temporal/signal/dynamic_graph_temporal_signal.py", line 58, in _check_temporal_consistency
    ), "Temporal dimension inconsistency."
AssertionError: Temporal dimension inconsistency.

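As you can see from the shapes printed above, I have 1843 snapshots, 13338 edges with 2 edge features each, and 715 nodes with 3 node features each, using 30 input days and 30 target days.

I cannot find any information on the shapes my node and edge objects should have in order to be valid inputs to a Signal object.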

My intention is building a signal where each time step is a torch_geometric graph, like so:

from torch_geometric.data import Data
Data(x=loader.node_features[0,:,:,:], edge_index=loader.edge_indices[0,:,:], edge_attr=loader.edge_features[0,:,:], y=loader.node_targets[0,:,:])
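Because torch_geometric lays out `edge_index` as (2, nb_edges), one detail worth checking is how an (nb_edges, 2) table of origin/destination pairs is turned into that layout. The sketch below uses a made-up three-edge table (the `pairs` array is purely illustrative, not data from this issue) to show that `reshape` and a transpose give different results:

```python
import numpy as np

# Hypothetical edge table: one row per edge, columns are (origin, destination).
pairs = np.array([[0, 1],
                  [1, 2],
                  [2, 0]])

# reshape keeps the flat element order, so origins and destinations get mixed.
reshaped = pairs.reshape(2, len(pairs))

# Transposing swaps the axes: row 0 holds origins, row 1 holds destinations.
transposed = pairs.T

print(reshaped.tolist())    # [[0, 1, 1], [2, 2, 0]]
print(transposed.tolist())  # [[0, 1, 2], [1, 2, 0]]
```

Both results have shape (2, 3), so a shape printout alone will not reveal the difference; only the transposed version still describes the edges 0→1, 1→2 and 2→0.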

LaurentBerder commented 1 year ago

I realize that my mistake was to combine everything into numpy arrays, when the inputs should remain lists of arrays (one array per snapshot).

After deleting the lines in _get_edges() and _get_targets_and_features() where I combined the lists into arrays, the error no longer occurs when building the DynamicGraphTemporalSignal.
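For readers who want to see the "lists of arrays" layout concretely, here is a minimal sketch on random toy data. The helper `check_snapshot_lists` is hypothetical (it is not part of torch_geometric_temporal), the sizes are arbitrary, and the (nb_edge_features, nb_edges) layout for edge features simply mirrors the loader in this thread. It only checks that the four lists hold the same number of snapshots and that each edge index has two rows:

```python
import numpy as np


def check_snapshot_lists(edge_indices, edge_features, node_features, node_targets):
    """Hypothetical helper: sanity-check per-snapshot lists before building a signal."""
    lengths = {len(edge_indices), len(edge_features), len(node_features), len(node_targets)}
    if len(lengths) != 1:
        raise ValueError("All four lists must hold the same number of snapshots")
    for t, index in enumerate(edge_indices):
        if index.ndim != 2 or index.shape[0] != 2:
            raise ValueError(f"Snapshot {t}: edge index must have shape (2, nb_edges)")
    return len(edge_indices)


rng = np.random.default_rng(0)
nb_snapshots, nb_nodes, days_in, days_out = 4, 5, 7, 7

# One array per snapshot, kept in plain Python lists.
# The number of edges is allowed to differ from one snapshot to the next.
edge_indices = [rng.integers(0, nb_nodes, size=(2, 3 + t)) for t in range(nb_snapshots)]
edge_features = [rng.random((2, 3 + t)) for t in range(nb_snapshots)]
node_features = [rng.random((nb_nodes, 3, days_in)) for _ in range(nb_snapshots)]
node_targets = [rng.random((nb_nodes, days_out)) for _ in range(nb_snapshots)]

n = check_snapshot_lists(edge_indices, edge_features, node_features, node_targets)
print(n)  # 4

# These lists could then be passed to DynamicGraphTemporalSignal using the same
# keyword arguments as in get_dataset() above (not executed in this sketch).
```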