TorchSpatiotemporal / tsl

tsl: a PyTorch library for processing spatiotemporal data.
https://torch-spatiotemporal.readthedocs.io/
MIT License

[Improving Documentation] Contributing inspectable notebook for imputation on custom dataset #25

Open b2jia opened 1 year ago

b2jia commented 1 year ago

Thank you for this amazing resource! Like others have raised in other issues, it seems:

In addition to those two points, I've also noticed that:

As a complete outsider to GNNs, I am wondering whether the authors could give me feedback on creating an example for beginners. In this way, I hope to contribute to the documentation so that even a complete novice (such as myself) can get started with tsl.

For instance, say there is a dataset of car trajectories collected over time. How can we go from a dataframe (shown below) to training a model in tsl that predicts the missing positions x, y, z?

import numpy as np
import pandas as pd

# Define number of trajectories and time points
num_traj = 5
num_timepoints = 10

# Generate random trajectories
data = pd.DataFrame(np.random.randn(num_traj*num_timepoints, 4), columns=['x', 'y', 'z', 't'])

# Assign trajectory ID for each time point
data['trajectory'] = np.repeat(np.arange(num_traj), num_timepoints)

# Set some values to NaN to represent missing positions
data.iloc[np.random.choice(data.index, size=10, replace=False), :3] = np.nan

# Set timepoints to the integers 0..num_timepoints-1, identical for every trajectory
data['t'] = np.tile(np.arange(num_timepoints), num_traj)

marshka commented 1 year ago

Hi, thank you for your passionate interest in our project! Contributions are indeed very welcome.

Keep in mind that tsl is meant to deal with spatiotemporal data, so yes, in principle, data coming from sensor networks. Such data typically have 3 dimensions: time, space (i.e., sensors/nodes), and features (thus accommodating multivariate sensor observations).
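To make the three-dimensional layout concrete, here is a minimal sketch, using only NumPy and pandas, of turning a long-format dataframe like the one above into a (time, nodes, features) array plus a boolean mask of observed values. Treating each trajectory as a node is an assumption that only fits the synchronous case, and the helper name `to_time_node_feature` is hypothetical, not part of tsl.

```python
import numpy as np
import pandas as pd


def to_time_node_feature(df, node_col="trajectory", time_col="t",
                         feature_cols=("x", "y", "z")):
    """Return (values, mask), both of shape (time, nodes, features).

    Hypothetical helper for illustration; it is not a tsl function.
    """
    feature_cols = list(feature_cols)
    # Map every row to its position along the node and time axes.
    nodes, node_idx = np.unique(df[node_col].to_numpy(), return_inverse=True)
    times, time_idx = np.unique(df[time_col].to_numpy(), return_inverse=True)
    # Start from all-NaN so (time, node) pairs absent from df count as missing.
    values = np.full((len(times), len(nodes), len(feature_cols)), np.nan)
    values[time_idx, node_idx] = df[feature_cols].to_numpy(dtype=float)
    # True where a value was actually observed.
    mask = ~np.isnan(values)
    return values, mask


demo = pd.DataFrame({
    "trajectory": [0, 0, 1, 1],
    "t": [0, 1, 0, 1],
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [0.0, np.nan, 1.0, 1.0],
    "z": [0.0, np.nan, 0.0, 0.0],
})
values, mask = to_time_node_feature(demo)
print(values.shape)  # (2, 2, 3): 2 time steps, 2 nodes, 3 features
```

The mask marks which entries an imputation model would be asked to fill in; how such arrays are handed to tsl is left to the library's own documentation.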

In your case, are the cars synchronously moving in the same space? Or do you rather have a collection of unrelated time series, each of which is a single-car trajectory constituting a single sample in the dataset?

tsl-like datasets are designed to model the first scenario, in which the different time series are synchronous and somehow connected. For the second case, we could think of another solution that bypasses the overhead of the sliding-window functionality in the TabularDataset. As far as I know, @LucaButera is working on something similar and may be able to help.
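For the second scenario, where each trajectory is an independent sample, a simple non-graph baseline can be useful for comparison. The sketch below linearly interpolates each trajectory on its own with pandas; it is not the solution the maintainers have in mind, and the function name `interpolate_per_trajectory` is hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_per_trajectory(df, group_col="trajectory", time_col="t",
                               feature_cols=("x", "y", "z")):
    """Fill missing positions by linear interpolation within each trajectory.

    Hypothetical baseline for illustration; it is not a tsl function.
    Assumes equally spaced time steps, since method="linear" ignores the index.
    """
    feature_cols = list(feature_cols)
    out = df.sort_values([group_col, time_col]).copy()
    # Interpolate every feature column separately inside each trajectory,
    # so values never leak from one trajectory into another.
    out[feature_cols] = out.groupby(group_col)[feature_cols].transform(
        lambda s: s.interpolate(method="linear", limit_direction="both")
    )
    return out


demo = pd.DataFrame({
    "trajectory": [0, 0, 0, 1, 1, 1],
    "t": [0, 1, 2, 0, 1, 2],
    "x": [0.0, np.nan, 2.0, 10.0, 11.0, 12.0],
    "y": [0.0, np.nan, 4.0, 0.0, 0.0, 0.0],
    "z": [1.0, np.nan, 1.0, 0.0, 0.0, 0.0],
})
filled = interpolate_per_trajectory(demo)
print(filled)
```

Any learned imputation model should at least beat this kind of baseline on held-out positions.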

b2jia commented 1 year ago

@marshka Thanks for this response! I see. Indeed, my problem is exactly as you put it: "a collection of unrelated time series, each of which is a single-car trajectory constituting a single sample in the dataset".

To start simply, I am wondering: is it possible to aid the imputation by passing in, as edge attributes, the original, complete distance matrix between positions over time? In other words, if I know how far the missing point should be from the known positions, can I recover the missing position? Then, if this works, a harder problem would be to solve it without the aid of this distance matrix. I would love to know what @LucaButera is working on that might help with this type of problem!
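As a purely geometric sanity check on that question (not a GNN and not a tsl feature), a point in 3D can be recovered from its distances to at least four known, non-coplanar positions by multilateration. The sketch below subtracts the first squared-distance equation from the others, which gives a linear system solved by least squares; the function name `locate_from_distances` and the toy anchor positions are assumptions made for illustration.

```python
import numpy as np


def locate_from_distances(anchors, distances):
    """Estimate a point q from its distances to known anchor positions.

    From ||q - p_i||^2 = d_i^2, subtracting the equation for anchor 0 gives
        2 (p_i - p_0) . q = ||p_i||^2 - ||p_0||^2 - d_i^2 + d_0^2,
    which is linear in q. Hypothetical helper; it is not a tsl function.
    """
    anchors = np.asarray(anchors, dtype=float)
    distances = np.asarray(distances, dtype=float)
    p0, d0 = anchors[0], distances[0]
    A = 2.0 * (anchors[1:] - p0)
    b = (np.sum(anchors[1:] ** 2, axis=1) - np.sum(p0 ** 2)
         - distances[1:] ** 2 + d0 ** 2)
    # Least squares also handles more than four anchors or noisy distances.
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q


# Toy example: four known positions and one "missing" position.
known = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
missing = np.array([0.3, -0.2, 0.5])
dists = np.linalg.norm(known - missing, axis=1)

recovered = locate_from_distances(known, dists)
print(np.allclose(recovered, missing))
```

So with a complete distance matrix the missing position is already determined by geometry whenever enough non-coplanar known positions exist, which suggests the distance-free variant is the case where a learned model would actually have something to add.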