TorchSpatiotemporal / tsl

tsl: a PyTorch library for processing spatiotemporal data.
https://torch-spatiotemporal.readthedocs.io/
MIT License

Questions about masking and mask dependencies during train/val phases #38

Closed by donofiva 3 weeks ago

donofiva commented 3 months ago

Hello everyone,

I am currently using TorchSpatiotemporal to conduct experiments for my Master's thesis in Data Science and Engineering under the supervision of Professor Paolo Garza.

The dataset I am working with is the SDPWF dataset, which was the main subject of the Baidu KDD competition in 2022. This dataset comprises data from over 100 sensors (wind turbines), recording approximately 10 different channels every ten minutes for 245 days. My task involves performing forecasting on this data. The objective is to compare various spatial-temporal deep learning architectures to understand how incorporating spatial information can improve prediction accuracy.

I have set up the necessary features and initialized the SpatioTemporalDataset and SpatioTemporalDataModule classes. Additionally, I have configured the Predictor and Trainer environment (see my Colab notebook here). I successfully trained a GraphWaveNetModel on this data by creating an SDPWFDataset class extending DatetimeDataset. The input dataframes are formatted correctly, with a datetime Pandas index representing the temporal dimension and a multi-column index mapping each wind turbine to its recorded channels. I also generate a dataset mask, a boolean dataframe indicating data availability for specific timeslots and wind turbines.
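For readers following along, the data layout described above can be sketched with plain pandas. This is only an illustration: the turbine names, channel names, start date, and values are hypothetical placeholders, not taken from the SDPWF dataset, and the mask here is derived simply from missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 ten-minute steps, 2 turbines, 2 channels each.
index = pd.date_range("2022-01-01", periods=4, freq="10min")
columns = pd.MultiIndex.from_product(
    [["turbine_1", "turbine_2"], ["wind_speed", "power"]],
    names=["nodes", "channels"],
)
values = np.arange(16, dtype=float).reshape(4, 4)
values[1, 0] = np.nan  # simulate a missing reading
values[2, 3] = np.nan  # simulate another missing reading

df = pd.DataFrame(values, index=index, columns=columns)

# Boolean mask with the same index and columns: True where data is available.
mask = df.notna()
```

The mask shares the index and columns of the data frame, so each entry marks whether that turbine and channel has a reading at that timestamp.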

I am seeking clarification on the dataset mask, as I couldn't find much information in the documentation or GitHub repository. My specific questions are:

I have summarized the issue here, but please feel free to ask for additional details if needed. Please also feel free to correct any misunderstandings. Thank you for your support!

Best regards,

andreacini commented 3 months ago

Hi @donofiva

  1. In forecasting models, by default, the mask is simply used to avoid computing the loss on data points that are missing.
  2. Adding the mask as a covariate is the safest option if you just want it to be concatenated to the input. Note that if you add a covariate called mask, it will be overridden by the default mask of the dataset (which is aligned with the forecasting horizon). You can use a different name to avoid this (e.g., input_mask).
  3. This is done automatically as long as you include a mask attribute in your dataset and feed it to the appropriate modules.
  4. You can specify the mask to filter out any value you want. As already mentioned, any value for which the mask is set to 0 is filtered out when computing the loss. If you want to do more complex operations, you would have to extend each model and add mask as an input.
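To make points 1, 2 and 4 concrete, here is a minimal sketch in plain PyTorch. It is not tsl's actual implementation: `masked_mae`, the tensor shapes, and the toy values are assumptions chosen for illustration, and `input_mask` simply mirrors the covariate name suggested above.

```python
import torch


def masked_mae(y_hat, y, mask):
    """Mean absolute error over the entries where mask is 1 (hypothetical helper)."""
    mask = mask.to(y_hat.dtype)
    abs_err = torch.abs(y_hat - y) * mask  # zero out errors at masked entries
    # Clamp avoids a division by zero when no entry in the batch is valid.
    return abs_err.sum() / mask.sum().clamp(min=1.0)


# Toy tensors with assumed layout [batch, time, nodes, channels].
x = torch.tensor([[[[0.2], [0.4]], [[0.6], [0.8]]]])
input_mask = torch.tensor([[[[1], [0]], [[1], [1]]]])

# Point 2: the input mask is concatenated to the input along the channel axis.
x_in = torch.cat([x, input_mask.to(x.dtype)], dim=-1)

# Points 1 and 4: entries where the target mask is 0 do not contribute to the loss.
y = torch.tensor([[[[1.0], [2.0]], [[3.0], [4.0]]]])
y_hat = torch.tensor([[[[1.5], [0.0]], [[3.0], [5.0]]]])
mask = torch.tensor([[[[1], [0]], [[1], [1]]]])

loss = masked_mae(y_hat, y, mask)
```

Here the large error at the masked entry is ignored, so the loss is averaged over the three valid entries only.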

I hope this helps.

Andrea