WenjieDu / PyPOTS

A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series, including SOTA neural network models for scientific analysis tasks: imputation, classification, clustering, forecasting, anomaly detection, and cleaning of incomplete industrial (irregularly-sampled) multivariate time series with NaN missing values
https://pypots.com
BSD 3-Clause "New" or "Revised" License
1.04k stars · 99 forks

How can I customize my own dataset to fit PyPOTS SOTA imputation models? #141

Closed · abhishekju06 closed 1 year ago

abhishekju06 commented 1 year ago

1. Feature description

I want to run PyPOTS SOTA models on my own dataset.

2. Motivation

I have a multivariate dataset and want to check how PyPOTS models perform on it for data imputation.

3. Your contribution

None so far

WenjieDu commented 1 year ago

Hi there 👋,

Thank you so much for your attention to PyPOTS! If you find PyPOTS helpful to your work, please star ⭐️ this repository. Your star is your recognition, which can help more people notice PyPOTS and grow the PyPOTS community. It matters and is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best, Wenjie

WenjieDu commented 1 year ago

Hi, thank you for raising this issue. The only thing you need to do is, after your data preprocessing, ensure that the data you input into the models has 3 dimensions: [n_samples, n_steps, n_features].
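For instance, a toy array with that shape (purely illustrative, not from the thread) could look like this:

```python
import numpy as np

# 8 samples, each a window of 6 time steps over 3 features; missing values are NaN
X = np.random.randn(8, 6, 3)
X[X > 1.5] = np.nan

print(X.shape)  # (8, 6, 3) -> [n_samples, n_steps, n_features]
```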

abhishekju06 commented 1 year ago

What does n_steps indicate for my dataset? I suppose n_features represents the number of attributes, and n_samples represents the length of the dataframe.

Does data preprocessing consist of cleaning and normalization only?

WenjieDu commented 1 year ago

n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset, as sketched below.
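A minimal sliding-window sketch (the window length and stride here are arbitrary illustrative choices, not values prescribed by PyPOTS):

```python
import numpy as np

def sliding_window(ts_2d: np.ndarray, window_len: int, stride: int) -> np.ndarray:
    """Cut a [total_steps, n_features] series into [n_samples, window_len, n_features]."""
    starts = range(0, ts_2d.shape[0] - window_len + 1, stride)
    return np.stack([ts_2d[s : s + window_len] for s in starts])

series = np.random.randn(1000, 3)                             # 2D series: 1000 steps, 3 features
samples = sliding_window(series, window_len=100, stride=100)  # non-overlapping windows
print(samples.shape)                                          # (10, 100, 3)
```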

Yes, of course, cleaning and normalization are included in preprocessing. You know, machine learning is not magic; you have to prepare things properly for model processing.
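For example, a common normalization step (scikit-learn is used here purely as an illustration; fitting the scaler on the training split alone avoids data leakage):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy 2D series, [total_steps, n_features], with some missing values
train_2d = np.random.randn(800, 3)
train_2d[np.random.rand(800, 3) < 0.1] = np.nan
val_2d = np.random.randn(200, 3)

scaler = StandardScaler()                   # NaNs are ignored in fit and preserved in transform
train_2d = scaler.fit_transform(train_2d)   # fit on the training split only
val_2d = scaler.transform(val_2d)
```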

abhishekju06 commented 1 year ago

In my case, the number of time steps in each sample is the same as the length of the dataframe. Can you give me a reference for a sliding-window algorithm to generate the 3D dataset? It would be of great help.

WenjieDu commented 1 year ago

Please simply try searching on Google or GitHub; I believe you can figure it out quickly. This is not a complicated algorithm, just a simple method.

abhishekju06 commented 1 year ago

Thanks a lot!

WenjieDu commented 1 year ago

My pleasure! @abhishekju06 Just remembered that you can find such a sliding-window function among the data-processing utilities in the SAITS repo here. If you are using the SAITS model for your data imputation and think it's helpful, please kindly consider starring 🌟 the SAITS repo to help more people notice this useful model. Many thanks!

abhishekju06 commented 1 year ago

> n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset.

Can you please help me understand 1) n_samples, 2) n_steps, and 3) sequence length with an example?

I have created the dataset:

```
2023-06-13 17:41:40,422 - Already masked out 10.0% values in train set
2023-06-13 17:41:40,475 - In val set, num of artificially-masked values: 7917.0
2023-06-13 17:41:40,475 - In test set, num of artificially-masked values: 7244.0
2023-06-13 17:41:40,476 - Feature num: 3,
                          7805 (0.936) samples in train set
                          281 (0.034) samples in val set
                          257 (0.031) samples in test set
2023-06-13 17:41:40,496 - All done.
```

Below is my code:

```python
import h5py

f = h5py.File('datasets.h5')
f.keys()

for key in f.keys():
    print(key)           # names of the root-level objects in the HDF5 file - can be groups or datasets
    print(type(f[key]))  # get the object type: usually group or dataset

#################
group_train = f['train']
for key in group_train.keys():
    print("Train:", key)

dataset_for_training = {
    "X": group_train['X'][()],
}

#############################
group_val = f['val']
for key in group_val.keys():
    print("Val:", key)

dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}

#############################
group_test = f['test']
for key in group_test.keys():
    print("Test:", key)

dataset_for_testing = {
    "X": group_test['X'][()],
}

from pypots.optim import Adam
from pypots.imputation import SAITS

saits = SAITS(
    n_steps=100,    # physionet2012_dataset['n_steps'],
    n_features=3,   # physionet2012_dataset['n_features'],
    n_layers=2,
    d_model=256,
    d_inner=128,
    n_heads=4,
    d_k=64,
    d_v=64,
    dropout=0.1,
    attn_dropout=0.1,
    diagonal_attention_mask=True,  # otherwise the original self-attention mechanism will be applied
    # you can adjust the weight values of arguments ORT_weight and MIT_weight to make the SAITS
    # model focus more on one task. Usually you can just leave them at the default values, i.e. 1.
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # here we set epochs=10 for a quick demo; you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to early-stop the training if the evaluating loss doesn't decrease for
    # 3 epochs. You can leave it at the default None to disable early stopping.
    patience=3,
    # give the optimizer. Different from torch.optim.Optimizer, you don't have to specify the model's
    # parameters when initializing pypots.optim.Optimizer. You can also leave it at the default;
    # it will initialize an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # this num_workers argument is for torch.utils.data.DataLoader. It's the number of subprocesses
    # to use for data loading. Leaving it at the default 0 means data loading runs in the main
    # process, i.e. there won't be subprocesses. You can increase it to >1 if you think data loading
    # is a bottleneck for your model-training speed.
    num_workers=1,
    # just leave it at the default; PyPOTS will automatically assign the best device for you.
    # Set it to 'cpu' if you don't have CUDA devices. You can also set it to 'cuda:0' or 'cuda:1'
    # if you have multiple CUDA devices.
    device='cuda',
    # set the path for saving TensorBoard and trained model files
    saving_path="C:/Users/e264642/WFD_Projects/IITB/IITB_Code/pots/saits",
    # only save the best model after training finishes.
    # You can also set it to "better" to save models performing better than ever during training.
    model_saving_strategy="best",
)
```

Training

```python
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)
```

I am getting this error:

```
2023-06-13 17:54:11 [INFO]: Model initialized successfully with the number of trainable parameters: 1,321,802
2023-06-13 17:54:26 [INFO]: epoch 0: training loss 0.3622, validating loss nan
2023-06-13 17:54:42 [INFO]: epoch 1: training loss 0.2156, validating loss nan
2023-06-13 17:54:59 [INFO]: epoch 2: training loss 0.1777, validating loss nan
2023-06-13 17:54:59 [INFO]: Exceeded the training patience. Terminating the training procedure...
Traceback (most recent call last):
  File "C:\Users\1234\AppData\Local\Temp\ipykernel_13552\1758113713.py", line 41, in <module>
    saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)
  File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\saits\model.py", line 420, in fit
    self._train_model(training_loader, val_loader)
  File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\base.py", line 352, in _train_model
    if np.equal(self.best_loss.item(), float("inf")):
AttributeError: 'float' object has no attribute 'item'
```

Please help!

WenjieDu commented 1 year ago

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].
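A quick way to run that check, assuming the dataset_for_validating dict from the code above:

```python
import numpy as np

# report whether either array the error message points at contains NaN
for name in ("X_intact", "indicating_mask"):
    arr = dataset_for_validating[name]
    print(name, "contains NaN:", bool(np.isnan(arr).any()))
```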

abhishekju06 commented 1 year ago

Should NaNs be present in my input dataset before I generate the dataset, i.e., can my attribute columns contain NaN values?

WenjieDu commented 1 year ago

Datasets with missing values are fine; of course, PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing parts in X_intact should be imputed with some value like 0, because PyPOTS uses them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.
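As a concrete sketch of how such a pairing can be generated, roughly following the PyPOTS quickstart of that period (the utility names pypots.data.mcar and pypots.data.masked_fill are taken from that quickstart and assumed here):

```python
import numpy as np
from pypots.data import mcar, masked_fill

# toy 3D input [n_samples, n_steps, n_features] with some originally-missing values
X = np.random.randn(32, 100, 3)
X[np.random.rand(*X.shape) < 0.05] = np.nan

# artificially hold out 10% of the observed values for validation
X_intact, X, missing_mask, indicating_mask = mcar(X, 0.1)
X = masked_fill(X, 1 - missing_mask, np.nan)  # restore NaNs in the model input

# X_intact and indicating_mask are NaN-free; they are only used for loss calculation
dataset_for_validating = {"X": X, "X_intact": X_intact, "indicating_mask": indicating_mask}
```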

abhishekju06 commented 1 year ago

> According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

So NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"] are necessary, right?

WenjieDu commented 1 year ago

> Datasets with missing values are fine; of course, PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing parts in X_intact should be imputed with some value like 0, because PyPOTS uses them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.

Sorry for missing a "not" in my last reply; I just fixed it: "after generation, indicating_mask and X_intact should not have NaNs".

abhishekju06 commented 1 year ago

> According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

I have replaced the NaN values with 0 in indicating_mask. I guess X_hat refers to X_intact, so I have set it as the value for the X_intact key:

```python
dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}
```

If not, please let me know what X_intact stands for.

In SAITS/dataset_generating_scripts/data_processing_utils.py, line 86 reads:

```python
X_hat[indices_for_holdout] = np.nan  # X_hat contains artificial missing values
```

So it is evident that X_hat must contain NaNs, as it represents the artificial missing values.

So where am I going wrong?
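Reading the two replies above together, one plausible fix (a guess, assuming group_val['X'] holds the series before artificial masking and group_val['X_hat'] the artificially-masked copy; this mapping is not confirmed in the thread) would be to swap the two keys and zero-fill the target:

```python
import numpy as np

# unconfirmed mapping: X_hat (with artificial NaNs) is the model input,
# while the pre-masking X, zero-filled, serves as the NaN-free X_intact target
dataset_for_validating = {
    "X": group_val['X_hat'][()],
    "X_intact": np.nan_to_num(group_val['X'][()]),
    "indicating_mask": group_val['indicating_mask'][()],
}
```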

WenjieDu commented 1 year ago

Please read the paper first https://arxiv.org/abs/2202.08516. Thanks.

abhishekju06 commented 1 year ago

Hi, please don't get me wrong. I have read the paper. I just want to make clear how the notations in the paper (X̂, M̂, X̃, and I) correspond to the names in the code.

Without your help, it is not possible for me to understand.

WenjieDu commented 1 year ago

Please read it carefully and take a look at the model's implementation code here for reference.

abhishekju06 commented 1 year ago

Thanks a ton!

WenjieDu commented 1 year ago

No problem. If you have further questions regarding the SAITS model, you're welcome to raise issues in the SAITS repo: https://github.com/WenjieDu/SAITS/issues.