WenjieDu / PyPOTS

A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series, including SOTA neural network models for scientific analysis tasks of imputation/classification/clustering/forecasting/anomaly detection/cleaning on incomplete industrial (irregularly-sampled) multivariate TS with NaN missing values
https://pypots.com
BSD 3-Clause "New" or "Revised" License

Tutorial is not working #431

Closed roijalbaker closed 1 month ago

roijalbaker commented 2 months ago

1. System Info

Using the Google Colab space

2. Information

3. Reproduction

Run this script in Google Colab:

2024-06-10 10:05:03 [INFO]: Have set the random seed as 2204 for numpy and pytorch.
2024-06-10 10:05:03 [INFO]: Loading the dataset physionet_2012 with TSDB (https://github.com/WenjieDu/Time_Series_Data_Beans)...
2024-06-10 10:05:03 [INFO]: Starting preprocessing physionet_2012...
2024-06-10 10:05:03 [INFO]: You're using dataset physionet_2012, please cite it properly in your work. You can find its reference information at the below link: 
https://github.com/WenjieDu/TSDB/tree/main/dataset_profiles/physionet_2012
2024-06-10 10:05:03 [INFO]: Start downloading...
2024-06-10 10:05:14 [INFO]: Successfully downloaded data to /tmp/tmpopu79y31/set-a.tar.gz
2024-06-10 10:05:14 [INFO]: Successfully extracted data to /root/.tsdb/physionet_2012
2024-06-10 10:05:25 [INFO]: Successfully downloaded data to /tmp/tmpkuq3h2j0/set-b.tar.gz
2024-06-10 10:05:26 [INFO]: Successfully extracted data to /root/.tsdb/physionet_2012
2024-06-10 10:05:36 [INFO]: Successfully downloaded data to /tmp/tmpp87bgnsx/set-c.tar.gz
2024-06-10 10:05:37 [INFO]: Successfully extracted data to /root/.tsdb/physionet_2012
2024-06-10 10:05:37 [INFO]: Successfully downloaded data to /root/.tsdb/physionet_2012/Outcomes-a.txt
2024-06-10 10:05:38 [INFO]: Successfully downloaded data to /root/.tsdb/physionet_2012/Outcomes-b.txt
2024-06-10 10:05:38 [INFO]: Successfully downloaded data to /root/.tsdb/physionet_2012/Outcomes-c.txt
2024-06-10 10:05:43 [WARNING]: Ignore 140501, because its len==1, having no time series data
2024-06-10 10:05:48 [WARNING]: Ignore 140936, because its len==1, having no time series data
2024-06-10 10:06:03 [WARNING]: Ignore 141264, because its len==1, having no time series data
2024-06-10 10:06:19 [WARNING]: Ignore 142998, because its len==1, having no time series data
2024-06-10 10:06:19 [WARNING]: Ignore 147514, because its len==1, having no time series data
2024-06-10 10:06:22 [WARNING]: Ignore 150649, because its len==1, having no time series data
2024-06-10 10:06:28 [WARNING]: Ignore 142731, because its len==1, having no time series data
2024-06-10 10:06:28 [WARNING]: Ignore 150309, because its len==1, having no time series data
2024-06-10 10:06:36 [WARNING]: Ignore 145611, because its len==1, having no time series data
2024-06-10 10:06:43 [WARNING]: Ignore 143656, because its len==1, having no time series data
2024-06-10 10:06:52 [WARNING]: Ignore 156254, because its len==1, having no time series data
2024-06-10 10:06:57 [WARNING]: Ignore 155655, because its len==1, having no time series data
2024-06-10 10:07:28 [INFO]: Successfully saved to /root/.tsdb/physionet_2012/physionet_2012_cache.pkl
2024-06-10 10:07:28 [INFO]: Loaded successfully!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-013a7940bff8> in <cell line: 7>()
      5 
      6 # Load the PhysioNet-2012 dataset
----> 7 physionet2012_dataset = gene_physionet2012(artificially_missing_rate=0.1)
      8 
      9 # Take a look at the generated PhysioNet-2012 dataset, you'll find that everything has been prepared for you,

2 frames
/usr/local/lib/python3.10/dist-packages/pypots/data/load_preprocessing.py in preprocess_physionet2012(data)
     28     data["static_features"].remove("ICUType")  # keep ICUType for now
     29     # remove the other static features, e.g. age, gender
---> 30     X = data["X"].drop(data["static_features"], axis=1)
     31 
     32     def apply_func(df_temp):  # pad and truncate to set the max length of samples as 48

KeyError: 'X'

4. Expected behavior

I expect no error in the second cell of the notebook

WenjieDu commented 2 months ago

@roijalbaker This error is caused by a recent TSDB update. To make things work, you should install tsdb==0.3.1, rather than v0.4.

V50-ikun commented 2 months ago

@WenjieDu I ran into the same problem. I installed tsdb==0.3.1 and it still raises the error.

V50-ikun commented 2 months ago

@WenjieDu I printed the keys, as shown in the screenshot: Snipaste_2024-06-10_22-08-55

WenjieDu commented 2 months ago

> @WenjieDu I ran into the same problem. I installed tsdb==0.3.1 and it still raises the error.

@V50-ikun After downgrading from v0.4 to v0.3.1, please remove all files with the .pkl suffix under ~/.tsdb. You can run the command rm -rf ~/.tsdb/*/*.pkl, or simply delete the whole ~/.tsdb directory.
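
For anyone who prefers to do the cleanup from a notebook cell, here is a minimal Python sketch of the same idea as the rm command above. The function name remove_tsdb_pickle_caches is made up for illustration and is not part of TSDB or PyPOTS; the ~/.tsdb location is taken from the log earlier in this thread.

```python
from pathlib import Path


def remove_tsdb_pickle_caches(tsdb_root: Path) -> list[Path]:
    """Delete every *.pkl file exactly one directory level below tsdb_root.

    Hypothetical helper mirroring `rm -rf ~/.tsdb/*/*.pkl`.
    Returns the paths that were removed.
    """
    removed = []
    for cache_file in sorted(tsdb_root.glob("*/*.pkl")):
        if cache_file.is_file():
            cache_file.unlink()
            removed.append(cache_file)
    return removed


# Example call (assumes the cache lives in ~/.tsdb, as in the log above):
# remove_tsdb_pickle_caches(Path.home() / ".tsdb")
```

Only the pickled caches are deleted, so the downloaded raw files stay in place and the next load should just rebuild the cache with the installed TSDB version.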

V50-ikun commented 2 months ago

@WenjieDu The problem is solved, thank you very much!

Gabriel-Lucena96 commented 2 months ago

Sadly, I've installed v0.3.1 and there isn't any '.pkl' file in tsdb. Nevertheless, the same issue is occurring.

WenjieDu commented 2 months ago

@Gabriel-Lucena96 It works fine. Take a look at the screenshot below.

[screenshot]

Gabriel-Lucena96 commented 2 months ago

@WenjieDu, thank you. It works just fine! One more thing I'd like to know... what's the difference between 'test_X', 'test_X_ori', and 'test_X_indicating_mask' (a boolean matrix)?

WenjieDu commented 2 months ago

@Gabriel-Lucena96 No problem. If PyPOTS is useful to you, please star 🌟 our repositories to help more people notice our work. You can also follow me on GitHub to receive the latest news from PyPOTS in the future.

X_ori stands for "X original": the original X, including whatever missing values it originally had. X is X_ori with additional, artificially introduced missing values. X is the model input, while X_ori is used for error calculation and model validation. indicating_mask marks the difference between X and X_ori, i.e. which values were artificially masked out.
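
To make the relationship concrete, here is a small NumPy sketch on toy data. It is only an illustration of the description above, not the code PyPOTS uses: the arrays, the zero-filled "imputation", and the masked_mae helper are all made up for this example.

```python
import numpy as np

# Toy "original" data: NaN marks values that were never recorded.
X_ori = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan]])

# Model input: X_ori plus two artificially hidden values.
X = X_ori.copy()
X[0, 2] = np.nan
X[1, 0] = np.nan

# True only where a value is observed in X_ori but missing in X.
indicating_mask = ~np.isnan(X_ori) & np.isnan(X)

# Stand-in for a model's output: missing entries of X filled with 0.
imputation = np.nan_to_num(X, nan=0.0)


def masked_mae(pred, target, mask):
    """Mean absolute error over the artificially masked positions only."""
    return np.abs(pred - target)[mask].mean()


print(indicating_mask.sum())                            # 2
print(masked_mae(imputation, X_ori, indicating_mask))   # 3.5
```

The error is evaluated only where the ground truth is known but was hidden from the model, which is why both X_ori and the mask are needed alongside X.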

Gabriel-Lucena96 commented 2 months ago

Already gave a 🌟 and followed :) Sorry for the insistence, I'm just trying to understand the data... But specifically about physionet2012, the original data has no missing values, correct? I checked:

print(np.isnan(data['test_X_ori']).any()) # False

If I understood correctly, the mask should only be 'True' where a missing value was added (the difference between original and test). Nevertheless, I've checked

bool_ori = np.isnan(data['test_X_ori'])
bool_test = np.isnan(data['test_X'])

MASK = ~(bool_ori == bool_test)  # Only True when added missing data
equal = (data['test_X_indicating_mask'] == MASK)  # Should be all True
print(equal.all()) # False

and it didn't match up. Am I getting the wrong idea?

Oh, and have you tested your algorithms on univariate time series?

WenjieDu commented 2 months ago

@Gabriel-Lucena96 The original physionet2012 data has ~80% missing values. In the old version of gene_physionet2012(), we generate the indicating mask and then fill X_ori with 0, hence there is no problem with your test. You can refer to the metric functions we use in pypots.utils.metrics.
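
One possible reading of this, sketched on toy data: if the mask is built first and X_ori is zero-filled afterwards, then X_ori contains no NaN, and comparing the NaN patterns of X_ori and X can no longer recover the mask. The arrays below are invented for illustration, and the order of operations is an assumption based only on the comment above, not on the library source.

```python
import numpy as np

# Raw data with one originally missing value per row.
raw = np.array([[1.0, np.nan, 3.0],
                [np.nan, 5.0, 6.0]])

# Step 1 (assumed order): hide one observed value and record the mask.
X = raw.copy()
X[0, 0] = np.nan
indicating_mask = ~np.isnan(raw) & np.isnan(X)

# Step 2 (assumed order): fill the original missing values with 0.
X_ori = np.nan_to_num(raw, nan=0.0)

print(np.isnan(X_ori).any())  # False

# Recomputing the mask from NaN patterns now flags every NaN in X,
# including the originally missing values, so it no longer matches.
recomputed = np.isnan(X_ori) != np.isnan(X)
print((recomputed == indicating_mask).all())  # False
```

Under this assumption both observations in the question are expected: the zero fill hides the original missingness in X_ori, and the stored indicating mask is the only reliable record of which values were removed artificially.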