MaxBenChrist / tspreprocess

A Python package to preprocess time series
MIT License

Lexicographical sort of column "time" after compression #7

Open nikhase opened 7 years ago

nikhase commented 7 years ago

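After compression, the "time" column holds bin labels encoded as strings such as bin_0.0. Because these are strings, they sort lexicographically (bin_1.0, bin_10.0, bin_100.0, ...) rather than numerically, which makes it hard to sort by the column and to make plots. What about renaming "time" to "bin" and providing numeric bin numbers?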

In general, one would like to pass the dataframe on to tsfresh, so the "time" column should sort in the actual bin order.

id feature_agg_autocorrelation_fagg"mean" feature_agg_autocorrelation_fagg"median" feature_agg_autocorrelation_fagg"var" time
0 -0.006695 -0.031946 0.031041 bin_0.0
0 0.003307 0.002723 0.015377 bin_1.0
0 -0.019875 -0.020356 0.016519 bin_10.0
0 -0.010753 -0.026369 0.021735 bin_100.0
0 0.011816 0.019509 0.010336 bin_101.0
0 -0.012836 -0.012418 0.038740 bin_102.0
0 -0.013034 -0.008422 0.008983 bin_103.0
0 -0.015615 -0.015442 0.022139 bin_104.0
0 -0.011075 0.006340 0.018839 bin_105.0
0 -0.012528 -0.002204 0.014608 bin_106.0
0 0.003264 -0.012552 0.012001 bin_107.0
0 -0.008267 -0.013056 0.031777 bin_108.0
0 -0.014031 -0.026050 0.011954 bin_109.0
0 -0.027372 -0.028189 0.012125 bin_11.0
0 -0.006538 -0.016846 0.020991 bin_110.0
0 0.028912 -0.002320 0.018458 bin_111.0
0 -0.011757 -0.021368 0.040606 bin_112.0
0 -0.014773 -0.022101 0.013958 bin_113.0
0 -0.010944 -0.001797 0.028481 bin_114.0
0 -0.016143 -0.028406 0.007117 bin_115.0
0 -0.013865 -0.021711 0.011233 bin_116.0
0 -0.009488 0.007354 0.008971 bin_117.0
0 -0.014187 -0.017223 0.044131 bin_118.0
0 -0.013005 -0.005250 0.011614 bin_119.0
0 -0.011601 0.010453 0.016970 bin_12.0
0 -0.012738 -0.004333 0.012729 bin_120.0
0 -0.013266 -0.016564 0.007020 bin_121.0
0 -0.015038 -0.042097 0.024701 bin_122.0
0 -0.012776 -0.004399 0.016492 bin_123.0
0 -0.012934 -0.018298 0.017719 bin_124.0
... ... ... ... ...
9 -0.017292 -0.010434 0.007727 bin_72.0
9 -0.009239 0.000410 0.007263 bin_73.0
9 -0.050343 -0.035553 0.016307 bin_74.0
9 -0.016550 -0.019668 0.007808 bin_75.0
9 -0.015879 -0.034310 0.014253 bin_76.0
9 -0.019754 -0.037949 0.018174 bin_77.0
9 -0.016839 -0.005070 0.016695 bin_78.0
9 -0.015295 -0.005584 0.012654 bin_79.0
9 -0.015647 -0.016262 0.008907 bin_8.0
9 -0.010676 -0.014450 0.010222 bin_80.0
9 -0.003566 0.010439 0.009648 bin_81.0
9 0.008290 0.015121 0.009266 bin_82.0
9 -0.004448 -0.014874 0.007668 bin_83.0
9 -0.012481 -0.017615 0.012226 bin_84.0
9 -0.018334 -0.007268 0.009883 bin_85.0
9 -0.017429 -0.029421 0.009856 bin_86.0
9 -0.000159 0.010534 0.008968 bin_87.0
9 -0.003924 -0.022100 0.018910 bin_88.0
9 0.008415 0.019052 0.020014 bin_89.0
9 -0.012393 -0.000086 0.010260 bin_9.0
9 0.006285 0.020495 0.012573 bin_90.0
9 -0.010193 -0.008106 0.008721 bin_91.0
9 -0.016792 -0.009178 0.012188 bin_92.0
9 0.008476 0.020195 0.010278 bin_93.0
9 0.005893 0.007117 0.008789 bin_94.0
9 -0.008254 -0.010829 0.017784 bin_95.0
9 0.004660 0.014164 0.009694 bin_96.0
9 0.011764 -0.004501 0.010030 bin_97.0
9 -0.017136 -0.026493 0.011077 bin_98.0
9 0.013644 0.033041 0.008518 bin_99.0
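
The ordering seen above can be reproduced with plain Python strings. This is only a minimal sketch: the label list is a small hand-picked sample in the same format as the "time" column, and `bin_number` is a hypothetical helper, not part of tspreprocess.

```python
# Sample labels in the same format as the "time" column above.
labels = ["bin_0.0", "bin_1.0", "bin_2.0", "bin_10.0", "bin_11.0", "bin_100.0"]


def bin_number(label: str) -> int:
    """Extract the integer bin index from a label such as 'bin_10.0'.

    Hypothetical helper for illustration; assumes the 'bin_<float>' format.
    """
    return int(float(label.split("_", 1)[1]))


# Default string sorting compares character by character.
lexicographic = sorted(labels)

# Sorting by the parsed number restores the intended bin order.
numeric = sorted(labels, key=bin_number)

print(lexicographic)
print(numeric)
```

String comparison places bin_10.0 and bin_100.0 before bin_2.0, which matches the order in the output above.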
nikhase commented 7 years ago

With "time" renamed to "bin" and numeric values in that column, the dataframe can be passed to tsfresh:

extract_features(compressed_df, column_id="id", column_sort="bin")
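
Until the package does this itself, the conversion could be done by hand with pandas before calling tsfresh. The following is a rough sketch under assumptions: the toy dataframe and its "feature" column are made up for illustration, and the labels are assumed to always follow the bin_<number> pattern shown earlier.

```python
import pandas as pd

# Toy stand-in for the compressed dataframe (values are made up).
compressed_df = pd.DataFrame({
    "id": [0, 0, 0, 0],
    "feature": [0.1, 0.2, 0.3, 0.4],
    "time": ["bin_0.0", "bin_1.0", "bin_10.0", "bin_2.0"],
})

# Rename "time" to "bin"; the "id" column is left untouched.
compressed_df = compressed_df.rename(columns={"time": "bin"})

# Strip the "bin_" prefix and convert "10.0" -> 10.0 -> 10.
compressed_df["bin"] = (
    compressed_df["bin"]
    .str.replace("bin_", "", regex=False)
    .astype(float)
    .astype(int)
)

# With numeric bins, sorting gives the intended order within each id.
compressed_df = compressed_df.sort_values(["id", "bin"]).reset_index(drop=True)
print(compressed_df)
```

After this step, column_sort="bin" in the call above would sort on integers rather than on strings.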
MaxBenChrist commented 7 years ago

I am fine with changing the naming of the bins if we also change the name of the id column to bin column afterwards.

nikhase commented 7 years ago

Tiny correction: The id column stays the same, "time" is changed to "bin".