SugiharaLab / pyEDM

Python package of EDM tools
Other
110 stars 27 forks source link

Interpretation on the prediction outputs? #53

Open zz100chan opened 1 week ago

zz100chan commented 1 week ago

Hello, thank you for making this interesting package! I am currently learning and testing out the package. While I understand the concept of the simplex/smap method, I couldn't get around the interpretation of the prediction from pred.

Since I intend to use the model for real time forecasting that could adjust prediction according to the every new data I received, I wonder if the prediction is done iteratively or sequentially here?

Taking this as example

Simplex( dataFrame = tentmap, lib = "1 500", pred = "501 510", 
         columns = "TentMap", target= "TentMap", E = 2, Tp=1, showPlot = True );

Time | Observations | Predictions | Pred_Variance
-- | -- | -- | --
501 | 0.90944 | NaN | NaN
502 | 0.18031 | 0.179541 | 1.692945e-07
503 | -0.52694 | -0.516232 | 3.860083e-05
504 | 0.49827 | 0.466669 | 4.504347e-04
505 | -0.99605 | -0.993870 | 5.458703e-13
506 | 0.88771 | 0.887230 | 3.393146e-11
507 | 0.22357 | 0.223970 | 1.717410e-06
508 | -1.13295 | -1.139083 | 3.450037e-04
509 | 0.89495 | 0.897663 | 1.261339e-05
510 | 0.20915 | 0.206739 | 4.697124e-05
511 | -0.93098 | -0.921785 | 1.837824e-05

Is the prediction of row 504 taking the observation from row 502-503 (0.18, -0.52) as the data (since E = 2?and so on for row 505 prediction (using row 503-504 observation (-0.52, 0.49))? or the row 504 prediction is based on the row 502-3 prediction (0.17, -0.51)?

And a follow-up on simplex/smap/multiview/CCM model, would that be possible to save up the model separately, so that we can predict without reconstructing the library every time? This could save up a lot of computational time imo

Thank you so much!

SoftwareLiteracy commented 1 week ago

Thank you for the question.

It seems your interpretation is from a traditional, time-based perspective. EDM methods are state space methods. The library (lib) defines observation rows that create the state space. Predictions are made by projecting nearest neighbors in the state space Tp time intervals ahead. You may wish to review some backgound material:

The EDM Framework

Embedding

EDM Algorithms

After review you may find the answer to the questions :
Is the prediction of row 504 taking the observation from row 502-503 (0.18, -0.52) as the data (since E = 2?and so on for row 505 prediction (using row 503-504 observation (-0.52, 0.49))? or the row 504 prediction is based on the row 502-3 prediction (0.17, -0.51)? is generally no. Although, if it happens that the cited data correspond to points in the state space which are nearest neighbors to the prediction point, then yes, the state vector of those points is used. EDM predictions are not based on neighbors in time, but neighbors in state.

Regarding your application: "I intend to use the model for real time forecasting that could adjust prediction according to the every new data", this can be done by adding your new data point to the time series/embedding and running the prediction. A new state space/neighbors will be computed.

Is it possible to save the library (state space) separately? The complete answer is no, not in the current package in the sense of reusing the state space/neighbors from a previous invocation.

However, a partial answer is yes since the embedded = True parameter presumes the dataFrame provides a valid embedding. However, the state space distances and neighbors are computed every time Simplex or SMap are called.

Further, a partial answer is yes since the Simplex class object which contains the computed knn_neighbors can be returned with the returnObject = True flag. However, the current code does not provide a way to reuse these neighbors as input. This is a good suggestion we will consider for package enhancement.

Here is a brief example of accessing the knn_neighbors

from pyEDM import *
s = Simplex( dataFrame = sampleData['Lorenz5D'], columns = 'V1', target = 'V5', 
             E = 5, lib = [1,500], pred = [601,610] )
type(s)
<class 'pandas.core.frame.DataFrame'>

s_ = Simplex( dataFrame = sampleData['Lorenz5D'], columns = 'V1', target = 'V5',
              E = 5, lib = [1,500], pred = [601,610], returnObject = True )
type(s_)
<class 'pyEDM.Simplex.Simplex'>
type(s_.knn_neighbors)
<class 'numpy.ndarray'>
s_.knn_neighbors.shape
(10, 6)
s_.knn_neighbors
array([[428, 297,  74, 134, 429, 298],
       [135, 429, 191,  75, 430, 298],
       [192, 136,  76, 474, 430, 237],
       [193, 137,  77, 475,  76,   4],
       [194, 138, 306,  78, 475,  77],
       [195,  79, 139, 477, 476,   6],
       [ 80, 196, 140, 312, 433, 184],
       [ 81, 197, 141, 313,  54, 114],
       [ 82, 142, 198, 115, 488,  55],
       [143, 199,  83, 287, 227, 489]])
zz100chan commented 1 week ago

Thank you so much for the prompt and detailed reply!

From my understanding, simplex constructs the library from the training data according to the Tp, Tau, E...etc. And once we have input data, we embed it into a state vector (in this case with E=2, tau=1, then [x(t), x(t-1)]) and search for its k nearest neighbors in the library. And with the future states ((t+1)) of these neighbors, we take the weighted average of them and use it as a projection for x(t+1).

My question of 'taking the observation from row 502-503' was meant to be, if the row 504 prediction is using a state vector [-0.52, 0.18] from row 502-503 to look for the nearest neighbors in the library, and so for row 505 prediction using state vector [-0.52, 0.49]. And from my understanding of your answer, I am correct in this interpretation?

SoftwareLiteracy commented 1 week ago

Yes, I think this expression is correct.

Just to be entirely pedantic, Tp is not used in creating a time delay embedding, only E and tau. In pyEDM time delays are denoted with negative tau.

Your question leads naturally to the concept of exclusion radius. If the dynamics have significant serial correlation then one may wish to exclude nearest neighbors that are serially correlated (linear dynamics). This is handled with the exclusionRadius parameter which ignores neighbors within a temporal exclusion radius.