SugiharaLab / rEDM

Applications of Empirical Dynamic Modeling from Time Series

Understand why a few manually calculated predictions differ from those calculated by rEDM #24

Closed (andrew-edwards closed this issue 6 years ago)

andrew-edwards commented 6 years ago

I have written some R code to perform the simplex calculations manually, in order to better understand the methods used in the rEDM package. My results exactly match those from rEDM for 92 out of the 99 predictions in my example data set. This issue is therefore about understanding why the remaining seven values differ (they do so for three different reasons) and confirming what rEDM is doing.

Maybe I am misunderstanding something, or rEDM makes some assumptions that I have not found documented. I have written up the details in the attached file rEDMissues.pdf and provided a minimal working example in mwe.r, which is copied below.

rEDMissues.pdf

Thanks for any insight. My manually calculated results for this simulated data set actually suggest that the EDM method is doing better (higher correlation coefficient rho) than rEDM implies.

File mwe.r is copied below (it seems that .r files cannot be attached to GitHub comments the way a .pdf can):

# mwe.r - minimal working example to demonstrate two of the discrepancies
#  between rEDM results and my R code. These are documented in the file
#  rEDMissues.pdf to be posted (with this file) as Issue #24 on the rEDM
#  GitHub site. Here I am including results from my R code to explain the
#  issues. This code is designed to be stepped through line-by-line in
#  conjunction with reading the comments and the .pdf. 
#  Andrew Edwards. 21st November 2017.

rm(list=ls())
require(rEDM)

# Xvec contains the values of X(t), the first differences of output from a
#  Ricker-like model: X(t) = N(t+1) - N(t), where N(t) are population numbers.
#  X is not standardised (which is okay since this is a univariate analysis).
Xvec = c(-0.056531409251883,0.059223778257432,5.24124928046977,
    -4.85399581474521,-0.46134818068973,0.273317575696793,0.801806230470337,
    -0.888891901824982,-0.202777622745051,0.497565422757662,5.10219324323769,
    -5.36826459373178,-0.17467165498718,1.06545333399298,1.97419279178678,
    -2.91448405223082,-0.179969867858605,0.237962494942611,1.47828327622468,
    -1.54267507064286,-0.180342027136338,0.238919610831881,1.06140368490958,
    -1.06522901782019,-0.214923527940395,0.452847221462308,2.13053391555372,
    -2.55145224744286,-0.0307653352327702,1.1448014288826,-0.0675575239486375,
    -1.04711881585576,-0.00910890042051652,0.726257323277433,0.732271192186161,
    -1.35460378982395,-0.0322955446760023,0.507606440290776,3.73396587274012,
    -4.19686615950143,-0.0997201857962038,0.753392632401029,2.41347231553437,
    -3.03677401452137,-0.141112562089696,0.446002103079665,0.223768504955365,
    -0.615452831633047,-0.0216659723974975,0.292246351104258,0.20006105300258,
    -0.469596514211075,0.0422676544887819,0.474264989176278,-0.0416811459395667,
    -0.53555712696719,0.118860281628173,0.176335117268894,-0.10364820567334,
    -0.153572235117542,0.180339482186409,0.0566876206447625,-0.140537892644139,
    0.0252441742388871,0.340689505466622,0.852833653689839,-1.07051231019616,
    -0.0937704380137284,0.460677118593916,0.37444382348273,-0.83783628206217,
    -0.0154896108244113,1.34259279914848,-0.495978821807168,-0.472464634960208,
    -0.415481769949074,1.36767605087962,-0.891896943918948,-0.279228283931612,
    -0.148703043863421,2.04524590138255,-1.98431486665908,0.0602356391036573,
    -0.0902730939678147,0.243344379963862,-0.074421904114315,-0.309150440565139,
    0.43675531763949,0.178787692802827,0.0799271040758849,-0.657946157906476,
    1.14668210755046,-0.791665479471326,0.482533897248175,-0.798737571552661,
    0.439024256063545,0.177114631209318,2.19942374686687,-2.9488856529422)

n = length(Xvec)
n

simp.Efix = simplex(Xvec, E = 2, stats_only = FALSE)
rEDM.points = simp.Efix[[1]]$model_output       # time, obs, pred and pred_var
# Make each row correspond to t:
rEDM.points = rbind(c(1, Xvec[1], NA, NA), rEDM.points)
head(rEDM.points)
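
# (If I understand the rEDM defaults correctly, simplex() with E = 2 uses
#  num_neighbors = E + 1 = 3 nearest neighbours and leave-one-out
#  cross-validation over the whole series, which is what the manual
#  calculations below assume.)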

# Issue (i): rEDM seems to use x(t^*+2) as a nearest neighbour. Here I use the
#  Deyle et al. (2013) notation in which x(t) is the vector of lagged values
#  of X, i.e. x(t) = (X(t), X(t-1)).

# For t^*=94, my R code calculates nearest neighbours with indices psivec
#  (I am using psi instead of the t with a line through in Deyle et al. 2013).
psivec94 = c(6, 57, 88)
# and corresponding weights,
weights94 = c(0.3678794, 0.3205861, 0.2895013)
# giving the estimate of X(95), from [S1] of Deyle et al. (2013) as
X95est = sum(weights94 * Xvec[psivec94+1]) / sum(weights94)
X95est
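
# As a cross-check (a sketch, using the same Euclidean distance and
#  exp(-d/d_1) weighting as for t^*=64 further below), these weights can be
#  recomputed directly from the distances between the lagged vectors
#  x(psi) = (X(psi), X(psi-1)) and x(94), i.e.
#  w_i = exp( -d(x(94), x(psi_i)) / d(x(94), x(psi_1)) ):
dist94 = sqrt((Xvec[psivec94] - Xvec[94])^2 + (Xvec[psivec94 - 1] - Xvec[93])^2)
weights94.check = exp(- dist94 / dist94[1])
weights94.check      # should approximately reproduce weights94 above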

# However, rEDM gives
X95est.rEDM = rEDM.points[95, "pred"]
X95est.rEDM

# But I can reproduce the rEDM result by allowing x(96) to be a nearest neighbour
#  of x(94),  giving
psivec94.allow = c(96, 6, 57)
  # so 6 and 57 are now 2nd and 3rd nearest neighbours

weights94.allow = c(3.678794e-01, 1.405278e-04, 4.146457e-05)
# Note that the first weight is the same as above (by definition it's always
#  exp(-1)), but the second and third are very small because x[96] is
#  very close to x[94].

X95est.allow = sum(weights94.allow * Xvec[psivec94.allow+1]) /
    sum(weights94.allow)
X95est.allow
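
# Side by side (a sketch of the comparison), this reproduced estimate and the
#  rEDM prediction should agree to within the rounding of the hard-coded
#  weights above:
c(allow = X95est.allow, rEDM = X95est.rEDM)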

# However, surely we should not be allowed to use x(96) because, by definition,
#  it contains X(95), which is the value we are trying to estimate from x(94).
#  X(95) is presumably the 'one' that we are leaving out in 'leave-one-out'.

# I get the same issue with t^*=75.

# Issue (ii): for t^*=64, rEDM seems to use only the single nearest neighbour,
#  estimating X(65) to be X(3):
X65est.rEDM = rEDM.points[65, "pred"]
X65est.rEDM
Xvec[3]
# So these are the same, and X(3) is the largest value of X in the time series
#  (I am not sure if that is important):
which.max(Xvec)

# Thus rEDM has essentially calculated weights of (w_1, w_2, w_3) = (1, 0, 0).
# However, I calculate weights of (0.3679, 0.1795, 0.1333):
psivec64 = c(2, 61, 60)
# With a vector of distances from x[64] of
dist64 = c(sqrt((Xvec[psivec64[1]] - Xvec[64])^2 + (Xvec[psivec64[1]-1] -
                                                             Xvec[64-1])^2),
           sqrt((Xvec[psivec64[2]] - Xvec[64])^2 + (Xvec[psivec64[2]-1] -
                                                             Xvec[64-1])^2),
           sqrt((Xvec[psivec64[3]] - Xvec[64])^2 + (Xvec[psivec64[3]-1] -
                                                             Xvec[64-1])^2))
# And corresponding weights
weights64 = exp(- dist64 / dist64[1])
# weights64 = c(0.3678794, 0.1795047, 0.1333414)  from my longer R knitr code
weights64

# My resulting estimate of X65 is:
X65est = sum(weights64 * Xvec[psivec64+1]) / sum(weights64)
X65est
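
# Side by side (a sketch): my three-neighbour estimate versus the rEDM
#  prediction, which here equals X(3), i.e. just the single nearest neighbour:
c(manual = X65est, rEDM = X65est.rEDM, X3 = Xvec[3])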

# So I do not understand why the weights are different. The nearest neighbour
#  is not so close to x(64) that the second and third neighbours should be
#  relatively far away; that is issue (iii).

# I think issue (iii) may get resolved with further understanding of
#  issues (i) and (ii).
ha0ye commented 6 years ago

Issue 1 is expected behavior: the leave-one-out cross-validation is done on the reconstructed vectors, so only the target vector itself is left out, and vectors such as x(t^*+2) that contain the value being predicted remain eligible neighbors. Thus, if you create the lagged block manually and use block_lnlp, the output will be the same.
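
A minimal sketch of that approach (the column names and exact arguments here are illustrative, and I am assuming the output has the same structure as the simplex output above):

lag.block = data.frame(X_t   = Xvec,
                       X_tm1 = c(NA, Xvec[-length(Xvec)]))   # X(t) and its lag X(t-1)
block.out = block_lnlp(lag.block, method = "simplex", tp = 1,
                       columns = c("X_t", "X_tm1"), target_column = "X_t",
                       stats_only = FALSE)
block.pred = block.out[[1]]$model_output   # should match simp.Efix[[1]]$model_output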

To prevent a value from being used in any neighboring vectors, consider using the exclusion_radius argument, which disallows neighbors within a certain time range centered at the target point.
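
For example, a minimal sketch (the radius value is my choice, picked so that for E = 2 and tp = 1 no neighboring vector contains the value being predicted):

simp.excl = simplex(Xvec, E = 2, stats_only = FALSE, exclusion_radius = 2)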

Issue 2 is a bug in selecting nearest neighbors, and should now be fixed.
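
To pick up that fix, one option (a sketch, assuming devtools is installed) is to reinstall the package from this repository:

devtools::install_github("SugiharaLab/rEDM")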