function for approx NA by linear interpolation by row

earthcli commented 4 years ago

I am finding a fast enough function such as zoo::na.approx, see here I want to do a simple linear interpolation for NA values of each row of a big matrix (52017455 x 150), generally I can do this by using the na.approx function in zoo package with apply function, however, this is too slow. I wonder whether the Rfast package is helpful for me, or could you add a function such as rowapprox in the new version.

x=rnorm(120)
x[c(3,8,16,22)]=NA
data=data.frame(t(x))
data=data[rep(1,52017455),]
result=apply(data,1,zoo::na.approx)

statlink commented 4 years ago

Hi earthcli, this is Michail. I saw the function in "zoo". The linear interpolation uses the "approx" function from the package "stats". But, I did not understand what it does. Do you know how the algortihm work? If you give me a hand then I can give you hand.

earthcli commented 4 years ago

@statlink The linear interpolation for NA approx is very simple, see here http://www.cplusplus.com/forum/general/168623/ or here https://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php. I have wrote a simple Rcpp function as below, however, it is better to work with both matrix and data.frame.

// [[Rcpp::export]]
double cpplinterp_( NumericVector x, NumericVector y, double xout, const int maxgap=99999, const bool locf=false){
  // Linear approximation of a vector-function
  // x, y   vectors giving the coordinates of the points to be interpolated. 
  // x is assumed to be strictly monotonic.
  // xout   points at which to interpolate.
  int n_x=x.size();
  if( xout < x[0] ){
    if(locf){
      return y[0];
    }else{
      return NA_REAL;
    }
  }
  if( xout >= x[n_x-1] ){
    if(locf){
      return y[n_x-1];
    }else if(xout>x[n_x-1]){
      return NA_REAL;
    }else{
      return y[n_x-1];
    }    
  }

  int idx = 0 ;// Find the location of xout in x
  while ( x[idx] <= xout && idx < n_x ){
    idx++ ; 
  } 
  // Find the nonNA location of y around xout within maxgap
  int leftjust=1;
  int rightjust=0;
  while (NumericVector::is_na(y[idx-leftjust]) && idx-leftjust > 0 ){
    leftjust++; 
  }
  while (NumericVector::is_na(y[idx+rightjust]) && idx+rightjust < n_x ){
    rightjust++ ; 
  }
  if(x[idx+rightjust] - x[idx-leftjust] -1<=maxgap){
    double theta = ( xout - x[idx-leftjust] ) / ( x[idx+rightjust] - x[idx-leftjust] ) ; 
    return theta * y[idx+rightjust] + ( 1 - theta ) * y[idx-leftjust] ;  
  }else{
    return NA_REAL;
  }
}

// [[Rcpp::export]]
NumericVector cpplinterp( NumericVector x, NumericVector y, NumericVector xout, const int maxgap=99999, const bool locf=false){
  NumericVector yout=clone(xout);
  for(int i=0;i<yout.size();i++){
    yout[i]=cpplinterp_(x,y,xout[i],maxgap,false);
  }
  if(locf){
    for(int i=0;i<yout.size();i++){
      if(NumericVector::is_na(yout[i])){
        int idx=1;
        while (i+idx< yout.size()-1 && NumericVector::is_na(yout[i+idx]) && x[i+idx]-x[i] < maxgap){
          idx++ ; 
        }
        yout[i]=yout[i+idx];       
      }
    }     
    for(int i=yout.size()-1;i>=0;i--){
      if(NumericVector::is_na(yout[i])){
        int idx=1;
        while (i-idx>0 && NumericVector::is_na(yout[i-idx]) && x[i]-x[i-idx] < maxgap){
          idx++ ; 
        }
        yout[i]=yout[i-idx];       
      }
    } 
  }
  return yout;
}

// [[Rcpp::export]]
NumericVector cppmatrixapprox(NumericMatrix x, const int maxgap=99999, const bool locf=false, Rcpp::Nullable<Rcpp::NumericVector> idx=R_NilValue){
  // x is generally the indice of columns, assumed to be strictly monotonic.
  // idx    points at which to interpolate.  
  NumericMatrix yout=clone(x);
  NumericVector xout;
  if(idx.isNull()){
    xout=seq(1,x.ncol());
  }else{
    xout=idx;
  }
  for(int nr=0;nr<yout.nrow();nr++){
    if(locf){
      NumericVector rowx(x.ncol());
      rowx=cpplinterp(xout,yout(nr,_),xout,maxgap,locf);
      for(int nc=0;nc<yout.ncol();nc++){
        if(NumericVector::is_na(yout(nr,nc))){
          yout(nr,nc)=rowx[nc];
        }
      }
    }else{
      for(int nc=0;nc<yout.ncol();nc++){
        if(NumericVector::is_na(yout(nr,nc))){
          yout(nr,nc)=cpplinterp_(xout,yout(nr,_),xout[nc],maxgap,locf);
        }
      }    
    }
  }
  return yout;
}

earthcli commented 4 years ago

or here http://paulbourke.net/miscellaneous/interpolation/

statlink commented 4 years ago

Hi earthcli. You will have to give me some time. I will see what I can do in R, but bear in mind that I am very busy this month the maintainer is not currently available. In any case, we will try.

statlink commented 4 years ago

Hi earthcli, I cannot understand what is going on. Send me an e-mail at mtsagris@yahoo.gr to discuss this further there. Give me an example of your data, of a vector how they look like to make me understand your situation as right now I can only guess how your data look like and my guesses are obviously wrong.

earthcli commented 4 years ago

Hi , Thanks for your consideration. My data is as below. Supposing the raw vector x is : 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 Generally, the first NA is approx by the left (-0.3) and right (1.3), the second NA is estimated approximately by -1.5 and -0.3, and I want to do this operation in a matrix or data.frame by row. for the first NA: Yout=Y1+(Y2-Y1)×(Xout-X1)/(X2-X1) 0.5=-0.3+(1.3--0.3)*(3-2)/(4-2)

The xout or idx parameter in the cpplinterp or cppmatrixapprox is the coordinate of x location, some times it may be not consecutive value, depending on the users. Some times there may be two consecutive NA, the algorithm has to find the non-NA value around the the NA that need to be estimated. For a vector, the x coordinate (xout) parameter generally is the index of x: seq_along (x) For the matrix, the x coordinate (xout/idx) parameter generally is the index of columns: seq_len(ncol(matrix)), but it may be defined by the user.

set.seed(0)
x=round(rnorm(8),1)
x[c(3,7)]=NA
zoo::na.approx(x)
cpplinterp(x=seq_along(x),y=x,xout=seq_along(x))

matrix=t(x)
matrix=matrix[rep(1,3),]
t(apply(matrix,1,zoo::na.approx))
cppmatrixapprox(matrix,idx=seq_len(ncol(matrix)))

set.seed(0) x=round(rnorm(8),1) x[c(3,7)]=NA zoo::na.approx(x) [1] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 cpplinterp(x=seq_along(x),y=x,xout=seq_along(x)) [1] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3

matrix=t(x) matrix=matrix[rep(1,3),] matrix [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 [2,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 [3,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3

t(apply(matrix,1,zoo::na.approx)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [2,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [3,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 cppmatrixapprox(matrix,idx=seq_len(ncol(matrix))) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [2,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [3,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3

RfastOfficial / Rfast

function for approx NA by linear interpolation by row #12