RfastOfficial / Rfast

A collection of Rfast functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not have matrices or vectors with have missing data (i.e NAs). We do no check about them and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input, in order to get the correct output. We do no checks and this is one of the many reasons we are fast.
143 stars 19 forks source link

function for approx NA by linear interpolation by row #12

Open earthcli opened 4 years ago

earthcli commented 4 years ago

I am finding a fast enough function such as zoo::na.approx, see here I want to do a simple linear interpolation for NA values of each row of a big matrix (52017455 x 150), generally I can do this by using the na.approx function in zoo package with apply function, however, this is too slow. I wonder whether the Rfast package is helpful for me, or could you add a function such as rowapprox in the new version.

x=rnorm(120)
x[c(3,8,16,22)]=NA
data=data.frame(t(x))
data=data[rep(1,52017455),]
result=apply(data,1,zoo::na.approx)
statlink commented 4 years ago

Hi earthcli, this is Michail. I saw the function in "zoo". The linear interpolation uses the "approx" function from the package "stats". But, I did not understand what it does. Do you know how the algortihm work? If you give me a hand then I can give you hand.

earthcli commented 4 years ago

@statlink The linear interpolation for NA approx is very simple, see here http://www.cplusplus.com/forum/general/168623/ or here https://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php. I have wrote a simple Rcpp function as below, however, it is better to work with both matrix and data.frame.

// [[Rcpp::export]]
double cpplinterp_( NumericVector x, NumericVector y, double xout, const int maxgap=99999, const bool locf=false){
  // Linear approximation of a vector-function
  // x, y   vectors giving the coordinates of the points to be interpolated. 
  // x is assumed to be strictly monotonic.
  // xout   points at which to interpolate.
  int n_x=x.size();
  if( xout < x[0] ){
    if(locf){
      return y[0];
    }else{
      return NA_REAL;
    }
  }
  if( xout >= x[n_x-1] ){
    if(locf){
      return y[n_x-1];
    }else if(xout>x[n_x-1]){
      return NA_REAL;
    }else{
      return y[n_x-1];
    }    
  }

  int idx = 0 ;// Find the location of xout in x
  while ( x[idx] <= xout && idx < n_x ){
    idx++ ; 
  } 
  // Find the nonNA location of y around xout within maxgap
  int leftjust=1;
  int rightjust=0;
  while (NumericVector::is_na(y[idx-leftjust]) && idx-leftjust > 0 ){
    leftjust++; 
  }
  while (NumericVector::is_na(y[idx+rightjust]) && idx+rightjust < n_x ){
    rightjust++ ; 
  }
  if(x[idx+rightjust] - x[idx-leftjust] -1<=maxgap){
    double theta = ( xout - x[idx-leftjust] ) / ( x[idx+rightjust] - x[idx-leftjust] ) ; 
    return theta * y[idx+rightjust] + ( 1 - theta ) * y[idx-leftjust] ;  
  }else{
    return NA_REAL;
  }
}

// [[Rcpp::export]]
NumericVector cpplinterp( NumericVector x, NumericVector y, NumericVector xout, const int maxgap=99999, const bool locf=false){
  NumericVector yout=clone(xout);
  for(int i=0;i<yout.size();i++){
    yout[i]=cpplinterp_(x,y,xout[i],maxgap,false);
  }
  if(locf){
    for(int i=0;i<yout.size();i++){
      if(NumericVector::is_na(yout[i])){
        int idx=1;
        while (i+idx< yout.size()-1 && NumericVector::is_na(yout[i+idx]) && x[i+idx]-x[i] < maxgap){
          idx++ ; 
        }
        yout[i]=yout[i+idx];       
      }
    }     
    for(int i=yout.size()-1;i>=0;i--){
      if(NumericVector::is_na(yout[i])){
        int idx=1;
        while (i-idx>0 && NumericVector::is_na(yout[i-idx]) && x[i]-x[i-idx] < maxgap){
          idx++ ; 
        }
        yout[i]=yout[i-idx];       
      }
    } 
  }
  return yout;
}

// [[Rcpp::export]]
NumericVector cppmatrixapprox(NumericMatrix x, const int maxgap=99999, const bool locf=false, Rcpp::Nullable<Rcpp::NumericVector> idx=R_NilValue){
  // x is generally the indice of columns, assumed to be strictly monotonic.
  // idx    points at which to interpolate.  
  NumericMatrix yout=clone(x);
  NumericVector xout;
  if(idx.isNull()){
    xout=seq(1,x.ncol());
  }else{
    xout=idx;
  }
  for(int nr=0;nr<yout.nrow();nr++){
    if(locf){
      NumericVector rowx(x.ncol());
      rowx=cpplinterp(xout,yout(nr,_),xout,maxgap,locf);
      for(int nc=0;nc<yout.ncol();nc++){
        if(NumericVector::is_na(yout(nr,nc))){
          yout(nr,nc)=rowx[nc];
        }
      }
    }else{
      for(int nc=0;nc<yout.ncol();nc++){
        if(NumericVector::is_na(yout(nr,nc))){
          yout(nr,nc)=cpplinterp_(xout,yout(nr,_),xout[nc],maxgap,locf);
        }
      }    
    }
  }
  return yout;
}
earthcli commented 4 years ago

or here http://paulbourke.net/miscellaneous/interpolation/

statlink commented 4 years ago

Hi earthcli. You will have to give me some time. I will see what I can do in R, but bear in mind that I am very busy this month the maintainer is not currently available. In any case, we will try.

statlink commented 4 years ago

Hi earthcli, I cannot understand what is going on. Send me an e-mail at mtsagris@yahoo.gr to discuss this further there. Give me an example of your data, of a vector how they look like to make me understand your situation as right now I can only guess how your data look like and my guesses are obviously wrong.

earthcli commented 4 years ago

Hi , Thanks for your consideration. My data is as below. Supposing the raw vector x is : 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 Generally, the first NA is approx by the left (-0.3) and right (1.3), the second NA is estimated approximately by -1.5 and -0.3, and I want to do this operation in a matrix or data.frame by row. for the first NA: Yout=Y1+(Y2-Y1)×(Xout-X1)/(X2-X1) 0.5=-0.3+(1.3--0.3)*(3-2)/(4-2)

The xout or idx parameter in the cpplinterp or cppmatrixapprox is the coordinate of x location, some times it may be not consecutive value, depending on the users. Some times there may be two consecutive NA, the algorithm has to find the non-NA value around the the NA that need to be estimated. For a vector, the x coordinate (xout) parameter generally is the index of x: seq_along (x) For the matrix, the x coordinate (xout/idx) parameter generally is the index of columns: seq_len(ncol(matrix)), but it may be defined by the user.

set.seed(0)
x=round(rnorm(8),1)
x[c(3,7)]=NA
zoo::na.approx(x)
cpplinterp(x=seq_along(x),y=x,xout=seq_along(x))

matrix=t(x)
matrix=matrix[rep(1,3),]
t(apply(matrix,1,zoo::na.approx))
cppmatrixapprox(matrix,idx=seq_len(ncol(matrix)))

set.seed(0) x=round(rnorm(8),1) x[c(3,7)]=NA zoo::na.approx(x) [1] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 cpplinterp(x=seq_along(x),y=x,xout=seq_along(x)) [1] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3

matrix=t(x) matrix=matrix[rep(1,3),] matrix [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 [2,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3 [3,] 1.3 -0.3 NA 1.3 0.4 -1.5 NA -0.3

t(apply(matrix,1,zoo::na.approx)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [2,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [3,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 cppmatrixapprox(matrix,idx=seq_len(ncol(matrix))) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [2,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3 [3,] 1.3 -0.3 0.5 1.3 0.4 -1.5 -0.9 -0.3