Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.6k stars 981 forks source link

nafill new type: approx #4066

Open ColeMiller1 opened 4 years ago

ColeMiller1 commented 4 years ago

This is a FR. Also I apologize ahead of time if this is not an appropriate post for here.

From this StackOverflow post:

library(data.table)
dt <- data.table(dist = c(31091.33, NA, 31100.00, 31103.27, NA, NA, NA, NA, 31124.98))
dt[, time := .I]

##       dist time
##1: 31091.33    1
##2:       NA    2
##3: 31100.00    3
##4: 31103.27    4
##5:       NA    5
##6:       NA    6
##7:       NA    7
##8:       NA    8
##9: 31124.98    9

OP wants to do a linear interpolation to fill in the NA values. The most simple solution is to use base approx() or zoo:na.approx() per user dww:

copy(dt)[, dist := zoo::na.approx(dist)][]
copy(dt)[, dist := approx(.I, dist, .I)$y][]

#       dist time
#1: 31091.33    1
#2: 31095.67    2
#3: 31100.00    3
#4: 31103.27    4
#5: 31107.61    5
#6: 31111.95    6
#7: 31116.30    7
#8: 31120.64    8
#9: 31124.98    9

The current nafill types are: c("const", "locf", "nocb") - an "approx" type would be helpful. I wrote an Rcpp implementation to show that there is room for improvement as it's ~10 times faster than the base method which is already a C based method:

#data per @chinsoon12
nr <- 1e7
nNA <- nr/2
DT <- data.table(time=1:nr, dist=replace(rnorm(nr), sample(1:nr, nNA), NA_real_))

bench::mark(
  rcpp_appr = rcpp_approx(DT[['dist']]),
  base_approx = DT[, approx(.I, dist, .I)$y],
  zoo_approx = zoo::na.approx(DT[['dist']]),
  time_unit = 's'
)

# A tibble: 3 x 13 (in seconds)
  expression    min median `itr/sec` mem_alloc
  <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>
1 rcpp_appr   0.126  0.132     7.08       76MB
2 base_approx 1.34   1.34      0.745     617MB
3 zoo_approx  2.22   2.22      0.450    1630MB ##manually changed from GB to MB

# this is the rcpp code
Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rcpp_approx(NumericVector y) {
  double start = 0, slope = 0;
  int count = 0;

  NumericVector y1 = clone(y); //added to not update-by-reference

  for(int i = 0; i < y1.size(); ++i){
    if (NumericVector::is_na(y1[i])){
      count++;
    } else {
      if (count != 0) {
        start = y1[i - (count+1)];
        slope = (y1[i] - start) / (count + 1);
        for (int j = 0; j < count; j++){
          y1[i-(count-j)] = start + slope * (j + 1);
        }
      count = 0;
      } 
    }
  }
return(y1);
}
')

Thank you for your consideration.

MichaelChirico commented 4 years ago

Agree linear interpolation is common enough we could potentially support. Could you help to clarify though:

copy(dt)[, dist := zoo::na.approx(dist)][]
copy(dt)[, dist := approx(.I, dist, .I)$y][]

The zoo version doesn't use .I, and neither version uses your time column. I also think maybe the example is a bit too simple since time is always increasing by one, and there are no duplicates in time, could you add some expected output in more complicated scenarios? Is this doing "local linear" interpolation (find the "nearest" non-NA values and draw a line between them), or is it "global linear" (we fit a 2-D [with constant] regression on all non-NA values and predict the fill value)?

Also worth mentioning that IINM this would be an API departure since it requires two input columns (the other types only need one) -- the "target" column and the "reference" column.

jangorecki commented 4 years ago

It is proper place to fill FR. Thanks Related to #3241, in case where we want to fill approx NAs just to apply rolling function on that, after #3241 we wouldn't need to materialize approximated values.

ColeMiller1 commented 4 years ago

The examples of the original post were for simple "local linear" interpolation. If you are open for 2D implementation, I could see it as:

setnafill(x, type = 'approx')

#which would call:
setnafill2d(x, method = 'local')

# setnafill2d(x, y = NULL, method = c('local', 'linear'))

Examples:

set.seed(123)
dt <- data.table(x = sort(sample(100, 10)), 
                 y = sort(sample(100, 10)))

dt[c(2,5,6,8), y := NA_integer_]

     x  y
 1: 14  9
 2: 25 NA
 3: 31 57
 4: 42 69
 5: 43 NA
 6: 50 NA
 7: 51 91
 8: 67 NA
 9: 79 93
10: 97 99

##     local interpolation ##
zoo::na.approx(dt[['y']])

##nafill(dt[['y']], type = 'approx'))
##nafill2d(dt[['y']], method = 'local')

 [1]  9.00000 33.00000 57.00000 69.00000 76.33333 83.66667 91.00000 92.00000
 [9] 93.00000 99.00000

##     linear interpolation    ##
zoo::na.approx(dt[['y']], dt[['x']])

##nafill2d(dt[['y']], dt[['x']], method = 'linear') ##no alternative

 [1]  9.00000 40.05882 57.00000 69.00000 71.44444 88.55556 91.00000 92.14286
 [9] 93.00000 99.00000

Regarding rolling joins, chinsoon12 provided two additional methods, one with rolling joins and the other with nafill. The rolling join method was approximately the same speed as base::approx()


# A tibble: 7 x 13  (in seconds)
  expression          min median `itr/sec` mem_alloc
  <bch:expr>        <dbl>  <dbl>     <dbl> <bch:byt>
1 rcpp_local        0.137  0.181     5.96       76MB
2 rcpp_linear_intx  0.142  0.159     5.77       76MB
3 rcpp_linear_doubx 0.149  0.160     6.35      152MB
4 dt_fill_method    0.677  0.677     1.48      667MB
5 dt_rolling_join   1.35   1.35      0.740     858MB
6 base_approx       1.33   1.33      0.750     617MB
7 zoo_approx        2.37   2.37      0.422    1630GB

## chinsoon12
dt_roll_approx <- function() {
  i <- DT[is.na(dist), which=TRUE]
  DT[i, {
    x0y0 <- DT[-i][.SD, on=.(time), roll=Inf, .(time=x.time, dist=x.dist)]
    x1y1 <- DT[-i][.SD, on=.(time), roll=-Inf, .(time=x.time, dist=x.dist)]
    (x1y1$dist - x0y0$dist) / (x1y1$time - x0y0$time) * (time - x0y0$time) + x0y0$dist
  }]
}

## chinsoon12
dt_nafill_approx <- function() {
  i <- DT[is.na(dist), which=TRUE]
  DT[, {
    y0 <- nafill(dist, "locf")
    x0 <- nafill(replace(time, i, NA), "locf")
    y1 <- nafill(dist, "nocb")
    x1 <- nafill(replace(time, i, NA), "nocb")
    fifelse(is.na(dist), (y1 - y0) / (x1 - x0) * (time - x0) + y0, dist)
  }]
}

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rcpp_approx2D_xint_ydoub(IntegerVector x, NumericVector y) {
  double x_start = 0, y_start = 0, slope = 0;
  int count = 0;

  NumericVector y1 = clone(y); //added to not update-by-reference

  for(int i = 0; i < y1.size(); ++i){
    if (NumericVector::is_na(y1[i])){
      count++;
    } else {
      if (count != 0) {
        x_start = x[i-(count+1)];
        y_start = y1[i-(count+1)];
        slope = (y1[i] - y_start) / (x[i]- x_start);
        for (int j = 0; j < count; j++){
          y1[i-(count-j)] = y_start + slope * (x[i - (count - j)] - x_start);
        }
        count = 0;
      }
    }
  }
  return(y1);
}

// [[Rcpp::export]]
NumericVector rcpp_approx2D(NumericVector x, NumericVector y) {
  double x_start = 0, y_start = 0, slope = 0;
  int count = 0;

  NumericVector y1 = clone(y); //added to not update-by-reference

  for(int i = 0; i < y1.size(); ++i){
    if (NumericVector::is_na(y1[i])){
      count++;
    } else {
      if (count != 0) {
        x_start = x[i-(count+1)];
        y_start = y1[i-(count+1)];
        slope = (y1[i] - y_start) / (x[i]- x_start);
        for (int j = 0; j < count; j++){
          y1[i-(count-j)] = y_start + slope * (x[i - (count - j)] - x_start);
        }
        count = 0;
      }
    }
  }
  return(y1);
}

// [[Rcpp::export]]
NumericVector rcpp_approx(NumericVector y) {
  double start = 0, slope = 0;
  int count = 0;

  NumericVector y1 = clone(y); //added to not update-by-reference

  for(int i = 0; i < y1.size(); ++i){
    if (NumericVector::is_na(y1[i])){
      count++;
    } else {
      if (count != 0) {
        start = y1[i - (count+1)];
        slope = (y1[i] - start) / (count + 1);
        for (int j = 0; j < count; j++){
          y1[i-(count-j)] = start + slope * (j + 1);
        }
        count = 0;
      } 
    }
  }
  return(y1);
}
')

#data per @chinsoon12
set.seed(0)
nr <- 1e7
nNA <- nr/2
DT <- data.table(time=1:nr, dist=replace(rnorm(nr), sample(1:nr, nNA), NA_real_))

bench::mark(
  rcpp_local = rcpp_approx(DT[['dist']])
  ,rcpp_linear_intx = rcpp_approx2D_xint_ydoub(DT[['time']], DT[['dist']])
  ,rcpp_linear_doubx = rcpp_approx2D(DT[['time']], DT[['dist']])
  ,dt_fill_method = dt_nafill_approx()
  , dt_rolling_join = dt_roll_approx()
  ,base_approx = DT[, approx(.I, dist, .I)$y]
  ,zoo_approx = zoo::na.approx(DT[['dist']])
  , check = F
  , time_unit = 's'
)
jangorecki commented 4 years ago

When presenting benchmarks please use same unit for all approaches. Showing seconds and milliseconds values for the same measure is unnecessarily confusing.

ColeMiller1 commented 4 years ago

Updated all benchmarks with one unit and included the full implementation of Rcpp for reference.

One thing to note for any 2D interpolation is that the class of x matters as well so that would be one more thing to deal with. This shows up in the benchmarks as it appears the template for NumericVector will cause a deep copy if you are sending an integer R object (...I think).

# A tibble: 7 x 13   (in seconds)
  expression          min median `itr/sec` mem_alloc
  <bch:expr>        <dbl>  <dbl>     <dbl> <bch:byt>
1 rcpp_local        0.137  0.181     5.96       76MB
2 rcpp_linear_intx  0.142  0.159     5.77       76MB
3 rcpp_linear_doubx 0.149  0.160     6.35      152MB
4 dt_fill_method    0.677  0.677     1.48      667MB
5 dt_rolling_join   1.35   1.35      0.740     858MB
6 base_approx       1.33   1.33      0.750     617MB
7 zoo_approx        2.37   2.37      0.422    1630GB
hope-data-science commented 4 years ago

May I ask when can nafill and setnafill support filling non-numeric variables?

jangorecki commented 4 years ago

@hope-data-science no ETA, unless there is open PR already, so we can prioritise it to include for next release.

hope-data-science commented 4 years ago

I do have a prototype using R codes:

down_fill = function(x){
  for(i in seq_along(x)){
    if(!is.na(x[i]) | i == 1) next
    else if(is.na(x[i]) & is.na(x[i-1])) next
    else x[i] <- x[i-1]
  }
  x
}

up_fill = function(x){
    rev(down_fill(rev(x) ) )
}

They work, but not fast. I think they might be faster in C. And these codes could be more efficient too.

hope-data-science commented 4 years ago

I've got a new one ('down' for LOCF, 'up' for NOCB):

shift_fill = function(x,direction = "down"){
  if(direction == "down") type = "lag"
  else if(direction == "up") type = "lead"
  repeat{
    x = fcoalesce(x,shift(x,type = type))
    if(!anyNA(x)) break
  }
  x
}

This should be faster, especially when there are shorter NA gaps.

tdhock commented 4 years ago

@hope-data-science nafill for non-numeric values is discussed in #3992 , which has a workaround

hope-data-science commented 4 years ago

@hope-data-science nafill for non-numeric values is discussed in #3992 , which has a workaround

Thank you for noting. Can you wrap it in a function? I have actually, in tidyfst::fill_na_dt. But I'm not sure if it is fast enough.

Thanks.