fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0
27 stars 4 forks source link

Method (C) for determining if a row-selection corresponds to a range #35

Open MarcusKlik opened 6 years ago

MarcusKlik commented 6 years ago

By providing a helper C method, we can test for a range without creating intermediate vectors.

A first (non-optimized) try:

Rcpp::cppFunction('

  SEXP is_range(SEXP int_vec) {
    int vec_length = LENGTH(int_vec);

    int* int_values = INTEGER(int_vec);
    int counter = int_values[0];

    for (int i = 1; i < vec_length; i++) {
    counter++;
    if (int_values[i] != counter) return Rf_ScalarLogical(FALSE);
    }  

    return Rf_ScalarLogical(TRUE);
  }
')

# a range
is_range(1:1e6)
#> [1] TRUE

# not a range
is_range(c(1:1e6, 2L))
#> [1] FALSE

# performance
int_vec <- 1:1e7
timings <- microbenchmark::microbenchmark(
  is_range(int_vec)
)

# speed in billions of integers per second
length(int_vec) / median(timings$time)
#> [1] 1.746079

That's fast enough for practical purposes I think. Perhaps the method above can be improved by chunking the vector and determining the sum of squares of the vector minus the expected vector.

(that would minimize the number of if statements and may improve CPU throughput)

MarcusKlik commented 6 years ago

Such a method could also serve as a general range detector:

# another range
is_range(10:100)
#> [1] TRUE

If we can store ranges alongside random row selections (see also #29), that might be useful as well.

MarcusKlik commented 6 years ago

With R 3.5.0 and further, ranges can be detected much more easily because they are implemented with the ALTREP framework (so no custom range detection is needed anymore).