dcooley / sfheaders

Build sf objects from R and Rcpp
https://dcooley.github.io/sfheaders/
Other
74 stars 5 forks source link

interleaved geometry #78

Open dcooley opened 4 years ago

dcooley commented 4 years ago

coordinates stored in a single vector

[x,y,z,x,y,z,x,y,z,...]

And associated meta-data describing start indices, stride length


References & discussion


TODO


Prototype

library(sfheaders)

sf <- mapdeck::roads

it <- sfheaders:::rcpp_interleave_sf( sf )

str( it )

List of 6
 $ coordinates      : num [1:115514] 145 -37.8 145 -37.8 145 ...
 $ start_indices    : int [1:18286] 0 20 46 62 70 74 81 91 115 125 ...
 $ n_coordinates    : int [1:18286] 20 26 16 8 4 7 10 24 10 10 ...
 $ total_coordinates: num 57757
 $ stride           : int 2
 $ data             :'data.frame':  18286 obs. of  15 variables:
  ..$ EZI_RDNAME: chr [1:18286] "MAIN YARRA TRAIL" "YARRA BOULEVARD" "UNNAMED" "YARRA-CITYLINK OUT RAMP ON" ...
  ..$ FQID      : num [1:18286] 5516 1347 5516 3316 3470 ...
  ..$ FROM_UFI  : num [1:18286] 16141399 2319828 16141401 39157640 2319921 ...
  ..$ FTYPE_CODE: chr [1:18286] "road" "road" "road" "road" ...
  ..$ LEFT_LOC  : chr [1:18286] "BURNLEY" "BURNLEY" "BURNLEY" "BURNLEY" ...
  ..$ PFI       : num [1:18286] 11815534 5684911 11749283 5684402 5684635 ...
  ..$ RD_NAME1  : chr [1:18286] "CAPITAL CITY" NA NA NA ...
  ..$ RD_NAME2  : chr [1:18286] NA NA NA NA ...
  ..$ RD_TYPE1  : chr [1:18286] "TRAIL" NA NA NA ...
  ..$ RD_TYPE2  : chr [1:18286] NA NA NA NA ...
  ..$ RIGHT_LOC : chr [1:18286] "BURNLEY" "BURNLEY" "BURNLEY" "BURNLEY" ...
  ..$ ROAD_NAME : chr [1:18286] "MAIN YARRA" "YARRA" "UNNAMED" "YARRA-CITYLINK OUT" ...
  ..$ ROAD_TYPE : chr [1:18286] "TRAIL" "BOULEVARD" NA "RAMP" ...
  ..$ TO_UFI    : num [1:18286] 39357869 39364350 2319308 39157656 45130685 ...
  ..$ UFI       : num [1:18286] 39352041 39364294 16140773 39157330 45113563 ...
SymbolixAU commented 4 years ago

some benchmarks

library(microbenchmark)

## Need to get this speed down.
microbenchmark(
  df_T = { sfheaders::sf_to_df( mapdeck::roads, fill = T ) },
  df_F = { sfheaders::sf_to_df( mapdeck::roads, fill = F ) },
  it = { sfheaders:::rcpp_interleave_sf( mapdeck::roads ) },
  times = 10
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# df_T 29.93938 30.87460 32.42612 32.43507 34.31048 34.82401    10
# df_F 23.60596 24.32471 28.00786 28.39502 29.48871 34.00938    10
#   it 28.92496 30.99812 34.33903 35.01555 36.09556 40.84551    10

## interleaving benchmarks

library(Rcpp)

cppFunction(

  code = '
    Rcpp::NumericVector interleave( Rcpp::NumericMatrix& mat ) {

      R_xlen_t n_col = mat.ncol();
      R_xlen_t n_row = mat.nrow();
      Rcpp::NumericVector res( n_col * n_row );
      R_xlen_t i, j;
      R_xlen_t position_counter = 0;
      for( i = 0; i < n_row; ++i ) {
        for( j = 0; j < n_col; ++j ) {
          res[ position_counter ] = mat( i, j );
          position_counter = position_counter + 1;
        }
      }
      return res;

    }
  '
)

cppFunction(
  code = '
    Rcpp::NumericVector interleave_t( Rcpp::NumericMatrix& mat ) {
      Rcpp::NumericMatrix m2 = Rcpp::transpose( mat );
      m2.attr("dim") = R_NilValue;
      return m2;
    }
  '
)

cppFunction(
  code = '
    Rcpp::NumericVector interleave_cols( Rcpp::NumericMatrix& mat ) {
      Rcpp::NumericVector col1 = mat( Rcpp::_, 0 );
      Rcpp::NumericVector col2 = mat( Rcpp::_, 1 );
      R_xlen_t n_row = mat.nrow();
      Rcpp::NumericVector res( n_row + n_row );
      R_xlen_t i;
      R_xlen_t counter = 0;
      for( i = 0; i < n_row; ++i ) {
        res[ counter ] = col1[ i ];
        res[ counter + 1 ] = col2[ i ];
        counter = counter + 1;
      }
      return res;
    }
  '
)

## https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/vector/Matrix.h#L405
cppFunction(
  code = '
    Rcpp::NumericVector interleave_impl( Rcpp::NumericMatrix& mat ) {
      R_xlen_t nrow = mat.nrow(), ncol = mat.ncol();
      R_xlen_t len = nrow * ncol;
      R_xlen_t len2 = len - 1;
      Rcpp::NumericVector res( len );
      R_xlen_t i, j;
      for( i = 0, j = 0; i < len; ++i, j += nrow ) {
        if (j > len2) j -= len2;
        res[i] = mat[j];
      }
      return res;
    }
  '
)

m <- matrix(rnorm(3e1), ncol = 3)
as.numeric( t(m) )
interleave_t( m )
interleave_impl( m )

library(microbenchmark)

m <- matrix(rnorm(3e7), ncol = 3)

microbenchmark(
  loop = { interleave( m ) },
  tran = { interleave_t( m ) },
  cols = { interleave_cols( m ) },
  impl = { interleave_impl( m ) },
  times = 25
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq       max neval
# loop 55.18919 64.50406 70.85069 67.40234 71.77256 130.16606    25
# tran 51.28294 52.17571 60.82680 62.26534 68.09855  69.24250    25
# cols 60.38587 61.41036 64.24787 62.21885 67.20412  70.19668    25
# impl 48.87238 50.17013 58.46313 59.21523 65.89245  68.35909    25
kylebarron commented 4 years ago

It's been years since I've used R. If you have suggestions/comments about creating sf data structures from different types of Arrow structures, we'd love to hear.

dcooley commented 4 years ago

Cheers @kylebarron. From reading all the discussions I think the conversion from Arrow to sf will be pretty straight forward, so I'm trying to align these "interleaving" functions with what is being discussed.


And for what it's worth, this comment

One of the core concepts of Arrow is that it is a columnar layout, so which means all values in a single column (here: all geometries of one column) are stored in one contiguous array.

Is what I was thinking about when I asked this question

is your proposed size a single value, or an array? That is, is it possible (or does it make sense?) to encode two linestrings with different dimensions

(And like you I'm coming at this from a Deck.gl performance perspective)

kylebarron commented 4 years ago

Yes, I think I was a little confused. I didn't understand that each key of the struct was stored as a separate contiguous array. So the abstraction is that each struct is an independent record, but internally each value of the struct is an Array (possibly an Array of Arrays of Arrays ...). So the size is essentially an array internally that varies across rows, but not within a single record.

Anyways, I think there's still some good discussion to be had, so I wouldn't spend too much effort implementing any one proposal yet.

dcooley commented 4 years ago

Anyways, I think there's still some good discussion to be had, so I wouldn't spend too much effort implementing any one proposal yet.

yeah absolutely. I'm not "productionising" anything yet, just setting up some processes ready to hit whatever structure is needed.

mdsumner commented 4 years ago

How could you encode structure in this interleave form? (I guess it's too early to say for sfheaders yeah)

There's no mapping between $data and $start_indices and I'm drawn to this comment: https://github.com/geopandas/geo-arrow-spec/issues/4 the 'into parts / into rings / into values' stuff

The discussion there leaves me wondering what to do about triangles, you really don't want expanded interleaved versions of those, and it seems missing when you can have a packed 'values' array and ' indexing values' (it's missing from the SF standard like a crime from long dead ancestors) so I wonder if you've encountered the topic elsewhere?

dcooley commented 4 years ago

How could you encode structure in this interleave form

what do you mean by "structure" exactly?

There's no mapping between $data and $start_indices

The index of the [start_indices] array is the row-index of [data]

mdsumner commented 4 years ago

oh, sorry - quite right - I mean for multi-part types, is it just because it's LINESTRING above? - there's only run-length no parts- hierarchy (that's all I mean by structure)

dcooley commented 4 years ago

oh I see. Yeah I've not fully thought about it yet. Was going to come back to this once geometries and sfheaders are updated. There's a solution out there somewhere...

mdsumner commented 4 years ago

Cool, it's interesting - so refreshing to see such discussion, and I was pretty dismissive at first, I was exploring these things long ago and it seems the important stuff is not where I expected

mdsumner commented 4 years ago

The crux is in how we store indexed coordinates! I'm sure of it, we have to have triangles and segments in dense form and this seems so close to that now

dcooley commented 4 years ago

we have to have triangles

I should mention, my v0.0.1 plan of {interleave} is to create (only) POINTs, LINEs and Triangles (via ear-cutting) from any geometry (hence {geometries}), as these will feed directly into {mapdeck}.

mdsumner commented 4 years ago

Oh and of course, that means run length is enough - no need for part/hole nonsense when you only have foreground pieces. That's worth dwelling on ... seems like good convergence!