fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0
619 stars 42 forks source link

Can fst be used with geospatial data? #232

Open SimonCoulombe opened 4 years ago

SimonCoulombe commented 4 years ago

I have a dataset of car trips.

The dataset has four columns ( trip_id timestamp, lat, lon) , and each trip_id will have a few hundreds rows with points showing the location every few second.

Is fst able to read/write geospatial data (such as 'sf' data frames), or do I have to save it as a regular data.frame ?

If not, a suggestion would be welcome :)

MarcusKlik commented 4 years ago

Hi @SimonCoulombe, thanks for your question!

Yes, it would be extremely useful to be able to serialize simple features columns. That would make serialization of geospatial objects efficient. Also, it would be possible to do a selection of features on-disk, that's very useful for large geospatial datasets.

If we have some geospatial data:

library(tidyverse)  # includes ggplot2
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

# a few polygons
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# easy plotting
ggplot(nc) + geom_sf(aes(fill = AREA))

image

we can take a look at the column types:

# nc is a table
glimpse(nc)
#> Observations: 100
#> Variables: 15
#> $ AREA      <dbl> 0.114, 0.061, 0.143, 0.070, 0.153, 0.097, 0.062, 0.091, 0...
#> $ PERIMETER <dbl> 1.442, 1.231, 1.630, 2.968, 2.206, 1.670, 1.547, 1.284, 1...
#> $ CNTY_     <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 183...
#> $ CNTY_ID   <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 183...
#> $ NAME      <fct> Ashe, Alleghany, Surry, Currituck, Northampton, Hertford,...
#> $ FIPS      <fct> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37073, 3...
#> $ FIPSNO    <dbl> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37073, 3...
#> $ CRESS_ID  <int> 5, 3, 86, 27, 66, 46, 15, 37, 93, 85, 17, 79, 39, 73, 91,...
#> $ BIR74     <dbl> 1091, 487, 3188, 508, 1421, 1452, 286, 420, 968, 1612, 10...
#> $ SID74     <dbl> 1, 0, 5, 1, 9, 7, 0, 0, 4, 1, 2, 16, 4, 4, 4, 18, 3, 4, 1...
#> $ NWBIR74   <dbl> 10, 10, 208, 123, 1066, 954, 115, 254, 748, 160, 550, 124...
#> $ BIR79     <dbl> 1364, 542, 3616, 830, 1606, 1838, 350, 594, 1190, 2038, 1...
#> $ SID79     <dbl> 0, 3, 6, 2, 3, 5, 2, 2, 2, 5, 2, 5, 4, 4, 6, 17, 4, 7, 1,...
#> $ NWBIR79   <dbl> 19, 12, 260, 145, 1197, 1237, 139, 371, 844, 176, 597, 13...
#> $ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((-81.47276 3..., MULTIPOLYGON...

the last column is the actual sf data, and it's formatted as:

# column 'geometry' contains rather complex sfc_MULTIPOLYGON object's
str(nc[1, "geometry", drop = TRUE])
#> sfc_MULTIPOLYGON of length 1; first list element: List of 1
#>  $ :List of 1
#>   ..$ : num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...
#>  - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

# the data is buried deep within the object and contains few elements (polygon points)
str(nc[1, "geometry", drop = TRUE][[1]][[1]][[1]])
#>  num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...

I think these sf elements are too complex to support natively in the fstlib library. The best option would probably be to regards these elements as blobs of binary data. So fstlib would recognize them as binary R data of type sf. The fst package can deserialize these blobs into geospatial data. But for other bindings, for example in Python, these columns would be meaningless...

But any future on-disk selection mechanisms would work on geospatial data, so that would be a nice addition, what do you think?

andschar commented 4 years ago

I don't know much about serialization and just came across this issue. But, would it help to use sf::st_as_binary(nc$geometry) and transform the Well-known text (WKT) into Well-known binary (WKB) encoding? https://r-spatial.github.io/sf/articles/sf1.html#wkb

MarcusKlik commented 4 years ago

Hi @andschar,

thanks, yes, the WNB format seems to be universal across languages, so that would be great to use!

Following the example above:

x <- nc[1, "geometry", drop = TRUE]

print(x)
#> Geometry set for 1 feature 
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -81.74107 ymin: 36.23436 xmax: -81.23989 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...

# convert to binary
x_bin <- sf::st_as_binary(x)

print(x_bin)
#> [[1]]
#>   [1] 01 06 00 00 00 01 00 00 00 01 03 00 00 00 01 00 00 00 1b 00 00 00 00 00 00
#>  [26] a0 41 5e 54 c0 00 00 00 60 ff 1d 42 40 00 00 00 20 9d 62 54 c0 00 00 00 80
#>  [51] e1 22 42 40 00 00 00 80 f7 63 54 c0 00 00 00 20 05 23 42 40 00 00 00 20 84
#>  [76] 68 54 c0 00 00 00 a0 9b 2b 42 40 00 00 00 c0 6d 6f 54 c0 00 00 00 00 26 32
#> [101] 42 40 00 00 00 a0 b0 6c 54 c0 00 00 00 40 63 3c 42 40 00 00 00 a0 fa 6c 54
#> [126] c0 00 00 00 c0 79 42 42 40 00 00 00 40 e1 6a 54 c0 00 00 00 a0 79 4b 42 40
#> [151] 00 00 00 60 19 56 54 c0 00 00 00 a0 53 49 42 40 00 00 00 20 3e 56 54 c0 00
#> [176] 00 00 60 da 44 42 40 00 00 00 20 c9 54 54 c0 00 00 00 40 c0 41 42 40 00 00
#> [201] 00 80 0d 54 54 c0 00 00 00 80 87 3d 42 40 00 00 00 00 0a 51 54 c0 00 00 00
#> [226] 60 f6 37 42 40 00 00 00 60 d2 50 54 c0 00 00 00 60 d8 33 42 40 00 00 00 80
#> [251] 67 4f 54 c0 00 00 00 c0 90 30 42 40 00 00 00 60 5a 4f 54 c0 00 00 00 40 c4
#> [276] 2e 42 40 00 00 00 60 e9 50 54 c0 00 00 00 e0 1b 2d 42 40 00 00 00 40 0e 55
#> [301] 54 c0 00 00 00 40 87 2e 42 40 00 00 00 c0 20 57 54 c0 00 00 00 60 34 2d 42
#> [326] 40 00 00 00 80 67 57 54 c0 00 00 00 00 66 2b 42 40 00 00 00 20 aa 56 54 c0
#> [351] 00 00 00 20 5d 26 42 40 00 00 00 60 84 57 54 c0 00 00 00 60 ac 23 42 40 00
#> [376] 00 00 40 02 5a 54 c0 00 00 00 a0 7c 24 42 40 00 00 00 a0 63 5a 54 c0 00 00
#> [401] 00 a0 36 22 42 40 00 00 00 20 96 5b 54 c0 00 00 00 40 5f 21 42 40 00 00 00
#> [426] 20 fc 5c 54 c0 00 00 00 c0 aa 1e 42 40 00 00 00 a0 41 5e 54 c0 00 00 00 60
#> [451] ff 1d 42 40
#> 
#> attr(,"class")
#> [1] "WKB"

# and back again
st_as_sfc(x_bin, EWKB = TRUE)
#> Geometry set for 1 feature 
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -81.74107 ymin: 36.23436 xmax: -81.23989 ymax: 36.58965
#> epsg (SRID):    NA
#> proj4string:    NA
#> MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...

some data seems to be missing after the round-trip, but that's just fine-tuning I think...

The performance might be a problem though:

# get sf element
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x <- nc[1, "geometry", drop = TRUE]

# convert 1e4 elements
timing <- microbenchmark::microbenchmark(
  lapply(1:1e4, function(element) {
    # convert to binary
    sf::st_as_binary(x)
  })
)

# GB / s
1e4 * as.numeric(object.size(x)) / median(timing$time)
#> [1] 0.09141796

that's about 100 MB/s for encoding, slow but acceptable. For decoding we have

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

# get sf element
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x <- nc[1, "geometry", drop = TRUE]

# convert to binary
x_bin <- sf::st_as_binary(x)

# convert back 1e4 elements
timing <- microbenchmark::microbenchmark(
  lapply(1:1e2, function(element) {
    # convert to binary
    st_as_sfc(x_bin, EWKB = TRUE)
  })
)

# GB / s
1e2 * as.numeric(object.size(x)) / median(timing$time)
#> [1] 0.02345609

That's pretty slow (23 MB/s) but perhaps there is a way to make this code run in parallel and speed things up.

Thanks for pointing out the WKB encoding option!

MarcusKlik commented 4 years ago

(and the format is not too complicated I see, so it's probably not very difficult to directly parse it...)

SimonCoulombe commented 4 years ago

I understand about 50% of what is going on and 0% of the amount of work involved to make this happen, but I am reading this with glee... :)