SimonCoulombe opened 4 years ago
Hi @SimonCoulombe, thanks for your question!
Yes, it would be extremely useful to be able to serialize simple features columns. That would make serialization of geospatial objects efficient. Also, it would be possible to do a selection of features on-disk, that's very useful for large geospatial datasets.
If we have some geospatial data:
library(tidyverse) # includes ggplot2
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
# a few polygons
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
# easy plotting
ggplot(nc) + geom_sf(aes(fill = AREA))
we can take a look at the column types:
# nc is a table
glimpse(nc)
#> Observations: 100
#> Variables: 15
#> $ AREA <dbl> 0.114, 0.061, 0.143, 0.070, 0.153, 0.097, 0.062, 0.091, 0...
#> $ PERIMETER <dbl> 1.442, 1.231, 1.630, 2.968, 2.206, 1.670, 1.547, 1.284, 1...
#> $ CNTY_ <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 183...
#> $ CNTY_ID <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 183...
#> $ NAME <fct> Ashe, Alleghany, Surry, Currituck, Northampton, Hertford,...
#> $ FIPS <fct> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37073, 3...
#> $ FIPSNO <dbl> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37073, 3...
#> $ CRESS_ID <int> 5, 3, 86, 27, 66, 46, 15, 37, 93, 85, 17, 79, 39, 73, 91,...
#> $ BIR74 <dbl> 1091, 487, 3188, 508, 1421, 1452, 286, 420, 968, 1612, 10...
#> $ SID74 <dbl> 1, 0, 5, 1, 9, 7, 0, 0, 4, 1, 2, 16, 4, 4, 4, 18, 3, 4, 1...
#> $ NWBIR74 <dbl> 10, 10, 208, 123, 1066, 954, 115, 254, 748, 160, 550, 124...
#> $ BIR79 <dbl> 1364, 542, 3616, 830, 1606, 1838, 350, 594, 1190, 2038, 1...
#> $ SID79 <dbl> 0, 3, 6, 2, 3, 5, 2, 2, 2, 5, 2, 5, 4, 4, 6, 17, 4, 7, 1,...
#> $ NWBIR79 <dbl> 19, 12, 260, 145, 1197, 1237, 139, 371, 844, 176, 597, 13...
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-81.47276 3..., MULTIPOLYGON...
The last column holds the actual sf geometry data, and it's structured as:
# column 'geometry' contains rather complex sfc_MULTIPOLYGON objects
str(nc[1, "geometry", drop = TRUE])
#> sfc_MULTIPOLYGON of length 1; first list element: List of 1
#> $ :List of 1
#> ..$ : num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...
#> - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
# the coordinate data (the polygon points) is buried deep within the object
str(nc[1, "geometry", drop = TRUE][[1]][[1]][[1]])
#> num [1:27, 1:2] -81.5 -81.5 -81.6 -81.6 -81.7 ...
I think these sf elements are too complex to support natively in the fstlib library. The best option would probably be to regard these elements as blobs of binary data, so fstlib would recognize them as binary R data of type sf. The fst package could then deserialize these blobs into geospatial data, but for other bindings, for example in Python, these columns would be meaningless...
On the other hand, any future on-disk selection mechanism would still work on geospatial data stored this way, so that would be a nice addition. What do you think?
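To make the blob idea concrete, here's a minimal sketch using base R serialization; the binary-column support in fst itself is hypothetical, so the middle step is only indicated:

```r
library(sf)

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# serialize each sfg element of the geometry column to a raw vector (a "blob")
geometry_blobs <- lapply(nc$geometry, function(g) serialize(g, connection = NULL))

# ...these raw vectors would be stored by fstlib as an opaque binary column
# (not yet supported)...

# deserializing restores the original sfg objects exactly
restored <- lapply(geometry_blobs, unserialize)
identical(restored[[1]], nc$geometry[[1]])
#> [1] TRUE
```

The drawback, as noted above, is that `serialize()` produces R-specific binary data, unreadable from other language bindings.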
I don't know much about serialization and just came across this issue. But would it help to use sf::st_as_binary(nc$geometry)
to encode the geometries as well-known binary (WKB) instead of well-known text (WKT)?
https://r-spatial.github.io/sf/articles/sf1.html#wkb
Hi @andschar,
thanks, yes, the WKB format seems to be universal across languages, so that would be great to use!
Following the example above:
x <- nc[1, "geometry", drop = TRUE]
print(x)
#> Geometry set for 1 feature
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -81.74107 ymin: 36.23436 xmax: -81.23989 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
# convert to binary
x_bin <- sf::st_as_binary(x)
print(x_bin)
#> [[1]]
#> [1] 01 06 00 00 00 01 00 00 00 01 03 00 00 00 01 00 00 00 1b 00 00 00 00 00 00
#> [26] a0 41 5e 54 c0 00 00 00 60 ff 1d 42 40 00 00 00 20 9d 62 54 c0 00 00 00 80
#> [51] e1 22 42 40 00 00 00 80 f7 63 54 c0 00 00 00 20 05 23 42 40 00 00 00 20 84
#> [76] 68 54 c0 00 00 00 a0 9b 2b 42 40 00 00 00 c0 6d 6f 54 c0 00 00 00 00 26 32
#> [101] 42 40 00 00 00 a0 b0 6c 54 c0 00 00 00 40 63 3c 42 40 00 00 00 a0 fa 6c 54
#> [126] c0 00 00 00 c0 79 42 42 40 00 00 00 40 e1 6a 54 c0 00 00 00 a0 79 4b 42 40
#> [151] 00 00 00 60 19 56 54 c0 00 00 00 a0 53 49 42 40 00 00 00 20 3e 56 54 c0 00
#> [176] 00 00 60 da 44 42 40 00 00 00 20 c9 54 54 c0 00 00 00 40 c0 41 42 40 00 00
#> [201] 00 80 0d 54 54 c0 00 00 00 80 87 3d 42 40 00 00 00 00 0a 51 54 c0 00 00 00
#> [226] 60 f6 37 42 40 00 00 00 60 d2 50 54 c0 00 00 00 60 d8 33 42 40 00 00 00 80
#> [251] 67 4f 54 c0 00 00 00 c0 90 30 42 40 00 00 00 60 5a 4f 54 c0 00 00 00 40 c4
#> [276] 2e 42 40 00 00 00 60 e9 50 54 c0 00 00 00 e0 1b 2d 42 40 00 00 00 40 0e 55
#> [301] 54 c0 00 00 00 40 87 2e 42 40 00 00 00 c0 20 57 54 c0 00 00 00 60 34 2d 42
#> [326] 40 00 00 00 80 67 57 54 c0 00 00 00 00 66 2b 42 40 00 00 00 20 aa 56 54 c0
#> [351] 00 00 00 20 5d 26 42 40 00 00 00 60 84 57 54 c0 00 00 00 60 ac 23 42 40 00
#> [376] 00 00 40 02 5a 54 c0 00 00 00 a0 7c 24 42 40 00 00 00 a0 63 5a 54 c0 00 00
#> [401] 00 a0 36 22 42 40 00 00 00 20 96 5b 54 c0 00 00 00 40 5f 21 42 40 00 00 00
#> [426] 20 fc 5c 54 c0 00 00 00 c0 aa 1e 42 40 00 00 00 a0 41 5e 54 c0 00 00 00 60
#> [451] ff 1d 42 40
#>
#> attr(,"class")
#> [1] "WKB"
# and back again
sf::st_as_sfc(x_bin, EWKB = TRUE)
#> Geometry set for 1 feature
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -81.74107 ymin: 36.23436 xmax: -81.23989 ymax: 36.58965
#> epsg (SRID): NA
#> proj4string: NA
#> MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
Some data seems to be missing after the round-trip (the CRS: epsg and proj4string come back as NA), but that's just fine-tuning I think...
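If I read the sf documentation correctly, st_as_sfc() accepts a crs argument for WKB input, so the lost CRS could presumably be re-attached on decoding. An untested sketch, where 4267 is the epsg value from the original print-out above:

```r
library(sf)

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x <- nc[1, "geometry", drop = TRUE]
x_bin <- sf::st_as_binary(x)

# re-attach the CRS that is not stored in the plain WKB bytes
x_back <- sf::st_as_sfc(x_bin, EWKB = TRUE, crs = 4267)
```

So fst would only need to store the CRS once as column metadata, next to the WKB blobs.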
The performance might be a problem though:
# get sf element
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x <- nc[1, "geometry", drop = TRUE]
# convert 1e4 elements
timing <- microbenchmark::microbenchmark(
lapply(1:1e4, function(element) {
# convert to binary
sf::st_as_binary(x)
})
)
# GB / s
1e4 * as.numeric(object.size(x)) / median(timing$time)
#> [1] 0.09141796
that's about 100 MB/s for encoding, slow but acceptable. For decoding we have
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
# get sf element
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x <- nc[1, "geometry", drop = TRUE]
# convert to binary
x_bin <- sf::st_as_binary(x)
# convert back 1e2 elements
timing <- microbenchmark::microbenchmark(
lapply(1:1e2, function(element) {
# convert back from binary
st_as_sfc(x_bin, EWKB = TRUE)
})
)
# GB / s
1e2 * as.numeric(object.size(x)) / median(timing$time)
#> [1] 0.02345609
That's pretty slow (about 23 MB/s), but perhaps there is a way to make this code run in parallel and speed things up.
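A rough sketch of parallelizing the decoding with the base parallel package; mclapply uses forking, so this is Unix-only (on Windows one would need parLapply with a cluster):

```r
library(sf)
library(parallel)

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
x_bin <- sf::st_as_binary(nc[1, "geometry", drop = TRUE])

# decode 1e2 copies of the WKB blob across all available cores
decoded <- mclapply(
  1:1e2,
  function(i) sf::st_as_sfc(x_bin, EWKB = TRUE),
  mc.cores = detectCores()
)
```

Whether this helps in practice depends on how the per-element decoding cost compares to the forking overhead.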
Thanks for pointing out the WKB encoding option!
(and the format is not too complicated I see, so it's probably not very difficult to directly parse it...)
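For example, the WKB header is easy to pick apart: byte 1 is the byte order (01 = little-endian) and the next four bytes are an unsigned 32-bit geometry type (6 = MultiPolygon). A quick look using the blob from above:

```r
library(sf)

nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
wkb <- sf::st_as_binary(nc[1, "geometry", drop = TRUE])[[1]]

# read the WKB header fields from the raw vector
con <- rawConnection(wkb)
byte_order <- readBin(con, what = "raw", n = 1)                    # 01 = little-endian
geom_type  <- readBin(con, what = "integer", size = 4, endian = "little")
close(con)

byte_order
#> [1] 01
geom_type
#> [1] 6
```

This matches the first five bytes (01 06 00 00 00) of the dump printed earlier.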
I understand about 50% of what is going on and 0% of the amount of work involved to make this happen, but I am reading this with glee... :)
I have a dataset of car trips.
The dataset has four columns (trip_id, timestamp, lat, lon), and each trip_id has a few hundred rows with points showing the location every few seconds.
Is fst able to read/write geospatial data (such as 'sf' data frames), or do I have to save it as a regular data.frame?
If not, a suggestion would be welcome :)
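In the meantime, for plain point data like GPS traces, one workaround is to keep lat/lon as regular numeric columns, write those with fst, and rebuild the sf object after reading. A sketch, with hypothetical trip data matching the four columns described above:

```r
library(sf)
library(fst)

# toy data in the shape described above: trip_id, timestamp, lat, lon
trips <- data.frame(
  trip_id   = c(1L, 1L, 2L),
  timestamp = Sys.time() + 0:2,
  lat       = c(45.50, 45.51, 45.52),
  lon       = c(-73.57, -73.58, -73.59)
)

# write and read the plain data.frame with fst
fst::write_fst(trips, "trips.fst")
trips2 <- fst::read_fst("trips.fst")

# rebuild an sf object from the coordinate columns (assuming WGS84)
trips_sf <- sf::st_as_sf(trips2, coords = c("lon", "lat"), crs = 4326)
```

This only works for point geometries; anything more complex (lines, polygons) would need the WKB approach discussed above.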