ateucher / rmapshaper

An R wrapper for the mapshaper javascript library
http://andyteucher.ca/rmapshaper/
Other
200 stars 13 forks source link

Performance on large files #61

Closed Robinlovelace closed 6 years ago

Robinlovelace commented 7 years ago

I've just been using this on some moderately large shapfiles.

Here's some reproducible code to demo the issue:

u = "https://borders.ukdataservice.ac.uk/ukborders/easy_download/prebuilt/shape/England_ua_caswa_2001_clipped.zip"
download.file(u, destfile = "zipped_shapefile.zip")
unzip("zipped_shapefile.zip")
f = list.files(pattern = ".shp")
res = sf::st_read(f)
# cas2003_simple = rmapshaper::ms_simplify(input = res, keep = 0.05) # didn't finish...
st_write(res, "cas2003.shp")
ms_msg = "mapshaper cas2003.shp -simplify dp 5% -o format=geojson cas2003.json"
[simplify] Repaired 437 intersections; 12 intersections could not be repaired
[o] Wrote cas2003.json
   user  system elapsed 
  7.984   0.248   8.164 
cas2003_simple = st_read("cas2003.json")

Context: I'm frustrated at the poor provision of UK administrative borders in suitable file formats or levels of simplification and have started a package to deal with it:

https://github.com/Robinlovelace/ukborders

Thanks loads for rmapshaper in any case!

ateucher commented 7 years ago

Hi @Robinlovelace - thanks for the report, and the reprex. I have definitely run into issues with rmapshaper being unable to handle really large spatial objects, but in this case I had no trouble (See below). Can you output your sessionInfo()?

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
library(rmapshaper)
u = "https://borders.ukdataservice.ac.uk/ukborders/easy_download/prebuilt/shape/England_ua_caswa_2001_clipped.zip"
download.file(u, destfile = "zipped_shapefile.zip")
unzip("zipped_shapefile.zip")
f = list.files(pattern = ".shp")
res = sf::st_read(f)
#> Reading layer `england_ua_caswa_2001_clipped' from data source `/private/var/folders/2w/x5wq73f93yzgm7hjr_b_54q00000gp/T/RtmpC0DMCy/england_ua_caswa_2001_clipped.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 1061 features and 5 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 243367 ymin: 50322 xmax: 595739.3 ymax: 537152
#> epsg (SRID):    NA
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs

system.time(cas2003_simple <- rmapshaper::ms_simplify(input = res, keep = 0.05))
#>    user  system elapsed 
#>  12.505   1.409  14.181

object.size(res)
#> 27084984 bytes
object.size(cas2003_simple)
#> 2205128 bytes

plot(cas2003_simple[, "name"])

ateucher commented 7 years ago

Also related to #59

Robinlovelace commented 7 years ago

Apologies, I provided the smaller of the 2 .shp files (not tested). Please try on this bigger one:

u = "https://borders.ukdataservice.ac.uk/ukborders/easy_download/prebuilt/shape/England_caswa_2001_clipped.zip"
ateucher commented 7 years ago

Ah, that is a different beast. It did complete for me, but it was definitely slow:

system.time(cas2003_simple <- rmapshaper::ms_simplify(input = res, keep = 0.05))
#>    user   system  elapsed 
#> 1740.608   64.060 1802.608

I'll see if I can find anything that can do better, but the bottleneck is converting to geojson before sending it into the V8 context to be processed by the mapshaper javascript library.

Robinlovelace commented 7 years ago

Could there not be an option to do it via read/write and system()?

I recall doing a bodge involving that (I think you helped me with it!) and it was pretty fast.

ateucher commented 7 years ago

There could for sure, I’ve definitely thought about it. Would require some thought about how to make sure the system mapshaper API is consistent with the commands generated in rmapshaper.

ateucher commented 6 years ago

I'm working on optionally using the system mapshaper in the mapshaper_v0.4.x branch, along with upgrading to the latest version of mapshaper.

Robinlovelace commented 6 years ago

Great news - let me know when it's ready to test and I can do some benchmarks. For 9 in 10 use cases it's unlikely to make any difference so 100% behind it being an option for people handling chunky files. Thanks loads!

ateucher commented 6 years ago

Hi @Robinlovelace - I've finally got around to implementing this. If you are willing to give it a bit of a test drive, that would be great! You can install_github("ateucher/rmapshaper", ref = "mapshaper_v0.4.x"). The argument is sys and is available in all functions. Default is FALSE, so if you set it to TRUE it should work!

Robinlovelace commented 6 years ago

Fantastic work @ateucher this is 4 times faster on the smaller of the examples - will make life easier for ppl wanting to simplify giant vector datasets. Can test on larger dataset at some point but from sketchy wifi connection on laptop, it expectations! Reprex showing the speed-up:

devtools::install_github("ateucher/rmapshaper", ref = "mapshaper_v0.4.x")
#> Skipping install of 'rmapshaper' from a github remote, the SHA1 (d197a3b7) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(sf)
#> Linking to GEOS 3.5.1, GDAL 2.2.2, proj.4 4.9.2
library(rmapshaper)
u = "https://borders.ukdataservice.ac.uk/ukborders/easy_download/prebuilt/shape/England_ua_caswa_2001_clipped.zip"
download.file(u, destfile = "zipped_shapefile.zip")
unzip("zipped_shapefile.zip")
f = list.files(pattern = ".shp")
res = sf::st_read(f)
#> Reading layer `england_ua_caswa_2001_clipped' from data source `/tmp/RtmpiO2NE3/england_ua_caswa_2001_clipped.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 1061 features and 5 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 243367 ymin: 50322 xmax: 595739.3 ymax: 537152
#> epsg (SRID):    NA
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs
system.time(cas2003_simple <- rmapshaper::ms_simplify(input = res, keep = 0.05))
#>    user  system elapsed 
#>  34.572   2.196  36.980
system.time(cas2003_simple_sys <- rmapshaper::ms_simplify(input = res, keep = 0.05, 
  sys = T))
#>    user  system elapsed 
#>   8.996   0.208   9.154
object.size(res)
#> 27084984 bytes
object.size(cas2003_simple)
#> 2205128 bytes
identical(cas2003_simple, cas2003_simple_sys)
#> [1] TRUE
plot(cas2003_simple[, "name"])

ateucher commented 6 years ago

Excellent! I ran it with the big version and this is what I get:

library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, proj.4 4.9.3
library(rmapshaper)

u = "https://borders.ukdataservice.ac.uk/ukborders/easy_download/prebuilt/shape/England_caswa_2001_clipped.zip"
download.file(u, destfile = "zipped_shapefile.zip")
unzip("zipped_shapefile.zip")
f = list.files(pattern = ".shp")
res = sf::st_read(f)
#> Reading layer `england_caswa_2001_clipped' from data source `/private/var/folders/2w/x5wq73f93yzgm7hjr_b_54q00000gp/T/RtmpZLkA8z/england_caswa_2001_clipped.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 6930 features and 5 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 85665 ymin: 7054 xmax: 655604 ymax: 657534.1
#> epsg (SRID):    NA
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs

system.time(cas2003_simple_sys <- ms_simplify(input = res, keep = 0.05, sys = TRUE))
#>    user  system elapsed 
#>  57.382   2.670  59.937

system.time(cas2003_simple_internal <- ms_simplify(input = res, keep = 0.05, sys = FALSE))
#>     user   system  elapsed 
#> 1276.978   60.552 1514.478

object.size(cas2003_simple_sys)
#> 16266760 bytes
object.size(cas2003_simple_internal)
#> 16266760 bytes

Still not super fast but much better; the bottleneck is writing to geojson with sf::st_write...

Robinlovelace commented 6 years ago

Great - exceeds expectations! One idea you may have already tried is writing to different formats. I think this is a great solution in any case, many thanks for the fix.

ateucher commented 6 years ago

@Robinlovelace rmapshaper 0.4 is now on CRAN with the sys argument available in all functions.

Robinlovelace commented 6 years ago

Fantastic, many thanks! Heads-up @Nowosad we should mention this in the simplification section of the book.