NIEHS / chopin

Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing
https://niehs.github.io/chopin/

sparklyr and spatial extension test #2

Closed · sigmafelix closed this issue 1 year ago

sigmafelix commented 1 year ago

Basic configuration and data-input examples are below (adapted from the sparklyr homepage, with its split code blocks consolidated into one).

if (!require(pacman)) {
    install.packages("pacman")
    library(pacman)
}

p_load(sparklyr, apache.sedona, catalog, dplyr, bench, nycflights13)
# sparklyr::spark_install(version="3.4.0")

conf <- spark_config()   # Initialize the Spark configuration object

# Setting "executor"
conf$spark.executor.memory <- "4G"
conf$spark.executor.cores <- 4
conf$spark.executor.instances <- 2
conf$spark.dynamicAllocation.enabled <- "false"

# Setting "cache"
# Total cache memory for repetitive access (not by file i/o)
conf$`sparklyr.shell.driver-memory` <- "16G"

sc <- spark_connect(master = "local",
                    config = conf)
# Spark web UI: http://localhost:4040/

# Data load into Spark
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")
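
As a quick sanity check, the copied tables can be queried with dplyr verbs, which sparklyr translates to Spark SQL; collect() pulls the small summary back into the R session. This is a minimal sketch using only the tables loaded above.

# Grouped summary executed in Spark; collect() returns a local tibble
flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_dep_delay)) %>%
  collect()

# Disconnect when finished
# spark_disconnect(sc)
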
sigmafelix commented 1 year ago

Since Apache Sedona only accepts ESRI Shapefiles, WKT, GeoParquet, and GeoJSON, we will need to agree on a common data exchange format across the team, for both raster and vector data. Points of consideration include which format to standardize on and how to move data between Spark/Sedona and the sf/terra objects we use elsewhere.
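
For the vector side, one low-friction option is exchanging geometries as WKT text, which requires nothing beyond sf and sparklyr. The sketch below is only an illustration: it assumes the local sc connection configured above, uses the nc sample dataset shipped with sf, and omits any Sedona-side processing.

library(sf)

# Sample polygons shipped with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Serialize geometries to WKT so the layer can live in Spark as a plain text column
nc_wkt <- data.frame(
  name     = nc$NAME,
  area     = nc$AREA,
  geometry = st_as_text(st_geometry(nc))
)
nc_tbl <- copy_to(sc, nc_wkt, "nc_wkt", overwrite = TRUE)

# ... Spark/Sedona processing would happen here ...

# Rebuild an sf object from the WKT column after collecting the result
nc_back <- st_as_sf(collect(nc_tbl), wkt = "geometry", crs = st_crs(nc))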

kyle-messier commented 1 year ago

  1. This R-Spatial book has a good section on using sf/stars with large raster data cubes: https://r-spatial.org/book/09-Large.html (see the sketch below)
  2. The Zarr file format is specifically designed for cloud and distributed file systems
  3. Zarr files could be stored in the CHORDS catalog and referenced from functions/scripts
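
As an illustration of point 1, a minimal sketch of the lazy, out-of-memory stars workflow from that chapter is below; the raster paths are hypothetical, and reading Zarr this way depends on the local GDAL build including the Zarr driver.

library(stars)

# Hypothetical large raster; proxy = TRUE keeps the data on disk
r <- read_stars("data/large_predictor.tif", proxy = TRUE)

# Operations on a stars_proxy object are recorded lazily, e.g. a per-pixel
# summary across the remaining dimensions (such as bands or time)
r_mean <- st_apply(r, c("x", "y"), mean)

# Data are only read (downsampled or in chunks) when materialized or plotted
plot(r_mean)

# Zarr stores can be read through GDAL's multidimensional API, if the
# local GDAL build includes the Zarr driver
# z <- read_mdim("data/large_cube.zarr")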
sigmafelix commented 1 year ago

Thank you. I have been dealing with errors when converting apache.sedona's spatial resilient distributed datasets (SpatialRDD) back to sf objects with all attributes intact. The apache.sedona and sparklyr tooling is not yet as mature as the conventional sf and terra workflows, so for the alpha version I will take a file-based (Zarr and GeoParquet/GeoPackage), multithreaded approach with sf or terra to calculate covariates.
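For reference, a minimal sketch of that file-based, multiprocess approach with terra is below. The input paths are hypothetical, the points are assumed to share the raster's CRS, and mclapply forks the R process, so on Windows a PSOCK cluster (parallel::parLapply) would be needed instead.

library(terra)
library(parallel)

# Hypothetical inputs: a point layer (GeoPackage) and a predictor raster (GeoTIFF)
point_path  <- "input/sites.gpkg"
raster_path <- "input/predictor.tif"

# Pull point coordinates into a plain matrix; unlike terra objects (which hold
# external pointers), a matrix is safe to ship to parallel workers
xy <- crds(vect(point_path))

# Split row indices into one chunk per unit of parallel work
chunks <- split(seq_len(nrow(xy)), cut(seq_len(nrow(xy)), 8, labels = FALSE))

# Each worker re-opens the raster from its file path and extracts values
# for its chunk of coordinates
extracted <- mclapply(chunks, function(idx) {
  r <- rast(raster_path)
  extract(r, xy[idx, , drop = FALSE])
}, mc.cores = 4)

covariates <- do.call(rbind, extracted)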

sigmafelix commented 1 year ago

Test completed. May revisit the issue after figuring out alternative interfaces to Spark and Sedona (e.g., Python).