NIEHS / chopin

Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing
https://niehs.github.io/chopin/

sparklyr and spatial extension test #2

Closed · sigmafelix closed this issue 1 year ago

sigmafelix commented 1 year ago

Basic configuration and data-input examples are below (adapted from the sparklyr homepage, with its split code blocks consolidated into one).

if (!require(pacman)) {
    install.packages("pacman")
    library(pacman)
}

p_load(sparklyr, apache.sedona, catalog, dplyr, bench, nycflights13)
# sparklyr::spark_install(version="3.4.0")

conf <- spark_config()   # Initialize the Spark configuration object

# Setting "executor"
conf$spark.executor.memory <- "4G"
conf$spark.executor.cores <- 4
conf$spark.executor.instances <- 2
conf$spark.dynamicAllocation.enabled <- "false"

# Setting "cache"
# Total cache memory for repetitive access (not by file i/o)
conf$`sparklyr.shell.driver-memory` <- "16G"

sc <- spark_connect(master = "local",
                    config = conf)
# Spark web UI: http://localhost:4040/

# Data load into Spark
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")
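
As a quick sanity check, the copied tables can be queried with dplyr verbs, which sparklyr translates to Spark SQL; collect() pulls the small summary back into the R session. This is a minimal sketch using only the tables loaded above.

# Grouped summary executed in Spark; collect() returns a local tibble
flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_dep_delay)) %>%
  collect()

# Disconnect when finished
# spark_disconnect(sc)
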
sigmafelix commented 1 year ago

Since Apache Sedona only accepts ESRI Shapefiles, WKT, GeoParquet, and GeoJSON, we will need to agree on a common data exchange format across the team, for both raster and vector data. Points of consideration include which format to standardize on and how to move data between Spark/Sedona and the sf/terra objects we use elsewhere.
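
For the vector side, one low-friction option is exchanging geometries as WKT text, which requires nothing beyond sf and sparklyr. The sketch below is only an illustration: it assumes the local sc connection configured above, uses the nc sample dataset shipped with sf, and omits any Sedona-side processing.

library(sf)

# Sample polygons shipped with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Serialize geometries to WKT so the layer can live in Spark as a plain text column
nc_wkt <- data.frame(
  name     = nc$NAME,
  area     = nc$AREA,
  geometry = st_as_text(st_geometry(nc))
)
nc_tbl <- copy_to(sc, nc_wkt, "nc_wkt", overwrite = TRUE)

# ... Spark/Sedona processing would happen here ...

# Rebuild an sf object from the WKT column after collecting the result
nc_back <- st_as_sf(collect(nc_tbl), wkt = "geometry", crs = st_crs(nc))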

kyle-messier commented 1 year ago

  1. This R-Spatial book has a good section on using sf/stars with large raster data cubes: https://r-spatial.org/book/09-Large.html (see the sketch below)
  2. The Zarr file format is specifically designed for cloud and distributed file systems
  3. Zarr files could be stored in the CHORDS catalog and referenced from functions/scripts
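
As an illustration of point 1, a minimal sketch of the lazy, out-of-memory stars workflow from that chapter is below; the raster paths are hypothetical, and reading Zarr this way depends on the local GDAL build including the Zarr driver.

library(stars)

# Hypothetical large raster; proxy = TRUE keeps the data on disk
r <- read_stars("data/large_predictor.tif", proxy = TRUE)

# Operations on a stars_proxy object are recorded lazily, e.g. a per-pixel
# summary across the remaining dimensions (such as bands or time)
r_mean <- st_apply(r, c("x", "y"), mean)

# Data are only read (downsampled or in chunks) when materialized or plotted
plot(r_mean)

# Zarr stores can be read through GDAL's multidimensional API, if the
# local GDAL build includes the Zarr driver
# z <- read_mdim("data/large_cube.zarr")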
sigmafelix commented 1 year ago

Thank you. I have been dealing with errors when converting apache.sedona's spatial resilient distributed datasets (SpatialRDD) back to sf objects with all attributes intact. The apache.sedona and sparklyr tooling is not yet as mature as the conventional sf and terra workflows, so for the alpha version I will take a file-based (Zarr and GeoParquet/GeoPackage), multithreaded approach with sf or terra to calculate covariates.
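For reference, a minimal sketch of that file-based, multiprocess approach with terra is below. The input paths are hypothetical, the points are assumed to share the raster's CRS, and mclapply forks the R process, so on Windows a PSOCK cluster (parallel::parLapply) would be needed instead.

library(terra)
library(parallel)

# Hypothetical inputs: a point layer (GeoPackage) and a predictor raster (GeoTIFF)
point_path  <- "input/sites.gpkg"
raster_path <- "input/predictor.tif"

# Pull point coordinates into a plain matrix; unlike terra objects (which hold
# external pointers), a matrix is safe to ship to parallel workers
xy <- crds(vect(point_path))

# Split row indices into one chunk per unit of parallel work
chunks <- split(seq_len(nrow(xy)), cut(seq_len(nrow(xy)), 8, labels = FALSE))

# Each worker re-opens the raster from its file path and extracts values
# for its chunk of coordinates
extracted <- mclapply(chunks, function(idx) {
  r <- rast(raster_path)
  extract(r, xy[idx, , drop = FALSE])
}, mc.cores = 4)

covariates <- do.call(rbind, extracted)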

sigmafelix commented 1 year ago

Test completed. May revisit the issue after figuring out alternative interfaces to Spark and Sedona (e.g., Python).