USDAForestService / FIESTA

FIESTA (Forest Inventory ESTimation and Analysis) is a research estimation tool for analysts who work with sample-based inventory data from the U.S. Department of Agriculture, Forest Service, Forest Inventory and Analysis (FIA) Program. Follow the link below for more information:
https://usdaforestservice.github.io/FIESTA/

add parallelization functionality to `spExtractPoly` #28

Closed: joshyam-k closed this 9 months ago

joshyam-k commented 9 months ago

Here's a reproducible example showing that (at least in this case) parallelizing changes nothing about the function's actual output, only how we get there. I should also note that since the dataset here is only about 50 rows, the parallelized method is actually a tick slower, as we'd expect.

devtools::install_github("joshyam-k/FIESTA")
library(FIESTA)

WYplt <- FIESTA::WYplt

# Get polygon vector layer from FIESTA external data
WYbhdistfn <- system.file("extdata",
                          "sp_data/WYbighorn_districtbnd.shp",
                          package = "FIESTA")

# Extract polygon attributes to the plot points, in parallel on 8 cores
xyext_parallel <- spExtractPoly(xyplt = WYplt,
                       polyvlst = WYbhdistfn,
                       xy.uniqueid = "CN",
                       spMakeSpatial_opts = list(xvar = "LON_PUBLIC",
                                                 yvar = "LAT_PUBLIC",
                                                 xy.crs = 4269),
                       ncores = 8)$spxyext
#> Using 8 cores...

# Same extraction, without parallelization
xyext <- spExtractPoly(xyplt = WYplt,
                       polyvlst = WYbhdistfn,
                       xy.uniqueid = "CN",
                       spMakeSpatial_opts = list(xvar = "LON_PUBLIC",
                                                 yvar = "LAT_PUBLIC",
                                                 xy.crs = 4269))$spxyext

identical(xyext, xyext_parallel)
#> [1] TRUE
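
For anyone curious about the mechanics, the usual way to parallelize a point-in-polygon extraction is to split the points into chunks, join each chunk against the polygon layer on its own core, and recombine. A minimal standalone sketch of that pattern (an illustration only, not the exact FIESTA internals; `parallel_point_in_poly` is a hypothetical helper):

library(sf)
library(parallel)

# Hypothetical sketch: chunked parallel spatial join
parallel_point_in_poly <- function(pts, polys, ncores = 2) {
  # Assign each point to one of ncores roughly equal chunks
  chunk_id <- cut(seq_len(nrow(pts)), breaks = ncores, labels = FALSE)
  chunks <- split(pts, chunk_id)
  # mclapply forks, which is unavailable on Windows (use ncores = 1 there)
  joined <- mclapply(chunks,
                     function(ch) sf::st_join(ch, polys),
                     mc.cores = ncores)
  # Recombine the per-chunk results into a single sf object
  do.call(rbind, joined)
}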
ctoney commented 9 months ago

Nice. Thanks for the example. It runs correctly for me. I assume this gives substantial speedup on the 35 million points in NV?

Using `spxyext <- spxyext[!duplicated(spxyext[[xy.uniqueid]]), ]` instead of `spxyext <- unique(sf::st_join(sppltx, polyv))` probably helps too, even without parallel?
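
A toy illustration of why (not FIESTA code): `unique()` compares every column of every row, geometry included, while `duplicated()` on the id column only checks a single vector. Note the two aren't strictly equivalent: if a point intersects more than one polygon, `unique()` keeps every distinct row, while `!duplicated()` keeps only the first row per id.

# Toy data frame with each row triplicated
df <- data.frame(CN = rep(seq_len(1e5), each = 3),
                 x  = rep(runif(1e5), each = 3))
system.time(a <- unique(df))               # row-wise comparison across all columns
system.time(b <- df[!duplicated(df$CN), ]) # single-column check
identical(a$CN, b$CN)
#> [1] TRUE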

Looks good to merge.

joshyam-k commented 9 months ago

Above a million rows I was consistently seeing 5-10x speedups. And yes, reworking the removal of duplicate rows definitely speeds things up quite a bit, even in the non-parallel case.
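
For reference, a rough harness along these lines reproduces the comparison (`big_plt` is a placeholder for a large point table such as the NV plots, paired with a polygon layer that covers it; it isn't included here):

t_serial <- system.time(
  spExtractPoly(xyplt = big_plt,
                polyvlst = WYbhdistfn,
                xy.uniqueid = "CN",
                spMakeSpatial_opts = list(xvar = "LON_PUBLIC",
                                          yvar = "LAT_PUBLIC",
                                          xy.crs = 4269))
)
t_parallel <- system.time(
  spExtractPoly(xyplt = big_plt,
                polyvlst = WYbhdistfn,
                xy.uniqueid = "CN",
                spMakeSpatial_opts = list(xvar = "LON_PUBLIC",
                                          yvar = "LAT_PUBLIC",
                                          xy.crs = 4269),
                ncores = 8)
)
unname(t_serial["elapsed"] / t_parallel["elapsed"])  # speedup factor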