DavisVaughan / furrr

Apply Mapping Functions in Parallel using Futures
https://furrr.futureverse.org/

terra::rast() doesn't return a variable with workers > 1 #259

Open twest820 opened 1 year ago

twest820 commented 1 year ago

A reprex isn't feasible because multiple gigabytes (actually multiple terabytes in the full use case) of data are involved, but I have the following scenario

library(dplyr)
library(furrr)
library(progressr)
library(sf)
library(terra)

plan(multisession, workers = 16)

simpleFeatureCollection = st_read("simpleFeatureCollection.gpkg")

with_progress({
  progressBar = progressor(steps = nrow(simpleFeatureCollection))

  future_map(simpleFeatureCollection$ID, function(polygonID)
  {
    regionOfInterestPolygon = (simpleFeatureCollection %>% filter(ID == polygonID))[1]
    mediumSizeRaster = rast("twoGBraster.tif")
    rasterRegionOfInterest = crop(mediumSizeRaster, regionOfInterestPolygon)

    <do computationally intensive things>

    progressBar(<update message>)
  })
})

which fails with

Error in (function (.x, .f, ..., .progress = FALSE)  : ℹ In index: 1.
Caused by error in `h()`:
! error in evaluating the argument 'x' in selecting a method for function 'crop': object 'mediumSizeRaster' not found

The same code runs fine with workers = 1. While this approach isn't ideal (it would likely waste 60+ GB of memory on duplicate copies of a raster that's safe to share since it only sees read access), the preferred implementation of hoisting rast() out of the function body fails with #258. Since I've got 128 GB of DDR and can afford to waste some of it, is there a way to get rast() to construct an object under parallel execution?
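
For reference, one possible route around #258 (untested here, and depending on how terra packs a file-backed raster it may pull the values into memory, so the many-copies cost could remain) is terra's wrap()/unwrap(), which packs a SpatRaster into a serializable object that furrr can export to workers. A minimal sketch, assuming the same objects and plan as above:

library(furrr)
library(terra)

plan(multisession, workers = 16)

# pack once in the main session; the packed copy can be serialized to workers
packedRaster = wrap(rast("twoGBraster.tif"))

future_map(simpleFeatureCollection$ID, function(polygonID)
{
  # rebuild a usable SpatRaster inside the worker
  mediumSizeRaster = unwrap(packedRaster)
  regionOfInterestPolygon = (simpleFeatureCollection %>% filter(ID == polygonID))[1]
  rasterRegionOfInterest = crop(mediumSizeRaster, regionOfInterestPolygon)
  # <do computationally intensive things>
})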

From what I can see at the moment, the least undesirable workaround appears to be to refactor the code for single-threaded execution, manually chunk and balance the polygons, and then kick off 16 background jobs in RStudio using Code -> Run selection as background job. But, insofar as I understand furrr, that's exactly the sort of task future_map() exists to automate.
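
For what it's worth, that chunk-and-background-jobs idea can also be scripted rather than launched by hand from the RStudio menu. Here's a rough sketch using callr::r_bg() in place of RStudio jobs; the per-chunk compute loop is elided and the chunk count of 16 just mirrors the worker count above:

library(callr)
library(sf)

simpleFeatureCollection = st_read("simpleFeatureCollection.gpkg")

# split the polygon IDs into 16 roughly equal chunks
chunks = split(simpleFeatureCollection$ID,
               cut(seq_along(simpleFeatureCollection$ID), 16, labels = FALSE))

# one background R process per chunk; each reopens its own file handles
jobs = lapply(chunks, function(polygonIDs)
{
  r_bg(function(ids, gpkgPath, rasterPath)
  {
    polygons = sf::st_read(gpkgPath)
    mediumSizeRaster = terra::rast(rasterPath)
    # ... single-threaded crop and compute over ids ...
  },
  args = list(ids = polygonIDs,
              gpkgPath = "simpleFeatureCollection.gpkg",
              rasterPath = "twoGBraster.tif"))
})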

DavisVaughan commented 1 year ago

I'm not sure how big simpleFeatureCollection is, but one thing to keep in mind with your current approach is that you (probably) get 16 copies of it, one for each worker, and that could be expensive.

DavisVaughan commented 1 year ago

Are you sure mediumSizeRaster = rast("twoGBraster.tif") is actually resulting in an object? It looks like it uses a relative path so the working directory on the worker may be different. You could try supplying an absolute path instead.
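
If that's the cause, something along these lines would rule it out (a sketch using the same names as the original example): resolve the path in the main session and let furrr export the absolute path to the workers.

# resolve to an absolute path before the workers ever see it
rasterPath = normalizePath("twoGBraster.tif", mustWork = TRUE)

future_map(simpleFeatureCollection$ID, function(polygonID)
{
  mediumSizeRaster = rast(rasterPath)  # absolute path, unaffected by the worker's working directory
  # <do computationally intensive things>
})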

twest820 commented 1 year ago

If the error message for workers > 1 is to be believed, it appears the call to rast() is somehow getting skipped. Even if the statement was executed and rast() hit some silent error that made it return NULL instead of failing properly, that should still result in mediumSizeRaster being added as a workspace variable. So it seems like something might be going pretty badly wrong, though, given future's limitations in flowing diagnostics from workers back to their caller, we might be stuck. (I find myself often wishing for plan(multithread), but that's not on furrr.)

If it was a pathing issue, which it presumably isn't since there's no issue with workers = 1, I'd expect to see something like the usual

Error: [rast] file does not exist: twoGBraster.tif
In addition: Warning message:
twoGBraster.tif: No such file or directory (GDAL error 4) 

come back. But future may not be able to route that.
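
One way to check that from the calling side (a sketch reusing the layout of the original map body) is to trap the condition on the worker and return its message, rather than relying on future to route the original error and warning back:

future_map(simpleFeatureCollection$ID, function(polygonID)
{
  rastResult = tryCatch(rast("twoGBraster.tif"), error = function(e) e)
  if (inherits(rastResult, "error"))
  {
    # return the worker-side diagnostic instead of letting it get rewrapped
    return(conditionMessage(rastResult))
  }
  mediumSizeRaster = rastResult
  # <do computationally intensive things>
})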

I'm not sure how big simpleFeatureCollection is

Good question! It's only a couple MB, so negligible in this context—32 workers would be better but even 8 GB per worker is maybe asking too much (if this approach to the task had worked I was prepared to kill the future_map() and try with eight workers to get 16 GB DDR per worker if physical memory was going to be exceeded).