appelmar / gdalcubes_cpp

Earth observation data cubes from GDAL image collections
MIT License

parallelism in create_image_collection + cube operations #57

Closed. cboettig closed this issue 1 year ago.

cboettig commented 1 year ago

I'm working with a use case that involves a lot of looping over the pattern of first creating an image collection and then computing on it. It basically looks like:

lapply(many_collections, \(collection_i) {
  gdalcubes::create_image_collection(collection_i) |>
    raster_cube(view) |>
    select_bands(bands) |>
    extract_geom(sites) |>
    write_csv(somewhere)
})

gdalcubes nicely breaks this up with two progress bars: one during create_image_collection(), and the other during the computation on the raster cube.

That second part of the pipeline parallelizes rather well under the hood with gdalcubes_options(parallel=TRUE). It doesn't quite saturate my bandwidth or CPU, but it usually achieves a few hundred MB/s and uses all available cores (though at pretty low utilization).

However, resource use when preparing the collection is far lower: it always uses only one core and achieves only a few MB/s in transfers. As a result, the create_image_collection() step takes almost as much time as the rest of the processing, which makes it difficult to achieve good use of computational resources. I have tried parallelizing my create_image_collection() step (e.g. with mclapply or makePSOCKcluster), but I guess because of the C++ pointers this doesn't actually work: the code runs, uses more cores, and achieves higher bandwidth, but the resulting image collections are invalid.

I've found I can parallelize the whole loop above (while also using gdalcubes_options(parallel=TRUE)). This increases RAM use (not surprisingly) but does let me make better use of compute resources: with 10 parallel processes during the raster_cube() part I can get 10 GB/s (running on an AWS c6i.32xlarge instance) and hit 100% CPU utilization across all 128 cores. However, at this scale, most of the time is spent on create_image_collection(), where I'm back down to a paltry 5-6 MB/s in transfers.
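
For reference, here is a minimal sketch of this outer-loop parallelization (many_collections, view, bands, and sites are the placeholders from the pipeline above, and mc.cores = 10 is just an example value):

library(gdalcubes)
library(parallel)

results <- mclapply(many_collections, function(collection_i) {
  gdalcubes_options(parallel = TRUE)  # cube computation is still parallelized within each worker
  create_image_collection(collection_i) |>
    raster_cube(view) |>
    select_bands(bands) |>
    extract_geom(sites)               # return the extracted table from each worker
}, mc.cores = 10)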

Maybe there is no way around the realities here, but it would be wonderful if you had any advice on improving computational use in this workflow. Maybe my case is uniquely bad because I'm pulling from GRIB files instead of COG files, though they are already on the AWS platform as well. (The many iterations arise because these are NOAA GEFS forecasts -- the forecast produced on any given day involves 31 ensemble members, and I'd like to get all ensemble members in a forecast for all forecasts in a year. But each ensemble member in each 35-day forecast needs its own "cube", since a probabilistic forecast really has 2 extra dimensions beyond the standard geocube -- we have forecast horizon and uncertainty, in addition to band, datetime, and xy location.) Any advice you have would be much appreciated. Reproducible code I'm using is here: https://github.com/eco4cast/gefs4cast/blob/main/inst/cog-examples/efi-snapshots.R; it is just a thin wrapper around gdalcubes, basically implementing the pseudo-code pipeline I outlined above.

Thanks for any tips or pointers!

appelmar commented 1 year ago

Thanks for sharing this interesting use case. Here are some thoughts and questions, though I haven't tried to reproduce it yet:

  1. You are right, the creation of image collections is not parallelized, even when setting gdalcubes_options(parallel = TRUE). I think that in many cases (local data, or cloud data from a STAC response), this is fine because the bottleneck is the SQLite write operations, which would cause many locks and a lot of waiting in a parallel setting. I did some tests a while ago and concluded that it is not worth the effort, given that in many cases one has to create the image collection only once.

  2. I think the low bandwidth can be explained by the access pattern when creating the image collection: all files are opened and only the metadata (extent, srs, etc.) is extracted, resulting in many HTTP requests and a lot of overhead. I am not sure whether this will help, but it may be worth experimenting with the GDAL configuration options to optimize /vsicurl/ access (see e.g. here); a small sketch of setting these from R follows after this list.

  3. Possible improvements: First of all, I think that parallelizing the image collection creation in your case might improve performance, but probably not by much more (if at all) than parallelizing the whole loop (with independent image collections), as you have already tried. I am not really familiar with the dataset, but if there is any way you can get the metadata (spatial extent, srs, datetime) without opening the files, it would likely be much faster to create the image collection manually (similar to stac_image_collection()). Alternatively, if all files share the same srs and spatial extent, maybe even stack_cube() would be useful. Happy to help if this is an option.
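
Regarding point 2, here is a minimal sketch of setting a few common GDAL configuration options for /vsicurl/ from R via environment variables; the particular options and values below are only examples to experiment with, not tested recommendations:

Sys.setenv(
  GDAL_DISABLE_READDIR_ON_OPEN = "EMPTY_DIR", # avoid directory listings when opening /vsicurl/ paths
  GDAL_INGESTED_BYTES_AT_OPEN  = "32768",     # read more header bytes in the first request
  GDAL_HTTP_MAX_RETRY          = "5",         # retry transient HTTP errors
  GDAL_HTTP_RETRY_DELAY        = "2",
  VSI_CACHE                    = "TRUE",      # enable the VSI block cache
  VSI_CACHE_SIZE               = "100000000"  # cache size in bytes
)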

Hope this helps!

cboettig commented 1 year ago

Thanks @appelmar, this is great. Sounds like passing the metadata manually will be the way to go. I was passing datetime but not spatial extent and srs, so I'll add that and take a look at stack_cube() too.

I'll poke at the /vsicurl/ settings too; I haven't dug into those yet -- really nice that you provide recommended choices there as well!

appelmar commented 1 year ago

Thanks, just to add: passing the spatial extent and srs is not as easy as passing the datetime when creating an image collection. You basically have to build the complete metadata tables of the image collection as data.frames beforehand (see stac_image_collection()). Let me know if I can help.

cboettig commented 1 year ago

Thanks for these pointers, @appelmar , I've gotten things to run over 10x faster with these improvements!

For the gribs, they are all global extent, so stack_cube() works marvelously and is very simple. One thing I've noticed, though, is that while the create_image_collection() + view approach is quite robust to things like a missing URL, stack_cube() is less forgiving. Is there a natural way around this? (There are many thousands of URLs, of course, which we generate automatically using their predictable format even though NOAA doesn't provide STAC entries here, but this means one bad URL or bad file throws an error for the whole stack_cube() set. I can manually ping the URLs or list them from the bucket, but it adds non-negligible overhead to do so.)
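
In case it helps, here is roughly what that pre-check approach looks like as a sketch (grib_urls and forecast_datetimes are placeholders for the generated GEFS paths and times, the stack_cube() argument names should be checked against ?stack_cube, and the reachability check assumes the httr package is available):

library(gdalcubes)
library(httr)

urls  <- grib_urls           # "/vsicurl/https://..." paths, one per forecast step (placeholder)
times <- forecast_datetimes  # matching datetime strings (placeholder)

# drop URLs that do not respond before building the cube
ok <- vapply(sub("^/vsicurl/", "", urls),
             function(u) tryCatch(identical(status_code(HEAD(u)), 200L),
                                  error = function(e) FALSE),
             logical(1))

cube <- stack_cube(urls[ok], datetime_values = times[ok]) |>
  select_bands(bands) |>
  extract_geom(sites)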

Anyway, thanks again for this marvelous package.

appelmar commented 1 year ago

Thanks for the update, great to read that you could speed things up. FYI, stack_cube() should now (in the upcoming 0.6.4 release) also be more robust and simply skip images that are not accessible.