Partial composites with a subset of tiles from a full tile_grid?

sheffe commented 5 years ago

I'm a few days into heavy use of slippymath and really enjoying it -- thank you! FWIW, I'm using it to rapidly construct training sets for some image analysis prototyping -- I don't know if you had that use-case in mind when writing the package, but I'm writing <50 lines of code to get a fantastic training set with zero overhead. Mind-blowing.

With that application in mind, I started pulling and caching huge numbers of tiles at relatively high zoom. Even with only ~10k tiles pulled, compositing everything back into a raster (to recover the spatial information) is not trivial -- that amounts to a 25,600px square image. For larger areas I'm doubtful that it's a good idea to try. The next step for many types of image analyses would be to take many bite-size chunks from that large image and crop/rotate/coarsen/etc to preprocess, so compositing the full raster is a bottleneck that isn't really needed other than to return the image tile files to a raster object with the CRS/extent/resolution embedded correctly.

I've experimented with the function compose_tile_grid and subsets of the tile_grid object, and I run into trouble when I don't pass in the complete tile_grid object in order to accurately staple tiles back into a raster with the spatial information correctly attributed.

I'm working through package internals and think I understand what's going on here. (I think) the bbox_to_tile_grid object transformation is exactly invertible. If that's true, then (borrowing from the README) we could probably take a statement like this: tile_grid <- bbox_to_tile_grid(uluru_bbox, max_tiles = 15) and come up with some corresponding process that turns a tile_grid into (eg) an sf polygon dataframe describing the precise boundaries of each component tile. That would allow for rasterizing each tile independently, with full spatial information for the specific tile that can be merged (or not) into a larger raster.

That would also permit informed subsetting of which tiles to look at in a subsample of a larger slippymath pull. For example, this kind of process would be broadly useful:

take a random subset of points within an area of interest,
find the tile polygons covering the points,
grab the tiles surrounding the covering tile in an NxN box => we get an enriched training set of many partially-overlapping images.

It looks like you've done a lot of work to remove the sf dependency before CRAN release, so this line of thinking may be better interpreted in spirit than implementation. Does the overall concept make sense? My use-case might be too idiosyncratic for a PR into the package, but if my premises seem directionally correct to you, then I'll plan to write up some logic to implement this independently. I can link a gist in case you're interested.

Thanks again for all of your work on this.

MilesMcBain commented 5 years ago

Hey thanks for raising this, I am glad you are getting some use out of slippymath!

Compositing large numbers of tiles is definitely not a great idea with the package as it stands now. compose_tile_grid uses a simple call to raster::merge() which has some undesirable memory management practices that can really blow out processing time.

Regarding rasterising individual tiles: There's probably a function missing to do that. Right now you can get a list of sf bounding boxes for every tile in grid using tile_grid_bboxes, and you have your list of tile images you've pulled down, but you're kind of on your own from there to combine those pieces to make individual spatially referenced rasters.

However code to make those individual rasters already lives in compose_tile_grid. I mean this bit:

                        raster_img <-
                            raster::brick(image,
                                          crs = attr(bbox, "crs")$proj4string)
                        raster::extent(raster_img) <-
                            raster::extent(bbox[c("xmin", "xmax", "ymin", "ymax")])
                        raster_img

So if we pull that out into a new function it can be used to combine the pieces.

The one thing this wouldn't give you that you mentioned are polygons. Do you just need them as a step toward the individual rasters or are they important in their own right?

Edit: Have you checked out ceramic? It builds on slippymath and has some nice features like caching, and fetching tiles given many different kinds of spatial objects - not just bounding boxes.

sheffe commented 5 years ago

Edit: Have you checked out ceramic?

If I had a nickel for every time Michael Sumner had already written the thing I needed but didn't know the correct search terms to describe... This looks great. I'm hoping to use non-Mapbox sources but the framework looks spot-on. (I want to download a ton of tiles to semi-permanent storage, which runs against Mapbox ToS.)

With your answer above and the ceramic pointer, I can see two ways to solve the issue I raised -- will experiment further. Worth checking first: @mdsumner do you anticipate extending the ceramic package to sources outside of Mapbox?

Answering your other question:

The one thing this wouldn't give you that you mentioned are polygons. Do you just need them as a step toward the individual rasters or are they important in their own right?

The pointer to compose_tile_grid is enough, I think -- I can make a polygon from the bbox of each tile and proceed from there. I just need to work tile-by-tile, but it's a small modification of the existing logic. If I get something working, would you want this feature in the package? (Totally cool if you want to keep the footprint small, especially since ceramic does a lot of the same job.)

The background use-case -- I'm definitely interested in the polygons in their own right. It might be an idiosyncratic workflow. I do lots of raster computations using data stored on S3, so it's hard to merge together a huge raster once, leave it out of memory, and load cropped subsets for processing. (Most of my work uses many small and frequently-overlapping subsets, and I process them in parallel, so I rarely need to merge more than a few tiles at a time.) I usually leave the raw data in small tiles and create an index file for loading only what I need in a batch. The basic idea: if I have a directory of N tiles, representing some combination of metadata (zoom-level, map-type, bounding box, etc), I store the filename/S3 key, metadata columns, and sf POLYGONs from the raster extents/bboxes. I can leave all of the raw data on S3 and use the index for (eg) intersecting to focal areas or defining a point sampling scheme. Arbitrary metadata can be added this way -- for example, maybe we want a balanced training set on a range of population densities, so stratify our image sampling accordingly. It also scales pretty well to large sets of tiles indexed in a DB. Having a PostGIS table of raster boundary polygons pointing to spatial data in S3 can seem odd, but S3 is dirt-cheap and fast, and S3/EC2 transfer is free.

mdsumner commented 5 years ago

I actually just added 'general' branch which now includes an AWS source to ceramic, the one used by elevatr.

The fastest way to merge is via VRT and use of GDAL's lazy tools. I'm going to separate that out in ceramic soon.

MilesMcBain / slippymath

Partial composites with a subset of tiles from a full tile_grid? #12