framework for point-extraction

mdsumner commented 6 months ago

(this is just a heads-up, in case there's other work in progress or for this to generate discussion)

I have a proto-point-in-cell lookup PR, this was modelled on 'gdallocationinfo' originally, but fit naturally in this package. I'm doing my own "point-to-cell-to-colrow-index" with my package vaster, but the the logic is trivial and could be inline here for methods that accept cell, col/row, or xy values (and we could emulate gdallocationinfo to automatically transform wgs84 input to the raster itself).

https://github.com/mdsumner/gdalraster/blob/435c0f4b819350d3b0f240a48cc748d5ab3db781/src/gdalraster.cpp#L931

It currently forces output to always be Float64, I'll just follow up with that as a todo for me to flesh this out.

I don't really know what it should be called, or how to differentiate col/row input from xy input for a function signature. It's competitive with terra::extract for local files. I have been experimenting with ways to batch the requests per tile for remote sources and parallelize those (that was my original motivation). Performance will be better when it's not always assuming Float64 output. I'd like to see gdallocationinfo librarified into the GDAL API itself, and I consider this a bit of an educational task for me to pursue that - but, with not too much work it probably is a good fit for gdalraster pretty soon.

It also implies the need for a bit of grid logic in gdalraster too, easy enough in R - find the col/row index of a point, my own strategy there was {vaster} to emulate the grid-logic functions in the raster package, but easy to inline that here.

Thanks!

To get col/row, something like this works with dimension = XSize, YSize and $bbox() and coords xy in the crs of the raster:

.index_point <- function(dimension, bbox, xy) {
    xx <- xy[, 1L, drop = TRUE]
    yy <- xy[, 2L, drop = TRUE]
    len <- length(xx)
    ncols <- dimension[1L]
    nrows <- dimension[2L]
    xmin <- bbox[1L]
    xmax <- bbox[3L]
    ymin <- bbox[2L]
    ymax <- bbox[4L]
    yres_inv <-  nrows/(ymax - ymin)
    xres_inv <-  ncols/(xmax - xmin)
    row <- floor((ymax - yy) * yres_inv)
    row <- ifelse(yy == ymin, nrows - 1, row)
    col <- floor((xx - xmin) * xres_inv)
    col <- ifelse(xx == xmax, ncols - 1, col)
    row <- ifelse(row < 0 | row >= nrows, NA_real_, row)
    col <- ifelse(col < 0 | col >= ncols, NA_real_, col)
    cbind(col = col, row = row) 
}

ctoney commented 6 months ago

Hi @mdsumner,

(this is just a heads-up, in case there's other work in progress or for this to generate discussion)

I have more comments on this topic for a another reply soon. I work for the national forest inventory in the US. Our grid of monitoring plots across all ownerships is the gold standard for forest remote sensing in the US. I only point that out to say that we do a lot of point extraction from raster (a near-daily workflow), typically for large point counts and large raster. I have a couple of versions of that code that go way back (Python and R), and it has been considered to move into gdalraster at some point.

That said, your solution is quite interesting and I would be happy to discuss more and potentially modernize my own code around that (i.e., switch to your solution and incorporate bits that I need for our workflows). In particular,

ways to batch the requests per tile for remote sources and parallelize those (that was my original motivation)

is something that I need now for production work and is on my near-term to-do list (so this issue is timely, thanks!)

find the col/row index of a point

Have you looked at gdalraster::get_pixel_line(xy, gt) for input a matrix or data frame of xy? (implemented in C++)

We also have gdalraster::inv_geotransform(gt), wrapper of GDALInvGeoTransform() (which get_pixel_line() uses - this should be basically 'gdallocationinfo' internals). And we have gdalraster::inv_project().

mdsumner commented 6 months ago

Oh, indeed I should have looked more deeply. Will use those methods, that's perfect 🙏

And, really keen to see your methods. To me it comes down to grouping by block and reading into mem, then doing local extract (by point or by converted index), or grouping by block and sending GDAL RasterIO off to do it (global index).

I expect the first is faster, but will depend on how it's chunked and the balance of how many points there are per block (perhaps that's a possible optimization).

I'm not sure about indexing the points, and the need for localizing (block col/row vs global), but indexing is definitely the go for repeated extraction over time so depends on the problem, and I think being able to convert between global and local index is useful.

Then there's bilinear vs nearest neighbour but that can be schemed later 😄

ctoney commented 5 months ago

@mdsumner Apologies, just getting back to this. I wanted to check in on status and how you would like to proceed.

I don't really know what it should be called, or how to differentiate col/row input from xy input for a function signature.

read_col_row() is fine. Possibly read_cell() or read_pixel()? Could also consider making this a C++ stand-alone function, optionally with an R wrapper in front of it. It could accept aGDALRaster object as parameter. The reason to consider that would be more flexibility in having (possibly) several optional function parameters with default values, and maybe easier to handle "differentiate col/row input from xy input for a function signature". It's also possible that this could be all in R with minimal implications for performance.

My current R code for this is in package FIESTAutils (the pixel extract code is old, and basically just a port of even older Python code). I have some internal functions in this package: https://github.com/USDAForestService/FIESTAutils/blob/main/R/raster_analysis.R

getPixelValue() is at line 131. It has bilinear interpolation at line 216.

My current need for production work is to modernize the pixel extraction workflow a bit, and move from Python to R-based. Mainly that the raster data would be read from URL/cloud storage instead of processed locally (~50 to 100 bands of imagery, DEM derivatives, etc at a time). We extract values for point data that have ~100,000 to 1,000,000 xy in the same projection as the raster data. In some cases, the extracted values need to use bilinear interpolation or be derived from a kernel of 3x3, 5x5 etc. This is done frequently with a need for quick turnaround, so making it as efficient as possible for the case of raster data on remote file systems is the current task I'm working on.

mdsumner commented 5 months ago

I wonder if it should read a block and do local in-memory indexing, or read a block to an in memory file and run the RasterIO calls on that, either 1x1 or 2x2 windows with "local" get pixel/row calls.

There's a lot of ways this could be done. I'll have a try. Generally I think I want to work on remote sources, so getting blocks makes sense but it's very different if the block is a gdaldataset vs an array 👌, and there's neat indexing tricks for working in memory with R arrays for bilinear.

mdsumner commented 4 months ago

just to note, I will get back to this - just had a lot going on but my explorations have been promising. I think it definitely makes sense to batch a large number of points into the existing tiling (or a super-tiling of that) and do the indexing in memory after reading a whole block, but there case of iterating over every point for a 1-cell read is still useful.

mdsumner commented 3 months ago

new GDAL InterpolateAtPoint for reference:

https://github.com/OSGeo/gdal/pull/10461

USDAForestService / gdalraster

framework for point-extraction #338