Use large_image to subsample imagery

banesullivan commented 3 years ago

Switch the subsampling endpoints/backend to leverage large_image rather than the code in rgd/geodata/models/imagery/subsample.py

large_image has a getRegion function. It can take the corner points (in pixel space or in a projection's coordinates) and the desired output resolution or magnification. It outputs a rectangular image or NumPy array.
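
For reference, a minimal sketch of what such a call could look like. The helper name `get_region_array` is hypothetical, and the keyword names (`region`, `output`, `format`) and the `"numpy"` format string reflect my understanding of large_image's API, so check them against its documentation before relying on this. The tile source is passed in already opened, which keeps the helper free of any large_image import.

```python
def get_region_array(source, left, right, top, bottom,
                     units="base_pixels", max_width=None):
    """Ask an opened tile source for a rectangular region as an array.

    `source` is assumed to be a large_image tile source, e.g. the result of
    large_image.getTileSource(path). `units` is assumed to accept either
    "base_pixels" or a projection string for geospatial sources.
    """
    region = {
        "left": left,
        "right": right,
        "top": top,
        "bottom": bottom,
        "units": units,
    }
    # "numpy" is assumed to match large_image.constants.TILE_FORMAT_NUMPY.
    kwargs = {"region": region, "format": "numpy"}
    if max_width is not None:
        # Optional downscaling: cap the output width in pixels.
        kwargs["output"] = {"maxWidth": max_width}
    # getRegion is assumed to return (data, format_or_mime_type).
    data, _fmt = source.getRegion(**kwargs)
    return data
```

Usage would be along the lines of `get_region_array(source, 0, 512, 0, 512, max_width=256)`.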

getRegion doesn't handle GeoJSON in any way; it only takes min/max values in x and y, in pixel or projection coordinates. For these endpoints we will therefore simply use the min/max bounds of the GeoJSON geometry.
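
Reducing a GeoJSON geometry to its min/max bounds can be sketched in plain Python as below. The function names are hypothetical, and the sketch assumes a geometry dict with a `coordinates` member (so not a GeometryCollection) and ignores any z values.

```python
def iter_positions(coords):
    """Yield every [x, y, ...] position from nested GeoJSON coordinates."""
    # A position is a list whose first item is a number; anything else
    # is a list of positions, rings, or polygons that we recurse into.
    if coords and isinstance(coords[0], (int, float)):
        yield coords
        return
    for item in coords:
        yield from iter_positions(item)


def geojson_bounds(geometry):
    """Return (xmin, xmax, ymin, ymax) for a GeoJSON geometry dict."""
    points = list(iter_positions(geometry["coordinates"]))
    if not points:
        raise ValueError("geometry has no coordinates")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), max(xs), min(ys), max(ys)
```

The return order matches the `<xmin>/<xmax>/<ymin>/<ymax>` order used in the proposed routes.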

Work to be done in RGD

[ ] Create new endpoints that accept an image ID and four values for min/max coords
- This will need to have two versions: image vs world coords
  - api/geoprocess/imagery/<int:pk>/region/world/<xmin>/<xmax>/<ymin>/<ymax>
  - api/geoprocess/imagery/<int:pk>/region/pixel/<umin>/<umax>/<vmin>/<vmax>
- Use large_image's getRegion to handle generating a subsampled image from the given parameters
- Return the extracted region image file from the endpoint
  - can this be done in the request thread or should this be a task? cc @manthey
  - If a task, we may need to create a new ImageFile and ImageEntry. Note that this issue is why I added SubsampledImage in the first place
[ ] Create another endpoint for using an annotation as the ROI
- api/geoprocess/imagery/<int:pk>/region/annotation/<int:annotation_id>
- This will basically wrap the pixel coordinates ROI endpoint with the min/max of the annotation bounds
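
To make the three route shapes concrete, here is a framework-free sketch of how they relate. In RGD this would be done with Django URL patterns; the regex, `parse_region_route`, and `pixel_route_for_bounds` are hypothetical names used only for illustration.

```python
import re

# Matches the two min/max routes proposed above (world and pixel).
REGION_ROUTE = re.compile(
    r"^api/geoprocess/imagery/(?P<pk>\d+)/region/(?P<space>world|pixel)/"
    r"(?P<a>[^/]+)/(?P<b>[^/]+)/(?P<c>[^/]+)/(?P<d>[^/]+)/?$"
)


def parse_region_route(path):
    """Return (pk, space, (min1, max1, min2, max2)) for a region route."""
    match = REGION_ROUTE.match(path)
    if match is None:
        raise ValueError(f"not a region route: {path!r}")
    # float() raises ValueError on non-numeric path segments.
    min1, max1, min2, max2 = (float(match[key]) for key in "abcd")
    if min1 >= max1 or min2 >= max2:
        raise ValueError("each min must be strictly less than its max")
    return int(match["pk"]), match["space"], (min1, max1, min2, max2)


def pixel_route_for_bounds(pk, umin, umax, vmin, vmax):
    """Build the pixel ROI route that an annotation endpoint could wrap."""
    return f"api/geoprocess/imagery/{pk}/region/pixel/{umin}/{umax}/{vmin}/{vmax}"
```

The annotation endpoint would look up the annotation's pixel bounds and then hand the same four values to the pixel ROI logic.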

Work to be done directly in `large_image`

[x] The output of getRegion() doesn't have any geospatial information attached to it -- we need to set it to output the region as a Cloud Optimized GeoTiff (or regular GeoTiff)

subdavis commented 3 years ago

Based on the technical requirements, I may be misunderstanding the purpose of this feature.

I thought the point of this was to create subsamples for later retrieval. If creating ImageFile and ImageEntry is optional, then the caller must expect immediate results. Should this endpoint be idempotent and return a previously generated subsample if one exists?
Likewise, I don't see how removing SubsampledImage is an option. You need somewhere to store the pointer to the ImageEntry with the parameters that created it. I think you can remove all the extra stuff that goes along with that model and have it be a normal rest model that triggers a job.
If this happens in a worker job, there needs to be a new endpoint not listed here. The two you listed generate new SubsampledImage models that trigger jobs to be created. They return the model, which has a nullable URI reference to an ImageEntry which will be null. Hitting the same endpoint again (or maybe adding a get-by-id endpoint for subsampled imagery) will have to be tried until the job completes and the new URI populates. Is this correct?
Won't creating a new ImageEntry cause duplicate data to appear in the search results? Does a new field need to be added to indicate that an image is not a derived product? Should a left outer join be performed?
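
For the worker-job case, the client-side polling I'm describing would look roughly like this. It is only a sketch: `fetch` stands in for a get-by-id request that does not exist yet, and the `"uri"` key is a placeholder for whatever the nullable reference ends up being called.

```python
import time


def wait_for_subsample(fetch, pk, interval=1.0, timeout=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(pk) until the record's "uri" field is populated.

    fetch is a caller-supplied callable returning a dict for the
    subsample record; sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        record = fetch(pk)
        if record.get("uri") is not None:
            # The job has finished and the reference is populated.
            return record
        if clock() >= deadline:
            raise TimeoutError(f"subsample {pk} not ready after {timeout}s")
        sleep(interval)
```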

subdavis commented 3 years ago

Answers:

No need to remove SubsampledImage.
No need to create a new ImageFile or ImageEntry yet. Data can go directly into a checksum file.
Mostly this work just involves rewriting the task_funcs to use large_image and removing most of what's in subsample.py.

aashish24 commented 3 years ago

(For the future) I am wondering whether @manthey would consider adding support for taking GeoJSON as input to getRegion?

manthey commented 3 years ago

Currently large_image getRegion can take coordinates in a variety of units (pixel space, physical distance in pixel space, any projection for geospatial entries). It can optionally scale the results. It currently only outputs a numpy array or an image and has the limitation that the entire region has to fit in memory (and, if outputting an image, the image has to be able to be created with PIL, which effectively limits it to 1 gigapixel). Some of the output can be transparent (or, more precisely, the output is likely to be in RGBA or LA color space).
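
A quick way to sanity-check a request against those limits before calling getRegion is sketched below. The function name and defaults are hypothetical, and the 1 gigapixel constant is just the rough figure mentioned above, not a value read from PIL.

```python
# Rough figure from the discussion above, not an exact PIL constant.
PIXEL_LIMIT = 1_000_000_000


def estimate_region(width, height, channels=4, bytes_per_sample=1):
    """Return (pixel_count, byte_count, within_limit) for an output region.

    channels=4 assumes RGBA output with one byte per sample.
    """
    pixels = width * height
    size_bytes = pixels * channels * bytes_per_sample
    return pixels, size_bytes, pixels <= PIXEL_LIMIT
```

For example, a 40000 x 30000 output is 1.2 gigapixels and would need about 4.8 GB as 8-bit RGBA, so it fails both the image and the in-memory path on most machines.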

To properly support RGD, we need to output a COG (not a tif through PIL) which includes geolocation data. For generality, for non-geospatial data large_image's getRegion should also be able to output pyramidal tiffs. Conceptually this isn't hard.

getRegion currently outputs a single image, I think with the limitation that the output is never more than 4 channels (RGBA), but I'll have to confirm that is true with numpy output as well. For hyperspectral files, this means that we might be picking no more than three channels (though you can composite the data as part of the process via the style options). For palettized files (e.g., land use data where the pixel values are categories), getRegion still outputs an RGBA image -- this means that the categorical information can be lost. If we need to preserve all of this, it means that when we output a region from large_image, the gdal tilesource might override the getRegion method when COG is requested and internally use GDAL to do the work. I think we'd only ever want to do this if no styling was applied (i.e., the bands aren't being remapped as part of the getRegion request).

If we add GeoJSON region selection, I'd want to support that for all tile sources.

manthey commented 3 years ago

See https://github.com/girder/large_image/issues/567 and https://github.com/girder/large_image/issues/566

manthey commented 3 years ago

See https://github.com/girder/large_image/pull/594.

manthey commented 3 years ago

@banesullivan The PR on large_image is ready to be used and has basic tests. You can pull from that PR's branch to try things out.

banesullivan commented 3 years ago

Great! Thanks for the update. I will start testing this.

banesullivan commented 3 years ago

WIP in https://github.com/ResonantGeoData/ResonantGeoData/pull/346
