SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io
Other
26 stars 12 forks

integrated arrow sampler into the icesat2 and gedi endpoints #422

Closed jpswinski closed 1 month ago

jpswinski commented 1 month ago

Updates:

Benchmark:

Running 1 granule through David's notebook, which has 63,885 points of interest and samples 3DEP across 23,076 rasters, I get these timings:

| Output | SingleStop w/ Caching | MultiStop w/o Caching |
| --- | --- | --- |
| Parquet (ArrowSampler) | 12 min, 13 sec | 18 min, 5 sec |
| Stream (RasterSampler) | 11 min, 20 sec | 15 min, 35 sec |

Note: the "SingleStop w/ Caching" option bypassed the linear feature search for about 85% of the points of interest.
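To illustrate why caching skips most of the linear searches, here is a minimal Python sketch of a "first hit" lookup that tries the previously matched raster before scanning the feature list. It is not the SlideRule implementation (the server code is not shown in this thread); `Footprint`, `FirstHitFinder`, and the tile layout are hypothetical names and data made up for the example. It also assumes that returning the cached raster is acceptable even when an earlier entry in the list overlaps the same point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned bounding box of one raster (hypothetical stand-in for an index feature)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


class FirstHitFinder:
    """Returns one raster containing a point, trying the previous hit first."""

    def __init__(self, footprints):
        self.footprints = list(footprints)
        self.last_hit = None
        self.cache_hits = 0
        self.linear_searches = 0

    def find(self, x, y):
        # Fast path: consecutive points along a track usually fall in the same raster.
        if self.last_hit is not None and self.last_hit.contains(x, y):
            self.cache_hits += 1
            return self.last_hit
        # Slow path: linear scan over the whole feature list, keeping the first match.
        self.linear_searches += 1
        self.last_hit = next((f for f in self.footprints if f.contains(x, y)), None)
        return self.last_hit


# Four unit-wide tiles side by side, and a short made-up track crossing them.
tiles = [Footprint(f"tile_{i}", i, 0.0, i + 1.0, 1.0) for i in range(4)]
finder = FirstHitFinder(tiles)
track = [(0.1, 0.5), (0.4, 0.5), (0.8, 0.5), (1.2, 0.5), (1.6, 0.5), (3.5, 0.5)]
hits = [finder.find(x, y).name for x, y in track]
print(hits, finder.cache_hits, finder.linear_searches)
```

In this toy track half of the lookups are served from the cache; the hit rate depends entirely on how densely the points sample each raster.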

jpswinski commented 1 month ago

wrzesien_mountain_snow_sieve_poly.zip

jpswinski commented 1 month ago

import geopandas as gpd
from sliderule import sliderule, earthdata

# time_start and time_end (the temporal range of the query) are assumed to be defined earlier in the notebook
region_gdf = gpd.read_file('./wrzesien_mountain_snow_sieve_poly.geojson')
region = sliderule.toregion(region_gdf, cellsize=0.01)
earthdata.set_max_resources(999999)
usgs3dep_catalog = earthdata.tnm(short_name='Digital Elevation Model (DEM) 1 meter', time_start=time_start, time_end=time_end, polygon=region['poly'])

dshean commented 1 month ago

One note: I think .set_max_resources(-1) is a cleaner way to set an unlimited number of resources.

elidwa commented 1 month ago

Implemented parallel searching of the feature list. Implemented onlyFirst with caching in the index raster base class; all derived index raster plugins support it. The temporal filter is applied when the index file is opened and the feature list is created.
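The parallel search can be pictured as splitting the feature list into chunks and scanning each chunk in its own worker. The sketch below is only an illustration of that partitioning in Python, under assumed names (`parallel_search`, `search_chunk`) and with plain bounding boxes standing in for index features; it is not the server code, and in pure Python the threads mainly demonstrate the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def contains(bbox, x, y):
    """True if the point lies inside the (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax


def search_chunk(chunk, x, y):
    """Return indices (into the full list) of features in this chunk that contain the point."""
    return [idx for idx, bbox in chunk if contains(bbox, x, y)]


def parallel_search(bboxes, x, y, workers=4):
    """Scan the feature list in `workers` contiguous chunks and merge the matches."""
    indexed = list(enumerate(bboxes))
    size = max(1, -(-len(indexed) // workers))  # ceiling division
    chunks = [indexed[i:i + size] for i in range(0, len(indexed), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: search_chunk(c, x, y), chunks)
        # map() yields results in chunk order, so the merged list stays in index order.
        return [idx for part in parts for idx in part]


# A 10 x 10 grid of unit tiles plus one large overlapping tile at the end (made-up data).
bboxes = [(col, row, col + 1, row + 1) for row in range(10) for col in range(10)]
bboxes.append((0, 0, 5, 5))
matches = parallel_search(bboxes, 2.5, 3.5)
print(matches)
```

Because the merged result keeps the original index order, an onlyFirst-style lookup could take `matches[0]` and get the same answer a sequential scan would return.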

During testing I discovered that the searchearth tnm server ignores the temporal range: it returns all rasters for the AOI. In this case 23,076 rasters were returned. When the same temporal range was given to the raster sampler, the sampler code filtered out most of them and kept only the 4,191 that were within the temporal range.

If the tnm server is queried with a temporal range, the same range must also be passed to the raster constructor so that rasters outside that range are filtered out.
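For readers who want to see what such a temporal filter amounts to, here is a small Python sketch that drops catalog features outside a time range. It is an illustration only: it assumes a GeoJSON-style catalog whose features carry an ISO 8601 `datetime` property, and `filter_by_time`, `parse_utc`, and the sample entries are hypothetical, not part of the SlideRule client or the TNM API.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse an ISO 8601 timestamp; a trailing 'Z' or a missing offset is treated as UTC."""
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def filter_by_time(catalog, time_start, time_end, key="datetime"):
    """Return a copy of the catalog keeping only features dated within [time_start, time_end]."""
    start, end = parse_utc(time_start), parse_utc(time_end)
    kept = [
        feature for feature in catalog["features"]
        if start <= parse_utc(feature["properties"][key]) <= end
    ]
    return {**catalog, "features": kept}


# Made-up catalog with three rasters; only the middle one falls inside 2021.
catalog = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"datetime": "2019-06-01T00:00:00Z", "url": "a.tif"}},
        {"type": "Feature", "geometry": None,
         "properties": {"datetime": "2021-03-15T00:00:00Z", "url": "b.tif"}},
        {"type": "Feature", "geometry": None,
         "properties": {"datetime": "2023-01-10T00:00:00Z", "url": "c.tif"}},
    ],
}
filtered = filter_by_time(catalog, "2021-01-01T00:00:00Z", "2021-12-31T23:59:59Z")
print([f["properties"]["url"] for f in filtered["features"]])
```

Filtering before the search keeps the feature list small, which is the same effect described above for the sampler (23,076 rasters reduced to 4,191).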

These changes optimize searching of the feature list: keeping it small, reading it in parallel, and caching when onlyFirst is set. TODO: batch raster sampling.

jpswinski commented 1 month ago

Merged into main

dshean commented 1 month ago

All sounds great @elidwa! Seems like we might want to file a ticket with TNM API team about the temporal filter in the API query. Maybe something having to do with "tile creation/modification date" vs. "source data date." Ideally we would filter on the latter, but that is not available in the 3DEP 1m tile index, as many tiles include data from multiple lidar acquisitions.