SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Errors reading ArcticDEM mosaic #299

Closed jpswinski closed 1 year ago

jpswinski commented 1 year ago

Sampling the ArcticDEM mosaic raster has stopped working.

From the production servers, we see this error message:

2023-08-10 13:23:03 | ip=10.0.1.39 level=critical caller=geo.cpp:239 msg="GDAL ERROR 1: Line 62907: Didn't find expected attribute value."
2023-08-10 13:23:03 | ip=10.0.1.39 level=critical caller=GdalRaster.cpp:184 msg="Error sampling raster: Failed to opened raster: /vsis3/pgc-opendata-dems/arcticdem/mosaics/v3.0/2m/2m_dem_tiles.vrt:"

From a local run of the server code, the system core dumped:

#0  __memcpy_simd () at ../sysdeps/aarch64/multiarch/memcpy_advsimd.S:160
#1  0x0000ffffb5a8f7ac in CPLWriteFct(void*, unsigned long, unsigned long, void*) () from /usr/local/lib/libgdal.so.32
#2  0x0000ffffb6c60150 in ?? () from /lib/aarch64-linux-gnu/libcurl.so.4
#3  0x0000ffffb6c70bc4 in ?? () from /lib/aarch64-linux-gnu/libcurl.so.4
#4  0x0000ffffb6c7a204 in ?? () from /lib/aarch64-linux-gnu/libcurl.so.4
#5  0x0000ffffb6c7b180 in curl_multi_perform () from /lib/aarch64-linux-gnu/libcurl.so.4
#6  0x0000ffffb6c72150 in curl_easy_perform () from /lib/aarch64-linux-gnu/libcurl.so.4
#7  0x0000ffffb5a92970 in CPLHTTPFetchEx () from /usr/local/lib/libgdal.so.32
#8  0x0000ffffb613afa4 in HTTPOpen(GDALOpenInfo*) () from /usr/local/lib/libgdal.so.32
#9  0x0000ffffb664178c in GDALOpenEx () from /usr/local/lib/libgdal.so.32
#10 0x0000ffffb66893b4 in GDALDatasetPool::_RefDataset(char const*, GDALAccess, char const* const*, int, bool, char const*) () from /usr/local/lib/libgdal.so.32
#11 0x0000ffffb668a534 in GDALProxyPoolDataset::RefUnderlyingDataset(bool) const () from /usr/local/lib/libgdal.so.32
#12 0x0000ffffb668aff0 in GDALProxyPoolDataset::Create(char const*, char const* const*, GDALAccess, int, char const*) () from /usr/local/lib/libgdal.so.32
#13 0x0000ffffb5e2b890 in VRTSimpleSource::OpenSource() const () from /usr/local/lib/libgdal.so.32
#14 0x0000ffffb5e2bd6c in VRTSimpleSource::GetRasterBand() const () from /usr/local/lib/libgdal.so.32
#15 0x0000ffffb5e2c09c in VRTSimpleSource::GetSrcDstWindow(double, double, double, double, int, int, double*, double*, double*, double*, int*, int*, int*, int*, int*, int*, int*, int*, bool&) () from /usr/local/lib/libgdal.so.32
#16 0x0000ffffb5e2e708 in VRTComplexSource::RasterIO(GDALDataType, int, int, int, int, void*, int, int, GDALDataType, long long, long long, GDALRasterIOExtraArg*) ()
   from /usr/local/lib/libgdal.so.32
#17 0x0000ffffb5e222f8 in VRTSourcedRasterBand::IRasterIO(GDALRWFlag, int, int, int, int, void*, int, int, GDALDataType, long long, long long, GDALRasterIOExtraArg*) ()
   from /usr/local/lib/libgdal.so.32
#18 0x0000ffffb5e20cac in VRTSourcedRasterBand::IReadBlock(int, int, void*) () from /usr/local/lib/libgdal.so.32
#19 0x0000ffffb664c4d4 in GDALRasterBand::GetLockedBlockRef(int, int, int) () from /usr/local/lib/libgdal.so.32
#20 0x0000aaaab8ab2d70 in GdalRaster::readPixel(GdalRaster::Point const&) ()
#21 0x0000aaaab8ab3ff8 in GdalRaster::samplePOI(GdalRaster::Point const&) ()
#22 0x0000aaaab8ab4a80 in GeoRaster::getSamples(double, double, double, long, std::vector<RasterSample, std::allocator<RasterSample> >&, void*) ()
#23 0x0000aaaab8a96808 in RasterSampler::processRecord(RecordObject*, unsigned long) ()
#24 0x0000aaaab8a63944 in RecordDispatcher::dispatchRecord(RecordObject*) ()
#25 0x0000aaaab8a63d10 in RecordDispatcher::dispatcherThread(void*) ()
#26 0x0000ffffb83c4624 in start_thread (arg=0xaaaab8a63c48 <RecordDispatcher::dispatcherThread(void*)>) at pthread_create.c:477

jpswinski commented 1 year ago

from sliderule import icesat2

parms = {
  'poly': [
      {'lon': -156.6430455278934, 'lat': 71.11303515926326},
      {'lon': -156.26446195120343, 'lat': 71.27727860723829},
      {'lon': -156.7080728245955, 'lat': 71.33780162227296},
      {'lon': -156.98688849401648, 'lat': 71.21627954209416},
      {'lon': -156.6430455278934, 'lat': 71.11303515926326}],
  't0': '2023-01-01T00:00:00Z',
  't1': '2024-01-01T00:00:00Z',
  'samples': {'mosaic': {'asset': 'arcticdem-mosaic', 'algorithm': 'NearestNeighbour'}}
}

gf = icesat2.atl06p(parms, version='006')

scottyhq commented 1 year ago

After some GDAL log sleuthing today with @elidwa and @dshean we noticed the PGC VRTs changed on Aug 9! And I'm fairly certain those changes are the root cause of the sliderule errors:

CPL_DEBUG=ON \
 GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR \
 AWS_NO_SIGN_REQUEST=YES \
 CPL_CURL_VERBOSE=ON \
 time \
 gdallocationinfo -wgs84 /vsis3/pgc-opendata-dems/arcticdem/mosaics/v3.0/2m/2m_dem_tiles.vrt -150 70

Relevant bits of log below:

CURL_INFO_HEADER_IN: Last-Modified: Wed, 09 Aug 2023 16:54:52 GMT

CURL_INFO_HEADER_OUT: GET /arcticdem/mosaics/v3.0/2m/2m_dem_tiles.vrt HTTP/1.1
Host: pgc-opendata-dems.s3.amazonaws.com
User-Agent: GDAL/3.7.0
Accept: */*
Range: bytes=16384-4934014

CURL_INFO_HEADER_OUT: GET /arcticdem/mosaics/v3.0/2m/46_19/46_19_2_2_2m_v3.0_reg_dem.tif HTTP/1.1
CURL_INFO_HEADER_IN: Accept-Ranges: bytes
CURL_INFO_HEADER_IN: Content-Type: image/tiff
CURL_INFO_HEADER_IN: Server: AmazonS3
CURL_INFO_HEADER_IN: Content-Length: 2031645595

# Report:
#  Location: (943312P,1766861L)
#  Band 1:
#    <LocationInfo></LocationInfo>
#    Value: 116.615013122559
# 1.63user 2.77system 1:52.05elapsed

This is a 2GB 'Content-Length' transfer, not a Range request! So it takes almost two minutes (1:52 elapsed) to read 1 pixel from a COG!

The problem is that the entire TIF is downloaded instead of being read with a byte range request. I think this is because the new VRTs do not specify /vsicurl/ as a prefix to each TIF. If we make that change, getting a pixel value takes ~0.6 seconds:

wget https://pgc-opendata-dems.s3.amazonaws.com/arcticdem/mosaics/v3.0/2m/2m_dem_tiles.vrt
sed 's,https:,/vsicurl/https:,g' 2m_dem_tiles.vrt > 2m_dem_tiles_v3.vrt
time gdallocationinfo -wgs84 2m_dem_tiles_v3.vrt -150 70

#    Value: 116.615013122559
#real    0m0.595s
#user    0m0.185s
#sys     0m0.039s
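
For anyone who would rather patch the VRT from Python, here is a rough equivalent of the sed command above. It is only a sketch: the function name add_vsicurl_prefix is made up for illustration, and it assumes each TIF is listed in a <SourceFilename> element as a bare http(s) URL, as in the VRT discussed here. Unlike the sed one-liner, it only touches <SourceFilename> entries, trims stray whitespace around the URL, and leaves already-prefixed entries alone.

```python
import re

# Matches a <SourceFilename ...> element whose text is a bare http(s) URL.
# Entries that already start with /vsicurl/ (or any other path) do not match.
_SOURCE_RE = re.compile(
    r"(<SourceFilename\b[^>]*>)\s*(https?://[^<\s]+)\s*(</SourceFilename>)"
)


def add_vsicurl_prefix(vrt_text: str) -> str:
    """Return the VRT text with /vsicurl/ prepended to bare http(s) sources.

    Hypothetical helper for illustration; not part of GDAL or sliderule.
    """
    return _SOURCE_RE.sub(r"\1/vsicurl/\2\3", vrt_text)


# Example usage with the file names from the commands above:
#
#   with open("2m_dem_tiles.vrt") as src:
#       patched = add_vsicurl_prefix(src.read())
#   with open("2m_dem_tiles_v3.vrt", "w") as dst:
#       dst.write(patched)
```

Running the function twice gives the same result as running it once, so it should be safe to apply to a VRT that has already been patched.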

scottyhq commented 1 year ago

Also noting the same problem with the new ArcticDEM version 4.1 (https://www.pgc.umn.edu/data/arcticdem/).

(https://pgc-opendata-dems.s3.amazonaws.com/arcticdem/mosaics/v4.1/2m_dem_tiles.vrt)

elidwa commented 1 year ago

I created a branch, arcticdem_mosaics_v4.1, with the code and test changes needed to run against the latest ArcticDEM mosaics (v4.1). Unfortunately, some tests cannot complete because the box runs out of virtual memory. In particular, the tests that calculate zonal stats or resample a POI over some area using different algorithms cause loss of connectivity to the remote server. This makes sense given what Scott found.

Before PGC changed it, the VRT for v3.0 had relative TIF paths: "07_40/07_40_2_2_2m_v3.0_reg_dem.tif"

After the update, the paths are absolute URLs, for example: "https://pgc-opendata-dems.s3.us-west-2.amazonaws.com/arcticdem/mosaics/v4.1/2m/07_40/07_40_2_2_2m_v4.1_dem.tif"

As Scott pointed out, the GDAL vsis3 driver, which uses curl, cannot do its byte range magic with these paths.
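
To check which kind of source path a given VRT uses, something like the following could help. It is a sketch under assumptions: the helper names classify_source and summarize_sources and the three category labels are invented for illustration, and the "bare-url" category is simply the form that, in the logs in this thread, ended up as a full download.

```python
import re
from collections import Counter

# Captures the text of each <SourceFilename ...> element, trimmed of whitespace.
_SOURCE_RE = re.compile(r"<SourceFilename\b[^>]*>\s*([^<]*?)\s*</SourceFilename>")


def classify_source(path: str) -> str:
    """Label one VRT source path (labels are illustrative, not GDAL terms)."""
    if path.startswith("/vsi"):
        return "vsi"  # e.g. /vsicurl/https://... or /vsis3/bucket/key
    if re.match(r"https?://", path):
        return "bare-url"  # the form seen after the Aug 9 change
    return "relative-or-local"  # e.g. 07_40/07_40_2_2_2m_v3.0_reg_dem.tif


def summarize_sources(vrt_text: str) -> dict:
    """Count how many sources in the VRT fall into each category."""
    return dict(Counter(classify_source(p) for p in _SOURCE_RE.findall(vrt_text)))
```

Calling summarize_sources on the text of a downloaded VRT gives a quick count per category, which makes a change like the one on Aug 9 easy to spot.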

jpswinski commented 1 year ago

The ArcticDEM vrts have been updated by PGC to use absolute paths with /vsis3/. This has resolved the issue.
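
For reference, the mapping from the HTTPS URLs quoted earlier to /vsis3/ paths can be sketched as below. This is an illustration, not PGC's actual procedure: the function name https_to_vsis3 is hypothetical, and it assumes virtual-hosted-style S3 URLs of the form <bucket>.s3.amazonaws.com or <bucket>.s3.<region>.amazonaws.com, which is what appears in this thread. Reading a public bucket through /vsis3/ also needs AWS_NO_SIGN_REQUEST=YES, as in the gdallocationinfo command above.

```python
import re
from urllib.parse import urlparse

# Virtual-hosted-style S3 host names: <bucket>.s3.amazonaws.com or
# <bucket>.s3.<region>.amazonaws.com (also the older s3-<region> form).
_S3_HOST_RE = re.compile(r"(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com")


def https_to_vsis3(url: str) -> str:
    """Convert an S3 HTTPS URL to a GDAL /vsis3/<bucket>/<key> path.

    Hypothetical helper for illustration; raises ValueError for other hosts.
    """
    parsed = urlparse(url)
    match = _S3_HOST_RE.fullmatch(parsed.netloc)
    if match is None:
        raise ValueError(f"not a virtual-hosted-style S3 URL: {url}")
    return f"/vsis3/{match['bucket']}{parsed.path}"
```

Applied to the v4.1 URL quoted above, this yields a path under /vsis3/pgc-opendata-dems/arcticdem/mosaics/v4.1/2m/, the same bucket/key layout as the /vsis3/ path used in the original error message.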