alexander-petkov opened this issue 4 years ago
So I think the S3-geotiff (COG) is basically just a tiled and compressed GeoTIFF, and most of that can be done using gdal_translate.
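As a sketch of that idea, the tiling and compression can be expressed as a gdal_translate invocation; TILED, BLOCKXSIZE/BLOCKYSIZE, and COMPRESS are standard GTiff creation options, while the file names and block size below are hypothetical:

```python
# Build a gdal_translate command that tiles and compresses a GeoTIFF.
# TILED/BLOCKXSIZE/BLOCKYSIZE/COMPRESS are GTiff creation options;
# the file names here are placeholders.
def optimize_cmd(src, dst, blocksize=512, compress="DEFLATE"):
    return [
        "gdal_translate", src, dst,
        "-co", "TILED=YES",
        "-co", f"BLOCKXSIZE={blocksize}",
        "-co", f"BLOCKYSIZE={blocksize}",
        "-co", f"COMPRESS={compress}",
    ]

print(" ".join(optimize_cmd("gfs_precip.tif", "gfs_precip_opt.tif")))
```

A true COG additionally needs internal overviews (gdaladdo) and a specific file layout, so this covers only the tiling/compression part.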
Experiment with raster format optimization for OGC services. This is meant to be a space for keeping notes with regard to different data formats: ease of configuration, space, optimized data delivery, etc.
Configure a method to gather metrics with respect to using different raster formats. Add S3-geotiff to the mix: https://www.cogeo.org
ImageMosaic index store performance
PostGIS vs H2 indexing performance for mosaics (blue is PostGIS, red is H2). The raster data are 206 GeoTIFFs for the GFS 0.25 deg Total_precipitation layer.
The data were gathered via a GetTimeSeries request over 100 runs for a single point:
{ for t in $(seq 1 100); do /usr/bin/time -p curl "http://localhost:8080/geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetTimeSeries&FORMAT=text%2Fcsv&TIME=2019-09-20T03:00:00.000Z/2019-10-06T00:00:00.000Z&QUERY_LAYERS=test:Total_precipitation&STYLES&LAYERS=test:Total_precipitation&INFO_FORMAT=text%2Fcsv&FEATURE_COUNT=1&X=1&Y=1&SRS=EPSG%3A4326&WIDTH=1&HEIGHT=1&BBOX=-116.%2C37%2C-115%2C38"; done; } 2> postgis_gtiff_mosaic.log
Postgis has an edge. In either case, not bad at all for loading and reading through 206 images!!
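To turn the `2> ...log` output into numbers, the `real` lines that `/usr/bin/time -p` writes can be pulled out and summarized; a minimal sketch (the sample log contents are illustrative):

```python
import re
import statistics

def real_times(log_text):
    # /usr/bin/time -p prints "real <seconds>" (plus user/sys lines) per run
    return [float(s) for s in re.findall(r"^real\s+([0-9.]+)", log_text, re.M)]

sample = "real 0.42\nuser 0.10\nsys 0.03\nreal 0.38\nuser 0.09\nsys 0.02\n"
times = real_times(sample)
print(f"mean={statistics.mean(times):.2f}s median={statistics.median(times):.2f}s")
```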
Seems like PostGIS is the winner for this task. Pretty fast... Nice comparison, Alex.
MJ
[image: chart(1)] https://user-images.githubusercontent.com/39599557/66175119-9fba5000-e615-11e9-9fe2-9030b1d94884.png
Raster format performance:
I used the same time-series ImageMosaic coverage (206 raster grids, each with a different time step) to measure performance for the GeoTIFF, NetCDF, and Grib formats.
I timed the same GetTimeSeries request above over 100 runs to get metrics, reconfiguring the mosaic with a different raster format (same data) after each 100-run series. All are backed by a PostGIS index, running on the same machine. The large spike during the NetCDF runs can be ignored.
That's a huge difference. Wow!
[image: Geotif, nc and grb] https://user-images.githubusercontent.com/39599557/66247405-65b28200-e6d9-11e9-8971-01203a2c5896.png
Cloud Optimized GeoTIFF (COG) driver available in GDAL, as of 3.1:
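With that driver, the manual tiling/overview recipe collapses to a single output-format flag; a hypothetical invocation (COMPRESS is one of the COG driver's creation options, file names are placeholders):

```python
# Build a gdal_translate command using GDAL's COG output driver (>= 3.1),
# which handles tiling and overview generation internally.
def cog_cmd(src, dst, compress="DEFLATE"):
    return ["gdal_translate", "-of", "COG",
            "-co", f"COMPRESS={compress}", src, dst]

print(" ".join(cog_cmd("input.tif", "output_cog.tif")))
```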
NetCDF can be 1.5 to 2 times faster than Grib. GeoTIFF, on a very conservative estimate, is 5 times faster than NetCDF (on many runs even up to 10 times faster), and roughly 10 times faster than Grib.
The Grib format, as I understand it, is heavily optimized for compact size, where every bit matters. Everything is coded and then looked up in external tables during reads.
Are those Cloud-Optimized GeoTIFFs that you tested or just regular GTiffs?
Regular--they are horizontally tiled by default:
Band 1 Block=321x6 Type=Float32, ColorInterp=Gray
I did more speed metrics, this time with RTMA data (241 granules).
GetTimeSeries request:
time curl "http://localhost:8080/geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetTimeSeries&FORMAT=text%2Fcsv&TIME=2019-09-10T22:00:00.000Z/2019-09-20T22:00:00.000Z&QUERY_LAYERS=rtma:Temperature&STYLES&LAYERS=rtma:Temperature&INFO_FORMAT=text%2Fcsv&FEATURE_COUNT=1&X=1&Y=1&SRS=EPSG%3A4326&WIDTH=1&HEIGHT=1&BBOX=-116%2C37%2C-115%2C38"
TIF Float64 rasters: initial request: 0m15.905s after caching: 0m5.254s
TIF Int16 rasters: initial request: 0m12.388s after caching: 0m3.267s
Grib files: initial request: 0m51.913s after caching: 0m35.981s
Looks like data type matters as well... We can achieve a ~10x speed increase by using GeoTIFFs and the appropriate data type.
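One way the Float64-to-Int16 repacking can work is a linear scale/offset; the numbers below (0.1 K resolution, 200 K offset) are illustrative only, not what the RTMA files actually use:

```python
# Pack physical values into Int16 codes with a linear scale/offset.
# SCALE/OFFSET here are made-up values for illustration.
SCALE, OFFSET = 0.1, 200.0

def pack(value):
    return round((value - OFFSET) / SCALE)   # stored Int16 code

def unpack(code):
    return code * SCALE + OFFSET             # back to physical units

code = pack(293.15)
assert -32768 <= code <= 32767               # fits in Int16
print(code, unpack(code))
```

Something like `gdal_translate -ot Int16 -scale` can do the equivalent repacking at the file level.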
I experimented with GeoTIFF compression on RTMA scenes. Initially, I read online sources regarding GeoTIFF read/write performance using various compression algorithms.
In the end, I tested only the core three: LZW, Deflate, and PackBits. This is because GeoServer uses Java ImageIO for reading and writing the GeoTIFF format, and compression algorithms implemented more recently in GDAL (>2.0.0) are not recognized.
Here are some test runs of reading a coordinate point from the RTMA archive. This is on an AWS instance, over a 10GigE link:
UNCOMPRESSED: 30 sec
LZW: 55 sec
DEFLATE: 29 sec
PACKBITS: 30 sec
The same test was performed on a Plume VM instance, using Ceph HDD storage, over a 1GigE link. The instance has 4 cores available and 8GB of RAM. Using the DEFLATE compression method, with default parameters:
The same test on an 8-core instance with 16GB RAM:
The fastest time I have gotten previously on Plume VMs with Ceph storage is approximately 3m30s.
Maybe this comment belongs in this ticket.
Linking here.
Imagine we have a scenario where each extent (rtma, ndfd, or gfs) is divided into n by n parts, in this case 3 by 3:
0 | 1 | 2 |
---|---|---|
3 | 4 | 5 |
6 | 7 | 8 |
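The coordinate-to-subregion lookup this scheme needs can be sketched as follows (indices are row-major with 0 at the top left, as in the grid above; the extent values in the example are made up):

```python
# Map a query point to one of the n x n "submosaic" indices,
# numbered row-major with 0 at the top left, as in the grid above.
def subregion(lon, lat, extent, n=3):
    minx, miny, maxx, maxy = extent
    col = min(int((lon - minx) / (maxx - minx) * n), n - 1)
    row = min(int((maxy - lat) / (maxy - miny) * n), n - 1)  # row 0 at the top
    return row * n + col

extent = (-126.0, 24.0, -66.0, 50.0)    # illustrative lon/lat bounds
print(subregion(-120.0, 45.0, extent))  # top-left cell -> 0
```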
I did a small test, where I configured one such subregion (top left) and compared performance times vs the current implementation:
Some results, running on my laptop:
Not 9 times faster, but still a considerable improvement (3x faster or better).
This approach will increase performance quite a bit, but it will be a nightmare to maintain (36 mosaics for each archive, or 108 mosaics for all 3 archives).
These "submosaics" can be hidden from being listed in the GeoServer catalog. Instead, they can be grouped (see layer groups in the GeoServer docs), and can be requested altogether, for example in a GetMap request. I haven't confirmed this in practice.
Wow, that's a pretty big benefit in speed-up. Do you think the extra complexity is worth it?
- Each scene is horizontally tiled
- Every subregion is configured as a separate "submosaic", if you will.
- Based on user input coordinates, a subregion is picked from an index, and only the appropriate "submosaic" is queried for point data.
[image: RTMA point query_ tiled vs not tiled] https://user-images.githubusercontent.com/39599557/91229557-4f455880-e6e7-11ea-8a68-f1e2174653a4.png
It might be worth it, given the potential increase in speed.
Even more so in the long run, if we look to provide data from a 20-year long archive.
It will take some time to implement... I have been experimenting with the Python gsconfig API for creating/updating coverages en masse.
Right now it takes around 9 sec to generate a Weatherstream file from all 3 archives. When I started initially, it took 38 minutes.
This approach could bring times down to 3 sec or less.
I tested the "Whole scene vs the sub-mosaic" scenario on AWS, using the RTMA archive.
I was expecting much more dramatic results:
Query time decreased by less than a second...
Given these disappointing numbers, along with the complexity in managing over 100 mosaics per archive, I will abandon this approach.
Are these GeoTIFFs being queried now?
I was expecting much more dramatic results: [image: RTMA on AWS_ tiled vs not tiled (seconds)] https://user-images.githubusercontent.com/39599557/91476001-a0755980-e859-11ea-9922-ead4504af946.png
Yes