
Raster format optimization for Geoserver #20

Open alexander-petkov opened 4 years ago

alexander-petkov commented 4 years ago

Experiment with raster format optimization for OGC services. This is meant to be a space for keeping notes on different data formats: ease of configuration, storage footprint, optimized data delivery, etc.

Configure a method to gather metrics for different raster formats. Add Cloud Optimized GeoTIFF (S3-geotiff) to the mix: https://www.cogeo.org

wmjolly commented 4 years ago

So I THINK that the S3-geotiff (COGeo) is basically just a tiled and compressed GeoTIFF and most of that can be done using gdal_translate.
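For reference, a plain GeoTIFF can be given overviews, tiling, and compression roughly like this with GDAL (a sketch; filenames are placeholders):

# Build overviews, then rewrite as a tiled, compressed GeoTIFF.
gdaladdo -r average input.tif 2 4 8 16
gdal_translate input.tif output_cog.tif \
  -co TILED=YES -co COMPRESS=DEFLATE -co COPY_SRC_OVERVIEWS=YES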


alexander-petkov commented 4 years ago

ImageMosaic index store performance

Postgis vs H2 db indexing performance for mosaics--blue is Postgis, red is H2. The raster data are 206 Geotiffs for the GFS 0.25 deg Total_precipitation layer.

[chart: Postgis (blue) vs H2 (red), GetTimeSeries timings over 100 runs]

The data was gathered via a GetTimeSeries request for a single point, repeated over 100 runs:

{ for t in $(seq 1 100); do /usr/bin/time -p curl "http://localhost:8080/geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetTimeSeries&FORMAT=text%2Fcsv&TIME=2019-09-20T03:00:00.000Z/2019-10-06T00:00:00.000Z&QUERY_LAYERS=test:Total_precipitation&STYLES&LAYERS=test:Total_precipitation&INFO_FORMAT=text%2Fcsv&FEATURE_COUNT=1&X=1&Y=1&SRS=EPSG%3A4326&WIDTH=1&HEIGHT=1&BBOX=-116.%2C37%2C-115%2C38"; done; } 2> postgis_gtiff_mosaic.log

Postgis has an edge. In either case, not bad at all for loading and reading through 206 images!!
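For reference, the Postgis index is wired up through a datastore.properties file in the mosaic directory; a minimal sketch, with placeholder connection values:

cat > /path/to/mosaic/datastore.properties <<'EOF'
SPI=org.geotools.data.postgis.PostgisNGDataStoreFactory
host=localhost
port=5432
database=mosaics
schema=public
user=geoserver
passwd=changeme
Loose\ bbox=true
preparedStatements=true
EOF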

wmjolly commented 4 years ago

Seems like Postgis is the winner for this task. Pretty fast... Nice comparison, Alex.

MJ


alexander-petkov commented 4 years ago

Raster format performance:

I used the same time-series ImageMosaic coverage (206 raster grids, each with a different time step) to measure performance for the Geotiff, NetCDF, and Grib formats.

I timed the same GetTimeSeries request above over 100 runs to get metrics, reconfiguring the mosaic with a different raster format (same data) after each 100-run series. All are backed by a Postgis index, running on the same machine. The large spike during the NetCDF runs can be ignored.

[chart: Measuring raster performance: GetTimeSeries timings for Geotiff, NetCDF, and Grib]
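For reference, one way to regenerate the same data in the other two formats is gdal_translate; a sketch, assuming a GDAL build whose netCDF and GRIB drivers support writing (filenames are placeholders):

for f in *.tif; do
  gdal_translate -of netCDF "$f" "${f%.tif}.nc"
  gdal_translate -of GRIB "$f" "${f%.tif}.grb2"
done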

wmjolly commented 4 years ago

That's a huge difference. Wow!


alexander-petkov commented 4 years ago

A Cloud Optimized GeoTIFF (COG) driver is available in GDAL as of 3.1:

https://gdal.org/drivers/raster/cog.html
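A minimal conversion sketch with that driver, assuming GDAL >= 3.1 (filenames are placeholders):

gdal_translate input.tif output_cog.tif -of COG -co COMPRESS=DEFLATE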

alexander-petkov commented 4 years ago

> That's a huge difference. Wow!

NetCDF can be 1.5 to 2 times faster than Grib. Geotiff, by a very conservative estimate, is 5 times faster than NetCDF--on many runs even up to 10 times faster--and roughly 10 times faster than Grib.

The Grib format, as I understand it, is optimized for compact size, where every bit matters. Everything is encoded and then looked up in external tables during reads.

wmjolly commented 4 years ago

Are those Cloud-Optimized GeoTIFFs that you tested or just regular GTiffs?


alexander-petkov commented 4 years ago

> Are those Cloud-Optimized GeoTIFFs that you tested or just regular GTiffs?

Regular--they are horizontally tiled (striped) by default: Band 1 Block=321x6 Type=Float32, ColorInterp=Gray
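That block line comes from gdalinfo; the layout can be checked like this (placeholder filename):

gdalinfo some_granule.tif | grep Block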

alexander-petkov commented 4 years ago

I gathered more speed metrics, this time with RTMA data (241 granules).

GetTimeSeries request:

time curl "http://localhost:8080/geoserver/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetTimeSeries&FORMAT=text%2Fcsv&TIME=2019-09-10T22:00:00.000Z/2019-09-20T22:00:00.000Z&QUERY_LAYERS=rtma:Temperature&STYLES&LAYERS=rtma:Temperature&INFO_FORMAT=text%2Fcsv&FEATURE_COUNT=1&X=1&Y=1&SRS=EPSG%3A4326&WIDTH=1&HEIGHT=1&BBOX=-116%2C37%2C-115%2C38"

TIF Float64 rasters: initial request 0m15.905s; after caching 0m5.254s

TIF Int16 rasters: initial request 0m12.388s; after caching 0m3.267s

Grib files: initial request 0m51.913s; after caching 0m35.981s

Looks like data type matters as well... We can achieve a ~10x speed increase by using Geotiffs and the appropriate data type.
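For illustration, a hypothetical Float64-to-Int16 conversion with gdal_translate; the value range is an assumption, and the implied scaling must be accounted for when reading values back:

# Map the assumed source range 230-330 (e.g. Kelvin) onto Int16 values.
gdal_translate -ot Int16 -scale 230 330 0 32000 temp_f64.tif temp_i16.tif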

alexander-petkov commented 4 years ago

Geotiff compression

I experimented with Geotiff compression on RTMA scenes. I first read online sources regarding Geotiff read/write performance with various compression algorithms.

In the end, I tested only the core three: LZW, Deflate, and PackBits. This is because Geoserver uses Java ImageIO for reading and writing the Geotiff format, and compression algorithms implemented more recently in GDAL (>2.0.0) are not recognized.
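A sketch of producing the three variants for the benchmark (placeholder filenames):

for alg in LZW DEFLATE PACKBITS; do
  gdal_translate -co COMPRESS=$alg input.tif input_$alg.tif
done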

Here are some test runs reading a coordinate point from the RTMA archive, on an AWS instance over a 10GigE link:

UNCOMPRESSED: 30 sec
LZW: 55 sec
DEFLATE: 29 sec
PACKBITS: 30 sec

The same test was performed on a Plume VM instance using Ceph HDD storage, over a 1GigE link. The instance has 4 cores and 8GB of RAM. Using the DEFLATE compression method with default parameters:

  1. Initial read of the RTMA archive took 5m25s
  2. Subsequent readings after caching took approx 1m33s

The same test on an 8-core instance with 16GB RAM:

  1. Initial reading took 4m16s.
  2. Subsequent readings took 0m38s. This is very close to AWS performance. During subsequent reads, the Ceph link was barely touched: monitoring the cluster during these requests showed KB/s transfer rates at most. More cores and RAM help immensely with multithreaded reads and caching.

The fastest time I had gotten previously on Plume VMs with Ceph storage was approximately 3m30s.

alexander-petkov commented 3 years ago

Maybe this comment belongs in this ticket.

Linking here.

alexander-petkov commented 3 years ago

Imagine we have a scenario where each extent (rtma, ndfd, or gfs) is divided into n-by-n parts, in this case 3 by 3:

0 1 2
3 4 5
6 7 8
  1. Each scene is horizontally tiled.
  2. Every subregion is configured as a separate "submosaic", if you will.
  3. Based on user input coordinates, a subregion is picked from an index (sketched below), and only the appropriate "submosaic" is queried for point data.
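A minimal sketch of that lookup for a 3-by-3 grid; the extent bounds below are placeholders, not the real rtma/ndfd/gfs extents:

#!/bin/bash
# Hypothetical extent bounds and grid size (placeholders).
MINLON=-130; MAXLON=-60; MINLAT=20; MAXLAT=55; N=3
lon=$1; lat=$2
# Column from longitude, row from latitude (row 0 is the top of the grid).
col=$(awk -v x="$lon" -v a="$MINLON" -v b="$MAXLON" -v n="$N" \
  'BEGIN { c = int((x - a) / ((b - a) / n)); if (c >= n) c = n - 1; print c }')
row=$(awk -v y="$lat" -v a="$MINLAT" -v b="$MAXLAT" -v n="$N" \
  'BEGIN { r = int((b - y) / ((b - a) / n)); if (r >= n) r = n - 1; print r }')
# Subregion index matches the 0..8 layout above.
echo "submosaic index: $((row * N + col))"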

I did a small test where I configured one such subregion (top left) and compared performance against the current implementation:

Some results, running on my laptop:

[chart: RTMA point query, tiled vs not tiled]

Not 9 times faster, but still a considerable improvement (3x faster or better).

This approach will increase performance quite a bit, but it will be a nightmare to maintain (36 mosaics for each archive, or 108 mosaics for all 3 archives).

These "submosaics" can be hidden from being listed in the Geoserver catalog. Instead, they can be grouped (see layer groups in the Geoserver docs) , and can be requested altoghether, for example in a GetMap request. I haven't confirmed this in practice.

wmjolly commented 3 years ago

Wow, that's a pretty big benefit in speed-up. Do you think the extra complexity is worth it?


alexander-petkov commented 3 years ago

> Wow, that's a pretty big benefit in speed-up. Do you think the extra complexity is worth it?

It might be worth it, given the potential increase in speed.

Even more so in the long run, if we look to provide data from a 20-year long archive.

It will take some time to implement... I have been experimenting with the Python gsconfig API for creating and updating coverages en masse.
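gsconfig wraps the Geoserver REST API, so the same bulk configuration can also be sketched with curl; credentials, workspace, store names, and paths below are placeholders:

for i in $(seq 0 8); do
  curl -u admin:changeme -XPUT -H "Content-type: text/plain" \
       -d "file:///data/mosaics/rtma_sub_$i" \
       "http://localhost:8080/geoserver/rest/workspaces/rtma/coveragestores/rtma_sub_$i/external.imagemosaic"
done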

Right now it takes around 9 sec to generate a Weatherstream file from all 3 archives. When I started initially, it took 38 minutes.

This approach could bring times down to 3 sec or less.

alexander-petkov commented 3 years ago

I tested the "Whole scene vs the sub-mosaic" scenario on AWS, using the RTMA archive.

I was expecting much more dramatic results:

[chart: RTMA on AWS, tiled vs not tiled (seconds)]

Query time decreased by less than a second...

Given these disappointing numbers, along with the complexity of managing over 100 mosaics across the archives, I will abandon this approach.

wmjolly commented 3 years ago

Are these GeoTIFFs being queried now?


alexander-petkov commented 3 years ago

Are these GeoTIFFs being queried now?

Yes