ajnisbet / opentopodata

Open alternative to the Google Elevation API!
https://www.opentopodata.org
MIT License

Question: performance challenge with FI 2m DEM via VRT #90

Closed by MaRaSu 6 months ago

MaRaSu commented 9 months ago

First: thanks for developing and open-sourcing OpenTopoData - great documentation, and it was very easy to set up.

I'm using high-resolution Finland DEM data with a 2 m grid. I built a VRT for the dataset, which is in the Finnish ETRS-TM35FIN CRS; the tiles are 3000 x 3000 pixels with LZW compression, and the biggest tiles are over 25 MB.

I'm feeding the elevation API coordinates from a route planner (just a little less dense than a GPX track would be), so the coordinates are close to each other, and for my test cases most of them should fall within a single tile. My challenge is performance: for 95 coordinates (my chunking threshold) I get a response time of about 12-15 s on my 6-core (Intel 8700K) / 64 GB server with the data on a fast HDD; on my other test server, a low-end 2 vCPU / 2 GB cloud server with a cloud SSD, the same request takes about 26 s. On the faster server a single coordinate takes about 460 ms and 10 coordinates take 1663 ms.

I would be very grateful for ideas on how to improve the performance. Is the VRT / CRS / tile format the cause of the dramatically lower performance compared to the benchmark on the documentation pages?
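
For reference, I'm querying the API from Python roughly like this (the host, dataset name and coordinates below are placeholders rather than my actual setup):

import requests

# A few sample route points in Finland (WGS84 lat, lon); in practice the
# route planner gives me ~95 points per request.
route_points = [(63.25, 24.15), (63.26, 24.16), (63.27, 24.17)]

# opentopodata accepts multiple locations in one request, separated by "|".
locations = "|".join(f"{lat},{lon}" for lat, lon in route_points)
response = requests.get(
    "http://localhost:5000/v1/finland-2m",
    params={"locations": locations},
)
response.raise_for_status()
elevations = [result["elevation"] for result in response.json()["results"]]
print(elevations)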

ajnisbet commented 9 months ago

Yes that is much too slow!

Could you try moving a single tile file to its own folder, adding a new opentopodata dataset for that folder, and seeing how long it takes to query points within that file's bounds? That will tell us whether the slowness comes from the VRT or from the underlying data files.
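
Something like this should work for the throwaway test dataset (the folder name is just an example):

datasets:
- name: finland-2m-single-tile
  path: data/finland-2m-single-tile/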

That said, I'm almost certain the VRT is the issue. The 2m Finland DEM will mean about 7000 tiles in your VRT. Unfortunately gdal (and therefore opentopodata) is surprisingly slow and naive in parsing large VRTs, and I suspect that is the bottleneck.

It's been on my todo list to improve this situation (by manually parsing VRTs instead of relying on gdal, building an in-memory spatial index, or nesting VRTs), but I probably won't be getting round to that any time soon, sorry.


Because the Finnish tiles are laid out on a nice regular grid, it is possible to use the files directly with opentopodata if you're willing to rename them according to their lower-left corner.

File P4/P42/P4232A.tif has the bounds

Upper Left  (  356000.000, 7020000.000) ( 24d 7'43.93"E, 63d16'47.67"N)
Lower Left  (  356000.000, 7014000.000) ( 24d 8' 3.15"E, 63d13'34.05"N)
Upper Right (  362000.000, 7020000.000) ( 24d14'53.99"E, 63d16'56.16"N)
Lower Right (  362000.000, 7014000.000) ( 24d15'12.42"E, 63d13'42.53"N)

If you rename the file to P4/P42/P4232A_N7014000E356000.tif and add the dataset to config.yaml like

datasets:
- name: finland-2m-renamed
  path: data/finland-2m-renamed/
  filename_tile_size: 6000  # Metres
  filename_epsg: 3067  # ETRS-TM35FIN

then opentopodata can find the files directly without a VRT.

I know that it's not ideal to rename your files, especially when you have to open them with gdal/rasterio to find the bounds!

MaRaSu commented 9 months ago

@ajnisbet I very much appreciate your super fast reply and detailed instructions. Your guess that the VRT was the problem was indeed spot on. I tested with a single renamed tile, queried 95 coordinates, and the response came back in ~360 ms on my faster server. I also checked (just to be sure) the elevation values for those same 95 coords: there are small differences between the VRT and no-VRT approaches (range -0.21 m to 0.47 m, mean 0.03 m, median 0.01 m, standard deviation 0.11 m), which is good enough for my needs.

Now I just need to build a little script to do the renaming. I somehow recall seeing something like that either in the docs or in an issue discussion? Anyway, I can get it sorted one way or another. Again, thanks for your support!

It was also good to learn that GDAL has this problem with VRT performance. I'm using quite a bit of other data in the Finnish ETRS-TM35FIN CRS as VRTs because it is convenient - but apparently not very performant!

MaRaSu commented 9 months ago

For others potentially facing the same challenge, here is a simple Python script that renames the files to the SRTM naming convention using the bounds stored in the GeoTIFF files:

import os
import rasterio
from glob import glob

def main():
    dataset_folder = './data/'  # Adjust this to your dataset folder
    pattern = os.path.join(dataset_folder, '**', '*.tif')
    paths = sorted(glob(pattern, recursive=True))

    for old_path in paths:
        # Read the bounds, then close the dataset before renaming
        # (renaming a file that is still open fails on some platforms).
        with rasterio.open(old_path) as dataset:
            bounds = dataset.bounds

        # Use the lower-left corner (bounds.left, bounds.bottom)
        easting, northing = bounds.left, bounds.bottom

        # Build the new filename with the N{northing}E{easting} suffix
        base, ext = os.path.splitext(os.path.basename(old_path))
        new_filename = f"{base}_N{int(northing)}E{int(easting)}{ext}"
        new_path = os.path.join(os.path.dirname(old_path), new_filename)

        # Rename the file
        os.rename(old_path, new_path)
        print(f"Renamed {old_path} to {new_path}")

if __name__ == "__main__":
    main()
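
(Note: run the script only once per folder; running it a second time would append another _N...E... suffix to the already-renamed files.)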

MaRaSu commented 9 months ago

@ajnisbet After I did the renaming I still could not get it working with the whole tileset; it only worked for the single-tile test I did previously. The problem was, of course, in the file naming (though I chased some other possible causes as well...).

Eventually, with some code reading and print-outs, I figured out that my tileset does not comply with the assumption that the lower-left corner of each tile is a multiple of the "base" (in my case 6000 m). Instead, the easting grid has an offset of 2000 m, which was also visible in your example tile earlier in this thread: the lower-left easting of 356000 divided by 6000 gives a quotient of 59 and a remainder of 2000.

I made a quick hack to utils.py / decimal_base_floor(x, base=1, offset=0) and config.py / _location_to_tile_corner(cls, xs, ys, tile_size=1) to take the offset into account, but this should be generalized (if you agree with the problem diagnosis & solution).
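
The core of the change is roughly this (a simplified sketch of the idea only, not the actual opentopodata code):

import math

def floor_to_tile_corner(x, tile_size, offset=0):
    # Floor a projected coordinate onto a tile grid that is shifted by
    # `offset` metres from the origin.
    return offset + tile_size * math.floor((x - offset) / tile_size)

# Easting 357500 lies inside the example tile earlier in this thread
# (easting 356000 to 362000).
print(floor_to_tile_corner(357500, 6000))        # 354000 -> no such file
print(floor_to_tile_corner(357500, 6000, 2000))  # 356000 -> matches ..._N7014000E356000.tif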

ajnisbet commented 9 months ago

Would a config option like filename_tile_size_offset fix your issue without you having to modify opentopodata? (Default of 0; you'd use 2000.)

(Again, sorry for the awkward API! This project started off just for SRTM tiles and has been hackily extended to cover more use cases until I add proper spatial indexing.)

MaRaSu commented 9 months ago

@ajnisbet it would, but it would have to be separate for northing and easting (for the Finnish 2m DEM there is an offset of 2000 only for the easting; the northing grid starts at 0).

And absolutely no need to feel sorry or bad about anything! You have done quite some heavy lifting to create OpenTopoData; hopefully the rest of us can contribute a bit to enhancing it further - and with OSS one can always customize / extend it as needed (BTW, the code is very easy to understand, with helpful comments).

ajnisbet commented 9 months ago

Ok, good point about the differing offsets. I also don't want to support tile_size_x and tile_size_y; I'd rather put the effort into other features!

And thank you!

MaRaSu commented 9 months ago

@ajnisbet Ok, understood. Anyway, I made a fork (tile_size_offset branch) with a code change supporting tile_size_offset_x and tile_size_offset_y dataset params; I can turn that into a PR if it makes sense (I understand that issue #91 is a better solution). I also don't know how common this practice of tile bounds not being divisible by the tile size is.
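
With that branch, the dataset config would look something like this (the offset params exist only in my fork, not in upstream opentopodata):

datasets:
- name: finland-2m-renamed
  path: data/finland-2m-renamed/
  filename_tile_size: 6000  # Metres
  filename_epsg: 3067  # ETRS-TM35FIN
  tile_size_offset_x: 2000  # Metres, easting offset of the tile grid
  tile_size_offset_y: 0     # Northing grid starts at 0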

ajnisbet commented 6 months ago

This would probably need all of

tile_size_offset_x
tile_size_offset_y
tile_size_x
tile_size_y

and I think it would only help a small number of use cases. Closing in favour of https://github.com/ajnisbet/opentopodata/issues/91