geo-data / cesium-terrain-builder

A C++ library and associated command line tools designed to create terrain tiles for use in the Cesium JavaScript library

Add support for MBTiles output #56

Open markerikson opened 6 years ago

markerikson commented 6 years ago

CTB currently writes individual terrain tiles directly to disk. As zoom levels increase, this leads to millions of individual terrain tile files in thousands of folders, which are hard to copy and move around.

The MBTiles file format is a widely used container for image and terrain tiles. It would be great if CTB supported writing tiles directly into a designated MBTiles container file.
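
For reference, MBTiles is just a SQLite database with a small required schema, so the storage side should be straightforward. This is the spec's minimal layout (taken from the MBTiles spec, not anything CTB-specific), shown as the string a writer would hand to `sqlite3_exec()` when creating the file:

```cpp
// Minimal MBTiles schema per the MBTiles spec. Note that tile_row uses the
// TMS scheme (row 0 at the south edge).
const char *kMBTilesSchema = R"sql(
  CREATE TABLE metadata (name TEXT, value TEXT);
  CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER,
                      tile_row INTEGER, tile_data BLOB);
  CREATE UNIQUE INDEX tile_index ON tiles (zoom_level, tile_column, tile_row);
)sql";
```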

We're planning to tackle regenerating our own imagery and terrain datasets within the next few months. It'd be great if someone happened to implement MBTiles capability before then. If not, I may be able to tackle it myself, although it would help if someone who's more familiar with this codebase could offer advice on the best approach for doing so.

ediebold commented 6 years ago

I know this doesn't really answer the question, but this might be useful:

https://www.npmjs.com/package/mbtiles-terrain-server

Obviously requiring node and things is annoying, but this looks like it may do what you're after.

markerikson commented 6 years ago

I... actually seem to have added working support for writing directly to MBTiles this afternoon :)

A basic implementation appeared to be working right before I had to leave. I've got several more aspects of it I want to look at, though, including de-duping inserted tiles.

We actually have our own little homegrown Python server that can serve up tiles from either TMS folders on disk, or by retrieving them from MBTiles files. Wish I could post the code for that, but it's "proprietary". Really, though, it's only like 250 lines of Python, and half of that is looking in subfolders for available tilesets and listing them internally as available. The actual logic for handling a request is straightforward - take ZXY values, query DB, return blob.
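
Roughly, the request handler boils down to this (sketched here in C++ with the sqlite3 API since I can't post our Python; the query is the same either way):

```cpp
#include <sqlite3.h>
#include <cstdint>
#include <vector>

// Sketch of the lookup: take Z/X/Y, query the DB, return the blob.
// Assumes `db` is already open on an MBTiles file. MBTiles stores rows in
// TMS order, so flip Y first if the client asked with XYZ coordinates.
std::vector<uint8_t> fetchTile(sqlite3 *db, int z, int x, int y) {
  const char *sql =
      "SELECT tile_data FROM tiles "
      "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?";
  sqlite3_stmt *stmt = nullptr;
  std::vector<uint8_t> blob;
  if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK) {
    sqlite3_bind_int(stmt, 1, z);
    sqlite3_bind_int(stmt, 2, x);
    sqlite3_bind_int(stmt, 3, y);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
      const auto *data =
          static_cast<const uint8_t *>(sqlite3_column_blob(stmt, 0));
      blob.assign(data, data + sqlite3_column_bytes(stmt, 0));
    }
  }
  sqlite3_finalize(stmt);
  return blob;  // empty == not found, i.e. a 404
}
```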

Fortunately, since CTB is an open-source project already, I do intend on actually filing a PR that adds the write-to-MBTiles capability once I'm sure it's working well.

(Side note: a quick test of a 5x5-degree 90m dataset took 1:35 to write out 17K tiles. Writing that same dataset into an MBTiles file only took 35 seconds :) Let's hear it for not touching the disk as much!)
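
(Most of that speedup is presumably transaction batching: with MBTiles you can wrap thousands of tile inserts in a single transaction and sync once per batch, instead of paying filesystem overhead for every tiny tile file. A rough sketch of that write path, with a hypothetical `TileRecord` type:)

```cpp
#include <sqlite3.h>
#include <cstdint>
#include <vector>

struct TileRecord {            // hypothetical record type for this sketch
  int z, x, y;                 // TMS tile coordinates
  std::vector<uint8_t> data;   // serialized tile payload
};

// One transaction per batch means one journal sync per batch, rather than
// the filesystem cost of creating each tiny tile file individually.
void writeBatch(sqlite3 *db, const std::vector<TileRecord> &batch) {
  sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
  sqlite3_stmt *stmt = nullptr;
  sqlite3_prepare_v2(db,
      "INSERT OR REPLACE INTO tiles "
      "(zoom_level, tile_column, tile_row, tile_data) VALUES (?,?,?,?)",
      -1, &stmt, nullptr);
  for (const auto &t : batch) {
    sqlite3_bind_int(stmt, 1, t.z);
    sqlite3_bind_int(stmt, 2, t.x);
    sqlite3_bind_int(stmt, 3, t.y);
    sqlite3_bind_blob(stmt, 4, t.data.data(),
                      static_cast<int>(t.data.size()), SQLITE_STATIC);
    sqlite3_step(stmt);
    sqlite3_reset(stmt);
    sqlite3_clear_bindings(stmt);
  }
  sqlite3_finalize(stmt);
  sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}
```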

markerikson commented 6 years ago

So that was with heightmaps. Just did another run with quantized-mesh tiles. Here's how all this looks together. The calculations generated 17,685 tiles, and this was on a Win10 machine, NTFS file system.

Time (min:sec)

|            | Heightmaps | Quantized-Mesh |
| ---------- | ---------- | -------------- |
| Tile files | 1:35       | 2:20           |
| MBTiles    | 0:35       | 1:45           |

Size (MB)

|            | Heightmaps | Quantized-Mesh |
| ---------- | ---------- | -------------- |
| Tile files | 72.1       | 28.4           |
| MBTiles    | 76.2       | 31.9           |

Size used on disk (MB)

|            | Heightmaps | Quantized-Mesh |
| ---------- | ---------- | -------------- |
| Tile files | 109        | 54             |
| MBTiles    | 76.2       | 31.9           |

Summarizing: MBTiles output is faster to write for both formats, and while the MBTiles file itself comes out slightly larger than the sum of the loose tile files, it occupies noticeably less space on disk, presumably because thousands of tiny files each waste a partial NTFS cluster.

ahuarte47 commented 6 years ago

Very interesting report, thanks for sharing, Mark.

markerikson commented 6 years ago

I just tried using MD5 hashing to detect duplicate tiles and normalize them, as seen in some other MBTiles implementations, but an initial test run against that same block showed no duplicates. My immediate guess is that terrain tiles (especially quantized-mesh format) are unlikely to be bit-for-bit identical, whereas images may be likelier to see duplication. (I suppose the most likely candidates for duplicates in either case would be over the ocean, and what I've seen is that most terrain datasets only cover land areas for obvious reasons.)

I'll stash those changes and leave them out for now. No point in the expense of running MD5 hashes if there's not going to be any dupes found.
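
(For the record, the usual dedup layout in those other implementations splits `tiles` into a `map` table and an `images` table keyed by a content hash, with a `tiles` VIEW joining them back together. What I tried was shaped roughly like this, not my exact diff:)

```cpp
#include <openssl/md5.h>  // any digest works; MD5 shown to match the text
#include <sqlite3.h>
#include <cstdint>
#include <string>
#include <vector>

// Hex digest of a tile blob, used as the shared key between map and images.
std::string tileDigest(const std::vector<uint8_t> &data) {
  unsigned char md[MD5_DIGEST_LENGTH];
  MD5(data.data(), data.size(), md);
  static const char hex[] = "0123456789abcdef";
  std::string out;
  for (unsigned char b : md) { out += hex[b >> 4]; out += hex[b & 0xf]; }
  return out;
}

// Store the blob once per unique hash, then point this z/x/y at it.
// Assumes map(zoom_level, tile_column, tile_row, tile_id) and
// images(tile_id UNIQUE, tile_data) tables plus a `tiles` view already exist.
void insertDeduped(sqlite3 *db, int z, int x, int y,
                   const std::vector<uint8_t> &data) {
  const std::string id = tileDigest(data);
  sqlite3_stmt *stmt = nullptr;
  sqlite3_prepare_v2(db,
      "INSERT OR IGNORE INTO images (tile_id, tile_data) VALUES (?,?)",
      -1, &stmt, nullptr);
  sqlite3_bind_text(stmt, 1, id.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_bind_blob(stmt, 2, data.data(),
                    static_cast<int>(data.size()), SQLITE_STATIC);
  sqlite3_step(stmt);
  sqlite3_finalize(stmt);

  sqlite3_prepare_v2(db,
      "INSERT OR REPLACE INTO map "
      "(zoom_level, tile_column, tile_row, tile_id) VALUES (?,?,?,?)",
      -1, &stmt, nullptr);
  sqlite3_bind_int(stmt, 1, z);
  sqlite3_bind_int(stmt, 2, x);
  sqlite3_bind_int(stmt, 3, y);
  sqlite3_bind_text(stmt, 4, id.c_str(), -1, SQLITE_TRANSIENT);
  sqlite3_step(stmt);
  sqlite3_finalize(stmt);
}
```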

markerikson commented 6 years ago

Yeah, thinking about it further... since heightmaps are just grids of height values, tiles over uniform terrain could plausibly be bit-for-bit identical. However, since quantized-mesh files by definition embed lat/lon and ECEF coordinates, no two tiles can ever be identical.
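
(You can see why from the tile header that every quantized-mesh file starts with, per the quantized-mesh-1.0 spec; the absolute ECEF values differ for every tile location:)

```cpp
// Header at the start of every quantized-mesh tile (quantized-mesh-1.0 spec).
// Illustrative only; real parsing code should read these as packed
// little-endian fields rather than relying on struct layout.
struct QuantizedMeshHeader {
  double CenterX, CenterY, CenterZ;        // tile center in ECEF metres
  float  MinimumHeight, MaximumHeight;     // height range within the tile
  double BoundingSphereCenterX, BoundingSphereCenterY, BoundingSphereCenterZ;
  double BoundingSphereRadius;
  double HorizonOcclusionPointX, HorizonOcclusionPointY, HorizonOcclusionPointZ;
};
```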

I'm doing some tests with a VRT that uses four 1x1-degree tiles, sliced from the corners of a 5x5-degree tile, as its sources. GDAL interprets all of the empty space in the middle as heights of 0, so you'd expect the tiles there to come out identical.

For a quantized-mesh MBTiles output, all 17865 tiles are unique. Many of them appear to be 131 bytes, but the bytes are different.

For the same source with heightmap MBTiles output, I see 3105 unique tiles out of 17865. Weirdly, the MBTiles file size came out about the same with and without de-duplication.

Since my current end goal is to actually generate a quantized-mesh dataset for myself, I'm going to stash the de-duplication changes and keep moving.

markerikson commented 6 years ago

Progress update: my remaining goal is to have CTB intelligently generate terrain tiles only for areas with actual data. That way, we won't waste large amounts of time and disk space generating tiles for areas like the oceans.

The problem is that a worldwide terrain dataset creates a worldwide bounding box. Depending on your dataset you might not have any actual terrain data for the oceans, but the oceans are included in the bounding box. So, CTB will try to iterate over the entire earth to generate tiles, and that will include a lot of "empty" ocean space (either as zeros, or actual NODATA values).

GDAL 2.2 introduced a "sparse datasets" capability, which lets us query blocks of interest to see whether they contain any valid data. As of last night, I think I've successfully used that in TerrainTiler.cpp to check whether the area covered by a tile is empty.
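
The entry point for that is GDALRasterBand::GetDataCoverageStatus(). The check ends up looking roughly like this (the window math is elided, and the names are placeholders rather than my exact code):

```cpp
#include <gdal_priv.h>

// Ask GDAL whether the raster window backing a tile holds any data blocks.
// `band` and the window offsets/sizes would be derived from the tile's
// extent; those names are placeholders, not CTB's actual variables.
bool tileWindowIsEmpty(GDALRasterBand *band,
                       int xOff, int yOff, int xSize, int ySize) {
  double dataPct = 0.0;
  int flags = band->GetDataCoverageStatus(
      xOff, yOff, xSize, ySize,
      GDAL_DATA_COVERAGE_STATUS_DATA,  // stop early once any data is found
      &dataPct);
  if (flags & GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED)
    return false;  // driver can't answer; assume data and warp as usual
  return (flags & GDAL_DATA_COVERAGE_STATUS_DATA) == 0;
}
```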

From here, my plan is:

1. Add logic in both TerrainTiler.cpp and MeshTiler.cpp to mark a tile as "invalid" when its area contains no data, and skip the terrain warping for that tile entirely.
2. In the tile iterator threads, collect the coordinates of all valid tiles.
3. Use a clustering algorithm like DBSCAN to group up nearby tile coordinates on each zoom level (see the sketch at the end of this comment).
4. Collect the bounding boxes for those clusters, for each level.
5. Write out the multiple bounding boxes in the layer.json metadata file.

I figure this should result in a drastic reduction in both final output size and processing time for a large worldwide dataset.
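
As a rough illustration of step 3: since tile coordinates sit on an integer grid, DBSCAN with eps = 1 tile and minPts = 1 degenerates into plain connected-component grouping over the 8-neighbourhood, which may be all this needs. A sketch, not my actual implementation:

```cpp
#include <algorithm>
#include <queue>
#include <set>
#include <utility>
#include <vector>

struct BBox { int minX, minY, maxX, maxY; };

// Group tile coordinates into clusters of 8-connected neighbours and return
// one bounding box per cluster (run once per zoom level).
std::vector<BBox> clusterTiles(const std::set<std::pair<int, int>> &valid) {
  std::set<std::pair<int, int>> remaining = valid;
  std::vector<BBox> boxes;
  while (!remaining.empty()) {
    // Flood-fill one cluster starting from any unvisited tile.
    std::queue<std::pair<int, int>> frontier;
    auto seed = *remaining.begin();
    remaining.erase(remaining.begin());
    frontier.push(seed);
    BBox box{seed.first, seed.second, seed.first, seed.second};
    while (!frontier.empty()) {
      auto [cx, cy] = frontier.front();
      frontier.pop();
      box.minX = std::min(box.minX, cx); box.maxX = std::max(box.maxX, cx);
      box.minY = std::min(box.minY, cy); box.maxY = std::max(box.maxY, cy);
      for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy) {
          auto it = remaining.find({cx + dx, cy + dy});
          if (it != remaining.end()) {
            frontier.push(*it);
            remaining.erase(it);
          }
        }
    }
    boxes.push_back(box);
  }
  return boxes;
}
```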

ediebold commented 6 years ago

Awesome idea.

This feels like it would be the first step towards allowing us to add new rasters to existing terrain sets. As I see it, there are basically three major cases to consider:

  1. The new raster has information for a terrain tile that doesn't yet exist. Here you can simply create it.
  2. The new raster has data covering all of a tile that already exists. Here, you can simply overwrite it (or not, depending on settings?).
  3. The new raster has data for part of a tile that already exists. This is the trickiest case, and it's where we'd need the ability you describe to intelligently get the bounds. When this happens, we could do something like using gdal_merge to merge the VRT of the current tile with the tif from a ctb-export call (see the sketch below). Not sure if this is the most efficient way, but it could work.
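
gdal_merge itself is just a Python utility; from C++ the nearest equivalent I can think of is building a VRT over both sources, since with gdalbuildvrt semantics the sources listed later take priority where they overlap. A totally untested sketch with placeholder file names:

```cpp
#include <gdal.h>
#include <gdal_utils.h>

// Build an in-memory VRT over the existing tile's source and the new raster;
// later sources win where they overlap, so the new data sits on top.
// File names are placeholders. Call GDALAllRegister() once at startup.
GDALDatasetH mergeSources(const char *existingTif, const char *newTif) {
  const char *srcs[] = {existingTif, newTif};
  int usageError = 0;
  return GDALBuildVRT("/vsimem/merged.vrt", 2, /*pahSrcDS=*/nullptr,
                      srcs, /*options=*/nullptr, &usageError);
}
```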

I’d be willing to have a crack at adding this functionality once you’ve got your stuff working, but my C++ is pretty poor, so if someone who knows what they’re doing wanted to beat me to it, I’d be fine with that.