databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

Question: Raster Tile Merging and TIF File Output #489

Open RickLeite opened 11 months ago

RickLeite commented 11 months ago

How can I merge raster tiles and write them to a TIFF file?

Is there already a way to do that, or is it planned to be introduced?


My Current Approach:

import mosaic as mos  # assumes mos.enable_mosaic(spark, dbutils) has already been run
from pyspark.sql.functions import collect_list

df = spark.read.format("gdal").option("extensions", "tif")\
           .load("dbfs:/FileStore/temp/rastersfile/extracted")\
           .groupBy().agg(collect_list("tile").alias("tile"))

merged_tile = df.select(mos.rst_merge("tile").alias("merged"))

# Pull the single merged tile back to the driver; the raster field
# holds the serialized file as raw bytes (not base64).
result = merged_tile.collect()[0]
raster_bytes = bytes(result["merged"]["raster"])

# Plain-file APIs reach DBFS through the /dbfs FUSE mount.
output_path = "/dbfs/FileStore/temp/rastersfile/merged/mergedrasters.tif"
with open(output_path, "wb") as output_file:
    output_file.write(raster_bytes)
RickLeite commented 11 months ago

Clearly, my current approach results in the loss of all file metadata. Additionally, handling a large number of rasters causes kernel issues due to memory constraints. I've attempted to use the latest rasterio UDFs, but I'm unsure how to proceed after merging the tiles.
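
One idea I still want to try, assuming the installed Mosaic version ships the aggregate variant rst_merge_agg (I have not verified it on my cluster): merging as an aggregation should avoid materializing every raster payload in one collect_list array first. A sketch:

df = spark.read.format("gdal").option("extensions", "tif")\
           .load("dbfs:/FileStore/temp/rastersfile/extracted")

# Sketch: rst_merge_agg (if available) merges tiles as an aggregate
# expression, so no collect_list of full raster payloads is needed.
merged = df.select(mos.rst_merge_agg("tile").alias("merged"))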

RickLeite commented 11 months ago

Using a rasterio UDF:

import rasterio
from rasterio.io import MemoryFile
from pyspark.sql.functions import collect_list, lit, udf
from pathlib import Path

df = spark.read.format("gdal").option("extensions", "tif")\
           .load('/FileStore/temp/esri')\
           .groupBy().agg(collect_list("tile").alias("tile"))

merged_tile = df.select(mos.rst_merge("tile").alias('merged'))

@udf("string")
def write_raster(raster, parent_dir):
  # Spark hands the BinaryType column over as a bytearray;
  # MemoryFile accepts raw bytes directly.
  with MemoryFile(bytes(raster)) as memfile:
    with memfile.open() as dataset:
      Path(parent_dir).mkdir(parents=True, exist_ok=True)
      # raster_driver_extensions() maps extension -> driver; invert it
      # to look up an extension by driver name (drivers with several
      # extensions, like GTiff, get an arbitrary one).
      extensions_map = rasterio.drivers.raster_driver_extensions()
      driver_map = {v: k for k, v in extensions_map.items()}
      extension = driver_map[dataset.driver]
      file_id = 5234476790949929865   # manually set ID (see note below)
      path = f"{parent_dir}/{file_id}.{extension}"

      with rasterio.open(path, "w", **dataset.profile) as dst:
        dst.write(dataset.read())
        # Prints inside a UDF go to the executor's stdout, not the notebook.
        print(f"wrote to: {path}")
      return path

Since the returned merged tile only provides index_id, raster, parentPath, and driver, I set the file ID manually.
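
A non-hardcoded alternative (sketch, standard library only) would be to generate a fresh ID per call:

import uuid

# Sketch: derive the file name from a freshly generated UUID
# instead of a hardcoded literal.
file_id = uuid.uuid4().hex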


# .show() forces the UDF to execute; the returned value is the written path.
merged_tile.select(write_raster("merged.raster", lit("dbfs:/FileStore/temp/esri/rastermerged"))).show(truncate=False)

Apparently it is a little buggy: it wrote into a local folder literally named 'dbfs:' rather than into DBFS itself, and unsurprisingly I can't find the file by browsing DBFS from the Databricks catalog. But anyway, I was able to move the file to the desired location with shutil.

import shutil

# Source: the accidental local directory literally named 'dbfs:';
# destination: the real DBFS location via the /dbfs FUSE mount.
shutil.copy('dbfs:/FileStore/temp/esri/rastermerged/5234476790949929865.tiff',
            '/dbfs/FileStore/temp/esri/rastermerged/5234476790949929865.tiff')
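
I suspect the root cause is that plain Python file I/O (which rasterio uses here) does not understand the dbfs:/ URI scheme, so 'dbfs:' becomes a relative directory name. Since DBFS is FUSE-mounted at /dbfs/ on Databricks, passing the path in that form should write to the right place directly and make the copy step unnecessary (sketch):

# Sketch: use the FUSE path so rasterio writes straight to DBFS.
merged_tile.select(
    write_raster("merged.raster", lit("/dbfs/FileStore/temp/esri/rastermerged"))
).show(truncate=False)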

But when I downloaded the merged file, it corresponded to only one of the rasters in the directory (the first one). This is strange, because I did merge them, and with my earlier approach of writing the raw binary data out directly, the result was the merged raster.

milos-colic commented 10 months ago

@RickLeite thank you for your question.

The parent behaviour you are describing is the current behaviour, which we plan to adjust. At the moment only one parent is reported even though there may be many parents. In the next versions we will update the schema to capture a list of parents, as opposed to the single string parent path we have now.

So your output file is a merged raster, but it reports only the first parent from the collected set at runtime (this won't be the same value between reruns).

This is currently planned for the 0.4.1 release.
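
In the meantime, a possible workaround is to carry the parent paths through the aggregation in a separate column (a sketch, assuming the tile struct produced by the gdal reader exposes the same parentPath field your merged output shows):

from pyspark.sql.functions import collect_list

# Sketch: keep every parent path in its own column so nothing is lost
# when rst_merge collapses the tiles into a single raster.
df = spark.read.format("gdal").option("extensions", "tif")\
           .load('/FileStore/temp/esri')\
           .groupBy().agg(
               collect_list("tile").alias("tile"),
               collect_list("tile.parentPath").alias("parent_paths"))

merged = df.select(mos.rst_merge("tile").alias("merged"), "parent_paths")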

Kind regards,
Milos

RickLeite commented 10 months ago

Hi @milos-colic,

Appreciate your response! Excited for what's ahead!