databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
269 stars 66 forks source link

`raster_to_grid` not working when using `retile` with `GeoTIFF` file #558

Open carlosg-m opened 4 months ago

carlosg-m commented 4 months ago

This example is very slow and seems to have a lot of data skew (it gets stuck on the last task):

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

When trying to use "retile" option it throws an error:

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .option("retile", "true")\
        .option("tileSize", "1000")\
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.

Don't take this the wrong way, it is a pleasure to work with Shapely/Pygeos/GeoPandas and even Rasterio together with Spark and Pandas UDFs, however it is being an absolute pain navigating through Databricks-Mosaic (the same happened with Sedona and GeoSpark).

mjohns-databricks commented 4 months ago

Hi @carlosg-m we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file), fix coming with [WIP] PR #556 hopefully in a week or so.

carlosg-m commented 4 months ago

Hi @carlosg-m we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file), fix coming with [WIP] PR #556 hopefully in a week or so.

Thank you for the response, @mjohns-databricks. Are there any best practices to make the operation I described efficient, using the current version of Mosaic?

The workaround is to generate a table of "reading windows or bounding boxes" and with a UDF go through each one in parallel, loading each window with Rasterio and converting it to a "grid index".