Open carlosg-m opened 4 months ago
Hi @carlosg-m, we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file); a fix is coming with [WIP] PR #556, hopefully in a week or so.
Thank you for the response, @mjohns-databricks. Are there any best practices to make the operation I described efficient, using the current version of Mosaic?
The workaround is to generate a table of "reading windows or bounding boxes" and with a UDF go through each one in parallel, loading each window with Rasterio and converting it to a "grid index".
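To make the "table of reading windows" step concrete, here is a minimal sketch in plain Python. It is not the code from this thread: `ReadWindow`, `make_windows`, the tile size, and the raster dimensions are all hypothetical names and values chosen for illustration. Each tuple uses the same field order as `rasterio.windows.Window(col_off, row_off, width, height)`, so a UDF could rebuild a window from a row and pass it to a windowed read.

```python
from typing import Iterator, NamedTuple


class ReadWindow(NamedTuple):
    """One rectangular reading window, in pixel coordinates (hypothetical helper)."""
    col_off: int
    row_off: int
    width: int
    height: int


def make_windows(raster_width: int, raster_height: int, tile_size: int) -> Iterator[ReadWindow]:
    """Yield non-overlapping windows that together cover the whole raster.

    Windows on the right and bottom edges are clipped so they never
    extend past the raster bounds.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for row_off in range(0, raster_height, tile_size):
        height = min(tile_size, raster_height - row_off)
        for col_off in range(0, raster_width, tile_size):
            width = min(tile_size, raster_width - col_off)
            yield ReadWindow(col_off, row_off, width, height)


# Assumed example dimensions; in practice they would come from the raster's metadata.
windows = list(make_windows(raster_width=2500, raster_height=1200, tile_size=1024))
for w in windows:
    print(w)
```

The resulting list can be turned into a Spark DataFrame with one row per window. Because edge windows are smaller and some windows may contain mostly nodata, repartitioning that DataFrame so windows are spread evenly across tasks is one thing to try when a single task lags behind, although whether it helps depends on the data.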
My implementation of this workaround is very slow and seems to suffer from heavy data skew (it gets stuck on the last task).
When I try the "retile" option, it throws an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.
Don't take this the wrong way: it is a pleasure to work with Shapely/Pygeos/GeoPandas and even Rasterio together with Spark and Pandas UDFs. However, navigating Databricks-Mosaic has been an absolute pain (the same happened with Sedona and GeoSpark).