databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

OutOfMemoryError When Reading Large COG File with mosaic.read() in Databricks #496

Open Thimm opened 10 months ago

Thimm commented 10 months ago

Describe the bug I am encountering an OutOfMemoryError when reading a large Cloud Optimized GeoTIFF (COG) file (2.4 GB) with the mosaic.read() method in an Azure Databricks environment. The error is raised when df.show() executes after the read.

To Reproduce

  1. Download the file to DBFS (a minimal staging sketch follows this list).
  2. Run:

```python
import mosaic

mosaic.enable_mosaic(spark, dbutils)

file_path = "[path to file]"
df = (
    mosaic.read().format("raster_to_grid")
    .option("driverName", "GTiff")
    .option("fileExtension", "*.tif")
    .load(f"file://{file_path}")
)
df.show()
```
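Step 1 is environment-specific. A minimal sketch of one way to stage the file, assuming it is reachable over HTTPS and the cluster exposes the DBFS FUSE mount at /dbfs (the URL and target path below are hypothetical):

```python
import urllib.request

# Hypothetical source URL and DBFS target path; substitute your own.
src_url = "https://example.com/large_cog.tif"
dst_path = "/dbfs/tmp/large_cog.tif"

# Download the COG to the DBFS FUSE mount so Spark can read it locally.
urllib.request.urlretrieve(src_url, dst_path)
```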

Expected behavior The COG file is read into a DataFrame and displayed with df.show() without memory errors.

Additional Context

Environment

  - Databricks Runtime Version: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
  - Cluster Configuration: Standard_D32ads_v5 (128 GB memory, 32 cores)
  - Language: Python

Traceback.txt

milos-colic commented 10 months ago

@Thimm Thank you for reporting this issue. It will be resolved in the next release. A bug was causing the retiling of large files to happen at a deferred stage rather than immediately on read. Spark buffers do not support binaries larger than 2 GB, so on read we have to retile the file into tiles smaller than 2 GB and then perform transformations on those. I will be opening a PR today, and the fix will be part of the next release. I ran the provided file on my local machine with the new fix without any issues, using Docker with Rosetta translation since I am on an M1 Mac; even with those constraints it now runs. The next release should be out within a couple of weeks.
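For readers hitting this before the fix ships, one possible mitigation is to ask the reader to split the input up front. The sketch below is not the fix described above; it assumes the raster_to_grid reader's retile and tileSize options are available in the installed Mosaic version, and the option values are illustrative:

```python
# A minimal sketch, assuming the raster_to_grid reader accepts "retile"
# and "tileSize" options to split rasters into smaller tiles before
# transformations run; values below are illustrative and need tuning.
import mosaic

mosaic.enable_mosaic(spark, dbutils)

file_path = "[path to file]"  # same placeholder as in the repro above
df = (
    mosaic.read().format("raster_to_grid")
    .option("driverName", "GTiff")
    .option("fileExtension", "*.tif")
    .option("retile", "true")   # assumption: retile the raster on read
    .option("tileSize", "512")  # assumption: tile edge length in pixels
    .load(f"file://{file_path}")
)
df.show()
```

Alternatively, pre-splitting the COG into sub-2 GB tiles with GDAL (for example gdal_retile.py) before handing it to Spark sidesteps the buffer limit entirely.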