databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

OutOfMemoryError When Reading Large COG File with mosaic.read() in Databricks #496

Open Thimm opened 10 months ago

Thimm commented 10 months ago

Describe the bug I am encountering an OutOfMemoryError when reading a large Cloud Optimized GeoTIFF (COG) file (2.4 GB) with the mosaic.read() method in an Azure Databricks environment. The error is raised when df.show() executes after the read.

To Reproduce

  1. Download the file to DBFS (a minimal staging sketch follows this list).
  2. Run:

```python
import mosaic

mosaic.enable_mosaic(spark, dbutils)

file_path = "[path to file]"
df = (
    mosaic.read().format("raster_to_grid")
    .option("driverName", "GTiff")
    .option("fileExtension", "*.tif")
    .load(f"file://{file_path}")
)
df.show()
```
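Step 1 is environment-specific. A minimal sketch of one way to stage the file, assuming it is reachable over HTTPS and the cluster exposes the DBFS FUSE mount at /dbfs (the URL and target path below are hypothetical):

```python
import urllib.request

# Hypothetical source URL and DBFS target path; substitute your own.
src_url = "https://example.com/large_cog.tif"
dst_path = "/dbfs/tmp/large_cog.tif"

# Download the COG to the DBFS FUSE mount so Spark can read it locally.
urllib.request.urlretrieve(src_url, dst_path)
```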

Expected behavior The COG file is read into a DataFrame and displayed with df.show() without memory errors.

Additional Context

Environment

  - Databricks Runtime Version: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
  - Cluster Configuration: Standard_D32ads_v5 (128 GB memory, 32 cores)
  - Language: Python

Traceback.txt

milos-colic commented 10 months ago

@Thimm Thank you for reporting this issue. It will be resolved in the next release. A bug was causing the retiling of large files to happen at a deferred stage rather than immediately on read. Spark buffers do not support binaries larger than 2 GB, so on read we have to retile the file into tiles smaller than 2 GB and then perform transformations on those. I will be opening a PR today, and the fix will be part of the next release. I ran the provided file on my local machine with the new fix without any issues, using Docker with Rosetta translation since I am on an M1 Mac; even with those constraints it now runs. The next release should be out within a couple of weeks.
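For readers hitting this before the fix ships, one possible mitigation is to ask the reader to split the input up front. The sketch below is not the fix described above; it assumes the raster_to_grid reader's retile and tileSize options are available in the installed Mosaic version, and the option values are illustrative:

```python
# A minimal sketch, assuming the raster_to_grid reader accepts "retile"
# and "tileSize" options to split rasters into smaller tiles before
# transformations run; values below are illustrative and need tuning.
import mosaic

mosaic.enable_mosaic(spark, dbutils)

file_path = "[path to file]"  # same placeholder as in the repro above
df = (
    mosaic.read().format("raster_to_grid")
    .option("driverName", "GTiff")
    .option("fileExtension", "*.tif")
    .option("retile", "true")   # assumption: retile the raster on read
    .option("tileSize", "512")  # assumption: tile edge length in pixels
    .load(f"file://{file_path}")
)
df.show()
```

Alternatively, pre-splitting the COG into sub-2 GB tiles with GDAL (for example gdal_retile.py) before handing it to Spark sidesteps the buffer limit entirely.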