databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

NullPointerException when loading a shapefile #553

Closed FreddiePalfreman closed 5 months ago

FreddiePalfreman commented 5 months ago

I'm using Mosaic 0.4.1 on DBR 13.3 LTS (photon enabled), with mounts to ADLS Gen2 containers.

I followed the GDAL installation guide with the default options for setup_gdal(), then added the init script in my cluster settings and restarted.
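For context, the setup flow described above looks roughly like the following in a notebook. This is a hedged sketch based on the Mosaic installation guide: `spark` and `dbutils` exist only inside a Databricks notebook and the `mosaic` wheel must already be installed on the cluster, so the calls are wrapped in a function rather than run at import time.

```python
def configure_mosaic_gdal(spark, dbutils):
    """Sketch of the GDAL setup flow from the Mosaic installation guide.

    Assumes a Databricks notebook with the Mosaic wheel installed;
    nothing here is verified outside that environment.
    """
    import mosaic as mos  # only available on the cluster

    mos.enable_mosaic(spark, dbutils)  # register Mosaic with the session
    mos.setup_gdal(spark)              # write the GDAL init script (default options)
    # -> attach the generated init script in the cluster settings,
    #    restart the cluster, then in a fresh session:
    mos.enable_gdal(spark)             # load the GDAL bindings
```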

I then followed the Mosaic + GDAL Shapefile Example:

[screenshot: the example notebook cells]

Using the same Shapefiles as in the example, I can see the zipped and unzipped versions in DBFS:

[screenshot: DBFS listing showing the zipped and unzipped files]

However, I get a NullPointerException when I try to read the file:

df = mos.read().format("multi_read_ogr").load("dbfs:/mnt/bronze/tl_rd22_01001_addrfeat/")

I've tried other forms of the file path (/dbfs/mnt/, /mnt/), but these give the same error. I've also tried the native Spark reader (spark.read.format("shapefile").load()), which produces the same error.

java.lang.NullPointerException
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-4185522399211453>, line 1
----> 1 mos.read().format("multi_read_ogr").load("dbfs:/mnt/bronze/1_postcodes/tl_rd22_01001_addrfeat/")
      2 # df = spark.read.format("shapefile").load("/mnt/bronze/1_postcodes/b/b.shp")

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/mosaic/readers/mosaic_data_frame_reader.py:55, in MosaicDataFrameReader.load(self, *paths)
     49 def load(self, *paths):
     50     """
     51     Load the data source as a MosaicDataFrame.
     52     :param paths: Paths to the data source.
     53     :return: MosaicDataFrame.
     54     """
---> 55     df = self.reader.load(*paths)
     56     return DataFrame(df, SQLContext(self.spark.sparkContext))

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
   1349 command = proto.CALL_COMMAND_NAME +\
   1350     self.command_header +\
   1351     args_command +\
   1352     proto.END_COMMAND_PART
   1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
   1356     answer, self.gateway_client, self.target_id, self.name)
   1358 for temp_arg in temp_args:
   1359     if hasattr(temp_arg, "_detach"):

File /databricks/spark/python/pyspark/errors/exceptions/captured.py:188, in capture_sql_exception.<locals>.deco(*a, **kw)
    186 def deco(*a: Any, **kw: Any) -> Any:
    187     try:
--> 188         return f(*a, **kw)
    189     except Py4JJavaError as e:
    190         converted = convert_exception(e.java_exception)

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o730.load.
: java.lang.NullPointerException
    at com.databricks.labs.mosaic.datasource.OGRFileFormat$.getLayer(OGRFileFormat.scala:97)
    at com.databricks.labs.mosaic.datasource.multiread.OGRMultiReadDataFrameReader.load(OGRMultiReadDataFrameReader.scala:41)
    at com.databricks.labs.mosaic.datasource.multiread.OGRMultiReadDataFrameReader.load(OGRMultiReadDataFrameReader.scala:23)
    at com.databricks.labs.mosaic.datasource.multiread.MosaicDataFrameReader.load(MosaicDataFrameReader.scala:72)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
    at java.lang.Thread.run(Thread.java:750)
sllynn commented 4 months ago

Hey Freddie - hope you found a resolution for this.

For others finding this page: please check whether your shapefile (or geodatabase, GeoJSON, etc.) is valid and can be read by GDAL by running something like this in a separate cell:

%sh ogrinfo <<filename>>.shp
FreddiePalfreman commented 4 months ago

@sllynn My issue was reading the shapefiles straight from the ADLS mount point. When I copied the files from the mount into my Databricks workspace with dbutils.fs.cp(), Mosaic's vector file readers worked like a charm! So the file itself was valid; the reader just seems unable to read files directly from a mount point.
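For anyone hitting the same error, the workaround described above can be sketched as follows. This is a hypothetical illustration, not a verified fix: the destination path is an example, and `dbutils` and `mos` exist only inside a Databricks notebook, so they are passed in as parameters here.

```python
def copy_then_load(mount_dir, dest_dir, dbutils, mos):
    """Copy a shapefile directory off the ADLS mount, then read the copy.

    Copying the whole directory recursively keeps the shapefile sidecar
    files (.shp, .shx, .dbf, .prj) together, which OGR needs.
    """
    dbutils.fs.cp(mount_dir, dest_dir, recurse=True)
    return mos.read().format("multi_read_ogr").load(dest_dir)

# Example (paths are illustrative):
# df = copy_then_load("dbfs:/mnt/bronze/tl_rd22_01001_addrfeat/",
#                     "dbfs:/FileStore/tl_rd22_01001_addrfeat/",
#                     dbutils, mos)
```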