Defra-Data-Science-Centre-of-Excellence / pyspark-vector-files

Read vector files into a Spark DataFrame with geometry encoded as WKB.
https://defra-data-science-centre-of-excellence.github.io/pyspark-vector-files/
MIT License

Add OGR subtypes #21

Closed: EFT-Defra closed this 2 years ago

EFT-Defra commented 2 years ago

Closes #14 by extending OGR_TO_SPARK and SPARK_TO_PANDAS lookups to include OGR subtypes such as OFSTBoolean and OFSTFloat32.
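As a rough illustration of what a subtype-aware lookup could look like, here is a minimal sketch. It is not the package's actual OGR_TO_SPARK table, whose structure is not shown in this thread. The names OGR_TO_SPARK_SKETCH and spark_type_for are hypothetical, and both the OGR constants and the Spark types are written as plain strings so the example runs without GDAL or PySpark installed.

```python
# Hypothetical sketch: key the lookup on (OGR field type, OGR field subtype)
# so that subtypes such as OFSTBoolean and OFSTFloat32 get their own Spark type.
# Strings stand in for the real GDAL constants and pyspark.sql.types classes.
OGR_TO_SPARK_SKETCH = {
    ("OFTInteger", "OFSTNone"): "IntegerType",
    ("OFTInteger", "OFSTBoolean"): "BooleanType",
    ("OFTInteger", "OFSTInt16"): "ShortType",
    ("OFTReal", "OFSTNone"): "DoubleType",
    ("OFTReal", "OFSTFloat32"): "FloatType",
    ("OFTString", "OFSTNone"): "StringType",
    ("OFTStringList", "OFSTNone"): "ArrayType(StringType)",
}


def spark_type_for(ogr_type, ogr_subtype="OFSTNone"):
    """Return the Spark type name for an OGR type/subtype pair.

    Falls back to the plain (subtype-less) entry when the pair is not listed.
    Raises KeyError if the OGR type itself is unknown.
    """
    key = (ogr_type, ogr_subtype)
    if key in OGR_TO_SPARK_SKETCH:
        return OGR_TO_SPARK_SKETCH[key]
    return OGR_TO_SPARK_SKETCH[(ogr_type, "OFSTNone")]


print(spark_type_for("OFTInteger", "OFSTBoolean"))
print(spark_type_for("OFTReal", "OFSTFloat32"))
```

Falling back to the subtype-less entry is one possible design choice: an unlisted subtype still resolves to the wider base type rather than failing outright.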

@aw-west-defra, I think this works, but I haven't been able to test it with the dataset that was giving you trouble in the first place. Can you double-check it for me, please?

aw-west-defra commented 2 years ago

This did not solve my problem. I'm using CDAP SND.

from pyspark_vector_files import __version__, read_vector_files

assert __version__ == '0.2.2'

df = read_vector_files(
  path='/dbfs/mnt/bronze/Ordnance_Survey/Master_Map/Gzip_GML/latest/',
  suffix='.gz',
  layer_identifier='TopographicLine',
  vsi_prefix='/vsigzip/',
)

This raises PythonException: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'.

Edit: The data is not corrupted. I have successfully read it with geopandas.read_file, saved it with df.to_parquet, and reread it in Spark.

EFT-Defra commented 2 years ago

It works if you specify the schema and coerce to it:

from pyspark.sql.types import (
  ArrayType,
  BinaryType,
  BooleanType,
  IntegerType,
  StringType,
  StructField,
  StructType,
)
from pyspark_vector_files import read_vector_files

schema = StructType([
  StructField("fid", StringType(), True),
  StructField("featureCode", IntegerType(), True),
  StructField("version", IntegerType(), True),
  StructField("versionDate", StringType(), True),
  StructField("theme", ArrayType(StringType(), True), True),
  StructField("accuracyOfPosition", ArrayType(StringType(), True), True),
  StructField("changeDate", ArrayType(StringType(), True), True),
  StructField("reasonForChange", ArrayType(StringType(), True), True),
  StructField("descriptiveGroup", ArrayType(StringType(), True), True),
  StructField("descriptiveTerm", ArrayType(StringType(), True), True),
  StructField("nonBoundingLine", BooleanType(), True),
  StructField("make", StringType(), True),
  StructField("physicalLevel", IntegerType(), True),
  StructField("physicalPresence", StringType(), True),
  StructField("geometry", BinaryType(), True)
])

df = read_vector_files(
  path='/dbfs/mnt/bronze/Ordnance_Survey/Master_Map/Gzip_GML/latest/',
  suffix='.gz',
  layer_identifier='TopographicLine',
  vsi_prefix='/vsigzip/',
  schema=schema,
  coerce_to_schema=True,
)

display(df)

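Which means it's probably a GML-being-a-list issue: fields such as theme and descriptiveGroup come back as lists, which matches the 'Expected bytes, got a 'list' object' error raised when no schema is given.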
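To make the list issue concrete, here is a minimal stdlib-only sketch of coercing records so that array-typed fields always hold lists. It is an illustration under assumptions, not how coerce_to_schema is implemented in pyspark_vector_files. The helper coerce_record and the sample records are hypothetical.

```python
# Hypothetical sketch: normalise fields declared as arrays so every value is a
# list (or None). Mixed scalar/list values in one column are the kind of thing
# that can trip up type inference when no schema is supplied.
ARRAY_FIELDS = {"theme", "descriptiveGroup", "descriptiveTerm"}


def coerce_record(record, array_fields=ARRAY_FIELDS):
    """Wrap scalar values of array fields in a list; leave lists and None as-is."""
    coerced = {}
    for name, value in record.items():
        if name in array_fields and value is not None and not isinstance(value, list):
            coerced[name] = [value]
        else:
            coerced[name] = value
    return coerced


# Made-up sample records: one has a scalar theme, the other a list.
records = [
    {"fid": "a", "theme": "Roads Tracks And Paths", "make": "Manmade"},
    {"fid": "b", "theme": ["Land", "Water"], "make": None},
]

for rec in records:
    print(coerce_record(rec))
```

After coercion every theme value is a list, so the column has a single consistent shape that an ArrayType(StringType()) field can describe.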

EFT-Defra commented 2 years ago

Specifying the schema isn't very user-friendly though, so I think it's worth exploring other options: #23