Defra-Data-Science-Centre-of-Excellence / pyspark-vector-files

Read vector files into a Spark DataFrame with geometry encoded as WKB.
https://defra-data-science-centre-of-excellence.github.io/pyspark-vector-files/
MIT License

Cumbersome reading of zipped vector files with `read_vector_files` #24

Open jpd-defra opened 2 years ago

jpd-defra commented 2 years ago

Attempts to fix #22.

1) Adds a new private function, _check_compressed, that checks whether the provided path points to a compressed file. It is used in read_vector_files and in _get_paths.

2) Adds a check that a vsi_prefix has been provided whenever the path points to a compressed file.

3) Modifies _get_paths to handle compressed file inputs, avoiding an error even though they are not directories. The parent directory and suffix are extracted and used accordingly to format the path(s).
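
For discussion, here is a rough sketch of the three changes above. It is not the code in this PR: the function names (check_compressed, validate_vsi_prefix, split_compressed_path) and the set of suffixes treated as compressed are assumptions made purely for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Assumption: suffixes treated as "compressed"; the real list may differ.
COMPRESSED_SUFFIXES = {".zip", ".gz", ".tar", ".tgz"}


def check_compressed(path: str) -> bool:
    """Return True if the final suffix of the path looks like a compressed file."""
    return PurePosixPath(path).suffix.lower() in COMPRESSED_SUFFIXES


def validate_vsi_prefix(path: str, vsi_prefix: Optional[str] = None) -> None:
    """Raise if a compressed file is given without a vsi_prefix."""
    if check_compressed(path) and not vsi_prefix:
        raise ValueError(
            f"'{path}' looks like a compressed file, so a vsi_prefix is required."
        )


def split_compressed_path(path: str) -> Tuple[str, str]:
    """Return the parent directory and suffix of a compressed file path."""
    p = PurePosixPath(path)
    return str(p.parent), p.suffix


# Hypothetical example path, not taken from the repository.
validate_vsi_prefix("/data/boundaries.zip", vsi_prefix="/vsizip/")
parent, suffix = split_compressed_path("/data/boundaries.zip")
```

The check only inspects the file name, so it never opens the archive; a mislabelled file would still pass and fail later when it is read.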

This may not be the best implementation, but I hope it is a small improvement in working with compressed files. Feedback/suggestions for modifications welcome!

Future work could include automatically matching vsi_prefixes depending on suffix type, although you may want the user to remain in control of this.
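
One possible shape for that future work, as a sketch only: infer a GDAL virtual file system prefix from the suffix, but let an explicit vsi_prefix from the user take precedence. The function name infer_vsi_prefix and the mapping table are assumptions; only the prefixes themselves (/vsizip/, /vsigzip/, /vsitar/) are standard GDAL ones.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: a minimal suffix-to-prefix table; extend as needed.
VSI_PREFIX_BY_SUFFIX = {
    ".zip": "/vsizip/",
    ".gz": "/vsigzip/",
    ".tar": "/vsitar/",
    ".tgz": "/vsitar/",
}


def infer_vsi_prefix(path: str, vsi_prefix: Optional[str] = None) -> Optional[str]:
    """Return the user's vsi_prefix if given, else guess one from the suffix."""
    if vsi_prefix is not None:
        # The user stays in control: an explicit prefix is never overridden.
        return vsi_prefix
    suffixes = [s.lower() for s in PurePosixPath(path).suffixes]
    if not suffixes:
        return None
    # A gzipped tarball needs /vsitar/ rather than /vsigzip/.
    if suffixes[-2:] == [".tar", ".gz"]:
        return "/vsitar/"
    return VSI_PREFIX_BY_SUFFIX.get(suffixes[-1])
```

Returning None for unknown suffixes leaves it to the caller to decide whether that is an error or simply an uncompressed file.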

jpd-defra commented 1 year ago

[Image: pyspark_vector_reading drawio]

My first go at mapping the reading procedures, with some unanswered questions up for discussion.