JuliaGeo / Shapefile.jl

Parsing .shp files in Julia
http://juliageo.org/Shapefile.jl/
MIT License

Reading Shapefiles directly from zipfiles #75

Closed dgleich closed 2 months ago

dgleich commented 1 year ago

Many Shapefiles are distributed directly as zip files.

The routine below shows how to read them directly from the zip file without decompressing it to disk. I used this to read all 3000 zip files from the US road database.

This seems like it might be a useful feature to add to the library. If that's something that might be of interest, let me know as there would be a few different ways this could be integrated into the library.

## Code to read shapefiles from zips
import ZipFile, Shapefile
function read_shp_from_zipfile(zipfile)
  r = ZipFile.Reader(zipfile)
  # collect the .shp, .shx, .dbf and .prj entries from the archive
  shpdata, shxdata, dbfdata, prjdata = nothing, nothing, nothing, nothing
  for f in r.files
    fn = f.name
    lfn = lowercase(fn)
    if endswith(lfn, ".shp")
      shpdata = IOBuffer(read(f))
    elseif endswith(lfn, ".shx")
      shxdata = read(f, Shapefile.IndexHandle)
    elseif endswith(lfn, ".dbf")
      dbfdata = Shapefile.DBFTables.Table(IOBuffer(read(f)))
    elseif endswith(lfn, ".prj")
      prjdata = try
        Shapefile.GeoFormatTypes.ESRIWellKnownText(read(f, String))
      catch
        @warn "Projection file $zipfile/$lfn appears to be corrupted. `nothing` used for `crs`"
        nothing 
      end
    end
  end
  close(r)
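  # fail with a clear message; @assert may be disabled at some optimization levels
  shpdata === nothing && error("no .shp file found in $zipfile")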
  shp = if shxdata !== nothing # we have shxdata/index 
    read(shpdata, Shapefile.Handle, shxdata)
  else
    read(shpdata, Shapefile.Handle)
  end 
  if prjdata !== nothing
    shp.crs = prjdata 
  end 
  return Shapefile.Table(shp, dbfdata)
end

visr commented 1 year ago

Thanks for raising the issue and sharing the code. I agree that zipped shapefiles are becoming more common, with other software like GDAL supporting them directly. One alternative approach I can think of is using https://github.com/JuliaIO/TranscodingStreams.jl, where users can supply the decompressor from CodecZlib. That way we avoid the JLL dependency while still making it easier to load from a compressed file (not just zipfiles).

Though since zipfiles are the most common and the JLL dependency is small, perhaps depending directly on CodecZlib is also reasonable.

dgleich commented 1 year ago

So the simplest thing might be to set up Shapefile.jl so that it accepts any object that iterates over file IOs, where each file has a .name field. For example, you could call:

shp = Shapefile.Table(ZipFile.Reader("myfile.zip").files)

The ".files" object is really a Vector of IOs, so the generic input could be Vector{T} where T <: IO (but this doesn't always give a way to list filenames... hmm...)

This would avoid any dependencies, and still make it pretty easy to use.
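
As a rough illustration of that idea, here is a minimal sketch of the dispatch-by-extension step, using only Base. `NamedBuffer` and `shapefile_parts` are hypothetical names made up for this example and are not part of Shapefile.jl or ZipFile.jl. The only assumption about the input is that each entry has a `name` field, as the ZipFile entries in the routine above do.

```julia
# Hypothetical stand-in for an archive entry: an in-memory IO plus a `name`
# field, mirroring the `.name` field used on ZipFile entries in this thread.
struct NamedBuffer
    name::String
    io::IOBuffer
end

# Map each shapefile component extension to the first entry that carries it.
# Matching is case-insensitive; entries with other extensions are ignored.
function shapefile_parts(entries)
    wanted = (".shp", ".shx", ".dbf", ".prj")
    parts = Dict{String,Any}()
    for entry in entries
        ext = lowercase(last(splitext(entry.name)))
        if ext in wanted && !haskey(parts, ext)
            parts[ext] = entry
        end
    end
    haskey(parts, ".shp") || throw(ArgumentError("no .shp entry found"))
    return parts
end

# Made-up entry names, purely for illustration.
entries = [
    NamedBuffer("roads/TL_ROADS.SHP", IOBuffer(UInt8[])),
    NamedBuffer("roads/tl_roads.dbf", IOBuffer(UInt8[])),
    NamedBuffer("roads/readme.txt", IOBuffer(UInt8[])),
]
parts = shapefile_parts(entries)
```

Because the function only touches `entry.name`, anything that iterates over named entries would work here, which is what keeps the dependency out of the core package.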

It sounds like something similar might exist at some point for Tar files too.

rafaqz commented 1 year ago

@dgleich if you ever want to open a PR for this change, it would be useful.

asinghvi17 commented 3 months ago

This could also be implemented as an extension, with a nice error message saying that you have to load ZipFile.jl for this to work correctly!
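
A minimal sketch of what the error-message half of that could look like, in plain Julia. `read_from_zip` and `open_shapefile` are hypothetical names for illustration only, not the actual Shapefile.jl API. The assumption is that the main package declares a function with no methods and an extension adds the method once ZipFile is loaded.

```julia
# Hypothetical generic function with no methods. In this sketch the main
# package declares it and a package extension would add the real method.
function read_from_zip end

# Hypothetical entry point: route `.zip` paths to the extension method, or
# explain what is missing when no such method has been defined yet.
function open_shapefile(path::AbstractString)
    if endswith(lowercase(path), ".zip")
        if !hasmethod(read_from_zip, Tuple{String})
            throw(ArgumentError(
                "cannot read $(repr(path)): zip support is not loaded. " *
                "Run `import ZipFile` before opening zipped shapefiles."))
        end
        return read_from_zip(String(path))
    end
    return "would read $path from disk"  # placeholder for the regular code path
end

# With no extension loaded, a .zip path yields the explanatory error.
msg = try
    open_shapefile("roads.zip")
catch err
    sprint(showerror, err)
end
```

The `hasmethod` check is just one way to detect the missing extension; registering an error hint for the resulting `MethodError` would be another option.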

asinghvi17 commented 2 months ago

Solved by #113