GeospatialPython / pyshp

This library reads and writes ESRI Shapefiles in pure Python.
MIT License

OSError Invalid argument when reading shape file #235

Closed INRIX-Mark-Gershaft closed 2 years ago

INRIX-Mark-Gershaft commented 2 years ago

On some shapefiles containing data from OSM (possibly due to encoding or special characters in the road name column), I am getting OSError: Invalid argument.

reader = shapefile.Reader(shape_file, encodingErrors='replace')
fields = reader.fields[1:]
field_names = [field[0] for field in fields]
feature_count = 0
for sr in reader.iterShapeRecords():
    atr = dict(zip(field_names, sr.record))
    geom = sr.shape.__geo_interface__
    feature_count += 1

When I debugged this, I saw that after reading the last feature the reader is still not at the end of the file, so iterShapes() thinks there are more shapes and tries to read them:

while shp.tell() < self.shpLength:
    yield self.__shape()

Then __shape() parses recLength, which is garbage (some negative number), computes the next record's location from it, and ends up with an offset beyond the file's valid range.

(recNum, recLength) = unpack(">2i", f.read(8))
next = f.tell() + (2 * recLength)
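To make the failure mode concrete, here is a minimal, self-contained illustration of the record-header math (the values are examples, not taken from the file; this is not pyshp code). Each .shp record starts with an 8-byte big-endian header holding the record number and the content length in 16-bit words, so a garbage negative length produces an invalid seek target:

```python
from struct import pack, unpack

# A well-formed record header: record number 1, content length 104
# (in 16-bit words, so 208 bytes of content follow the header).
header = pack(">2i", 1, 104)
rec_num, rec_length = unpack(">2i", header)
next_offset = 8 + 2 * rec_length         # header + content, in bytes

# Garbage bytes read past the last valid record unpack to a negative
# length; 2 * recLength then yields an invalid seek target, which is
# what makes f.seek(next) raise OSError.
garbage = pack(">2i", 7, -12345)
_, bad_length = unpack(">2i", garbage)
bad_step = 2 * bad_length                # negative step -> bad offset
```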

Since the file has 600K features, I don't really know how to help further. What is strange, though, is that QGIS and GeoPandas are able to parse the same file.

karimbahgat commented 2 years ago

Interesting. First, can you report the pyshp version and give the full Python stack trace/error message (not just the last line)?

Encoding errors are not typically OSError, so I think it's more likely what you say about byte unpacking/args when reading the .shp file.

In the for-loop, do you do any additional processing of the shapefile, or did you show the whole loop?

Best would be a public link or GitHub repo where I can access the file. Or, depending on the file size, could you upload it as an attachment or send a file-sharing link?

INRIX-Mark-Gershaft commented 2 years ago

No, this is just a smoke-test script that opens each and every shapefile we produce to check that it can be read and that the number of features is not 0. segments_shapefile.zip

Version installed: pyshp 2.1.3

Stack trace (I ran from interpreter session which is why there is no script line):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1337, in iterShapeRecords
    for shape, record in izip(self.iterShapes(), self.iterRecords()):
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1169, in iterShapes
    yield self.__shape()
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1106, in __shape
    f.seek(next)
OSError: [Errno 22] Invalid argument

File is created using Java: https://docs.geotools.org/stable/javadocs/org/geotools/data/shapefile/ShapefileDataStore.html

Just verified: still reproducible in the latest version of PyShp.

karimbahgat commented 2 years ago

Thanks for providing the file and the other details. After some testing, here is a rather lengthy response, which can be summarized as: the file contains corrupted data towards the end. The error happens when trying to read beyond shape #135307, i.e., when attempting to read the record that would follow the last valid shape:

>>> import shapefile
>>> r = shapefile.Reader('segments_shapefile.zip')
>>> for shape in r.iterShapes():
...      pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1429, in iterShapes  
    shape = self.__shape(oid=i, bbox=bbox)
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1353, in __shape     
    f.seek(next)
  File "C:\Program Files\Python39\lib\tempfile.py", line 474, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 22] Invalid argument
>>> shape
Shape #135307: POLYLINE
>>> r.shape(135307)
Shape #135307: POLYLINE
>>> r.shape(135308)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1385, in shape       
    i = self.__restrictIndex(i)
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1249, in __restrictIndex
    raise IndexError("Shape or Record index out of range.")
IndexError: Shape or Record index out of range.

It looks like the shapefile contains 135308 shapes, but the reader is trying to read beyond the end of the file:

>>> r.numRecords # determined from dbf header
135308
>>> r.numShapes # determined from shx header
135308
>>> len(r._offsets) # index positions read from the shx file
135308

The error happens only with shapes() and iterShapes(), since these methods keep iterating until they reach the actual end of the .shp file. Apparently, the .shp file contains 18464768 bytes, which is more than the end of the last shape (#135307) at byte position 18462948:

>>> from struct import unpack
>>> # seek to end of last shape
>>> last_shape_offset = r._offsets[135307]
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> print(recNum, recLength)
135308 104
>>> r.shp.tell() + (2 * recLength) # end of last shape according to shape header
18462948
>>> r.shp.seek(0, 2) # actual end of file
18464768

Reading the actual contents of the last shape confirms that the end of the last shape should be at byte position 18462948:

>>> from struct import unpack
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> shapeType = unpack("<i", r.shp.read(4))[0]
>>> bbox = unpack("<4d", r.shp.read(32))
>>> nParts = unpack("<i", r.shp.read(4))[0]
>>> nPoints = unpack("<i", r.shp.read(4))[0]
>>> parts = unpack("<%si" % nParts, r.shp.read(nParts * 4))
>>> points = unpack("<%sd" % (2 * nPoints), r.shp.read(16*nPoints))
>>> r.shp.tell() # end of last shape according to data contents
18462948

Similarly, the shapefile header also agrees that the length of the file should be 18462948:

>>> # read the shapefile header's `File Length` entry
>>> r.shp.seek(24)
24
>>> unpack(">i", r.shp.read(4))[0] * 2
18462948

Trying to read the beginning of the next shape results in unexpected data:

>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> (recNum, recLength)                               
(50331648, -1049923325)

So this is either an issue of the writer erroneously writing too much data, or of the header containing incorrect information, depending on how you view it. Since both the shx and dbf headers say there should be 135308 shapes/records, and since the beginning of the next shape contains unexpected data, I am inclined to believe that the extra data got corrupted in some way and that the headers report the correct number of non-corrupted shapes/records. If I were to speculate, I wonder if the Java writer tried to write additional data and, when it failed for some reason, wrote the number of successfully written entries to the header but did not remove the partially written extra data from the files.

Looking at how fiona/GDAL handle the file, it appears they too trust the header information to determine the contents of the file:

>>> import fiona
>>> r = fiona.open('segments_shapefile/SegmentReference.shp')
>>> print(len(r))
135308
>>> r[135308]
ERROR 1: Attempt to read shape with feature id (135308) out of available range.

This does highlight, however, that pyshp's approach to reading corrupted files like this should be updated (see also #147, #223). A previous decision was made to trust the actual length of the shapefile rather than the number of shapes specified in the header, according to a comment in the code: "Found shapefiles which report incorrect shp file length in the header. Can't trust that so we seek to the end of the file and figure it out." This is why shapes() and iterShapes() continue reading until the end of the file, which fails when additional corrupted data has been written to the end of the file (as in this case). Note that the dbf records() and iterRecords() do not read until the end of the file, but only as many records as the dbf header states.

In short, the question boils down to whether we trust the number of shapes stated in the header or the actual length of the file. Since the dbf reader goes by the header information, and since fiona/GDAL also appear to follow it, I'm inclined to think we should follow the header instead, potentially with a simple check that aborts early if there is not enough data left in the file. Any opinion on this @GeospatialPython?
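The early-abort check mentioned above could look something like this (names are hypothetical, not pyshp API; the byte positions are the ones found for this file):

```python
from struct import pack, unpack

# Illustrative sketch only: before trusting a record header, verify that
# the declared content length is non-negative and that the record fits
# within the actual file length.
def validate_record_header(header_bytes, offset, shp_length):
    rec_num, rec_length = unpack(">2i", header_bytes)
    end = offset + 8 + 2 * rec_length    # header + content, in bytes
    if rec_length < 0 or end > shp_length:
        raise ValueError("record %s exceeds the .shp file length" % rec_num)
    return rec_num, end

# The corrupted trailing header found above would be rejected:
corrupt = pack(">2i", 50331648, -1049923325)
try:
    validate_record_header(corrupt, 18462948, 18464768)
    rejected = False
except ValueError:
    rejected = True
```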

karimbahgat commented 2 years ago

One possible problem is that in the absence of the shx file, we don't know the number of shapes in the .shp file. A possible solution:

  1. When the shx file is present, set numShapes based on the number of shapes defined in the header, and use that to iterate.
  2. When a dbf file is present, use the record count from the dbf header as the basis for iterating (a correct shapefile has the same number of shapes as dbf records). Once we have reached the end of the shp file, record the number of shapes by setting the numShapes attribute for metadata purposes.
  3. When we only have the shp file, have a conditional clause that instead iterates incrementally until we reach the shapefile length as specified in the shapefile header (shpLength). Once we have reached the end of the shp file, record the number of shapes by setting the numShapes attribute for metadata purposes.
  4. In all cases, check and raise an error if each read will exceed the actual length of the file (eg partially written shape, or incorrect shapefile file length header).

It would also be possible to skip through all shapes to count them and use this to set the numShapes attribute when first loading the shapefile. Although we can probably do this quite fast by reading only the first 8 bytes of each shape and skipping to the start of the next, this might be an unnecessary amount of processing for a reader that should be as lazy as possible.
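The skip-counting idea can be sketched on a fake record stream (a real .shp file would also have a 100-byte file header before the records): read only each record's 8-byte header and seek past its content.

```python
from io import BytesIO
from struct import pack, unpack

# Build a tiny fake .shp record stream with three records whose content
# lengths (in 16-bit words) vary.
shp = BytesIO()
for rec_num, rec_length in [(1, 24), (2, 10), (3, 104)]:
    shp.write(pack(">2i", rec_num, rec_length))
    shp.write(b"\x00" * (2 * rec_length))    # dummy record content
shp_length = shp.tell()

# Count shapes lazily: read each 8-byte header, then seek past the
# content (whence=1 seeks relative to the current position).
shp.seek(0)
num_shapes = 0
while shp.tell() < shp_length:
    _, rec_length = unpack(">2i", shp.read(8))
    shp.seek(2 * rec_length, 1)
    num_shapes += 1
```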

karimbahgat commented 2 years ago

This file should be read correctly in v2.3.0, which adds more robust handling of shapefiles with corrupted data based on the discussion in this thread, trusting the metadata rather than reading to the end of the file. See specific implementation details at d1a8fed31d871cbdb9db020bcc2d715036d3b33a and ff51716e7ffe2a09724f979844eb3bad5c27ebb3.