Closed: INRIX-Mark-Gershaft closed this issue 2 years ago
Interesting. First, can you report the pyshp version, and give the full Python stack trace/error message (not just the last line)?
Encoding errors are not typically `OSError`, so I think it's more likely what you say about byte unpacking/args when reading the .shp file.
In the for-loop, do you do any additional processing of the shapefile, or did you show the whole loop?
Best would be if there are any public links or GitHub repos where I can access the file. Or, depending on the file size, could you upload it as an attachment or send a file-sharing link?
No, this is just a smoke-test script that opens each and every shapefile we produce to check that it can be read and that the number of features is not 0. segments_shapefile.zip
Version installed: pyshp 2.1.3
Stack trace (I ran from interpreter session which is why there is no script line):
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1337, in iterShapeRecords
for shape, record in izip(self.iterShapes(), self.iterRecords()):
File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1169, in iterShapes
yield self.__shape()
File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1106, in __shape
f.seek(next)
OSError: [Errno 22] Invalid argument
File is created using Java: https://docs.geotools.org/stable/javadocs/org/geotools/data/shapefile/ShapefileDataStore.html
Just verified: still reproducible in the latest version of PyShp.
Thanks for providing the file and the other details. After some testing, here is a rather lengthy response, which can be summarized as: the file contains corrupted data towards its end. It appears the error happens when trying to read beyond shape #135307, i.e. the error is with #135308:
>>> import shapefile
>>> r = shapefile.Reader('segments_shapefile.zip')
>>> for shape in r.iterShapes():
... pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1429, in iterShapes
shape = self.__shape(oid=i, bbox=bbox)
File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1353, in __shape
f.seek(next)
File "C:\Program Files\Python39\lib\tempfile.py", line 474, in func_wrapper
return func(*args, **kwargs)
OSError: [Errno 22] Invalid argument
>>> shape
Shape #135307: POLYLINE
>>> r.shape(135307)
Shape #135307: POLYLINE
>>> r.shape(135308)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1385, in shape
i = self.__restrictIndex(i)
File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1249, in __restrictIndex
raise IndexError("Shape or Record index out of range.")
IndexError: Shape or Record index out of range.
It looks like the shapefile has a length of 135308 but the reader is trying to read beyond the end of the file:
>>> r.numRecords # determined from dbf header
135308
>>> r.numShapes # determined from shx header
135308
>>> len(r._offsets) # index positions read from the shx file
135308
The error happens only with `shapes()` and `iterShapes()`, since these methods keep iterating until they have reached the actual end of the .shp file. Apparently, the .shp file contains 18464768 bytes, which is more than the end of the last shape (#135307), located at byte position 18462948:
>>> from struct import unpack
>>> # seek to end of last shape
>>> last_shape_offset = r._offsets[135307]
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> print(recNum, recLength)
135308 104
>>> r.shp.tell() + (2 * recLength) # end of last shape according to shape header
18462948
>>> r.shp.seek(0, 2) # actual end of file
18464768
Reading the actual contents of the last shape confirms that the end of the last shape should be at byte position 18462948:
>>> from struct import unpack
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> shapeType = unpack("<i", r.shp.read(4))[0]
>>> bbox = unpack("<4d", r.shp.read(32))
>>> nParts = unpack("<i", r.shp.read(4))[0]
>>> nPoints = unpack("<i", r.shp.read(4))[0]
>>> parts = unpack("<%si" % nParts, r.shp.read(nParts * 4))
>>> points = unpack("<%sd" % (2 * nPoints), r.shp.read(16*nPoints))
>>> r.shp.tell() # end of last shape according to data contents
18462948
Similarly, the shapefile header also agrees that the length of the file should be 18462948:
>>> # read the shapefile header's `File Length` entry
>>> r.shp.seek(24)
24
>>> unpack(">i", r.shp.read(4))[0] * 2
18462948
Trying to read the beginning of the next shape results in unexpected data:
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> (recNum, recLength)
(50331648, -1049923325)
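That garbage pair could be caught by a simple plausibility check on the record header. The following is a sketch of such a check (the function name and thresholds are illustrative, not pyshp API): a believable header has a positive record number and a content length that fits in the remaining file.

```python
import struct


def looks_like_record_header(raw, file_size, pos):
    """Heuristic sanity check for an 8-byte .shp record header.

    raw: the 8 header bytes (big-endian record number + content
    length, the latter measured in 16-bit words per the shp format).
    pos: byte offset in the file where the header starts.
    """
    rec_num, rec_length = struct.unpack(">2i", raw)
    if rec_num <= 0 or rec_length < 0:
        return False
    # Record content must end within the file
    return pos + 8 + 2 * rec_length <= file_size
```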
So this is an issue of the writer erroneously writing too much data, or of the header containing incorrect information, depending on how you view it. Since both the shx and dbf headers say there should be 135308 shapes/records, and since the beginning of the next shape contains unexpected data, I am inclined to believe that the extra data got corrupted in some way and that the headers report the correct number of non-corrupted shapes/records. If I were to speculate, I wonder if the Java writer tried to write additional data and, when it failed for some reason, wrote the number of successfully written entries to the header but did not remove the partially written extra data from the files.
Looking at how fiona/GDAL handle the file, it appears they also would rather believe the information in the shapefile header to determine the contents of the file:
>>> import fiona
>>> r = fiona.open('segments_shapefile/SegmentReference.shp')
>>> print(len(r))
135308
>>> r[135308]
ERROR 1: Attempt to read shape with feature id (135308) out of available range.
This does highlight, however, that pyshp's approach to reading corrupted files like this should be updated (see also #147, #223). It looks like a previous decision was made to believe the actual length of the shapefile rather than the number of shapes specified in the shapefile header, according to a comment in the code: "Found shapefiles which report incorrect shp file length in the header. Can't trust that so we seek to the end of the file and figure it out."
This is why `shapes()` and `iterShapes()` continue to read until the end of the file, which ends up failing when additional corrupted data has been written to the end of the file (as in this case). It should be noted that the dbf `records()` and `iterRecords()` do not read until the end of the file, but rather only up to the number of records stated in the dbf header.
In short, the question boils down to whether we believe the number of shapes stated in the header, or the number implied by the length of the file. Since the dbf reader goes by the header information, and since fiona/GDAL also appear to follow the header information, I'm inclined to think we should follow the header instead, potentially with a simple check to abort early if there's not enough data left in the file. Any opinion on this @GeospatialPython?
One possible problem is that in the absence of the shx file, we don't know the number of shapes in the .shp file. A possible solution:
- If the shx file is available, set `numShapes` based on the number of shapes defined in the header, and use that to iterate, also setting the `numShapes` attribute for metadata purposes.
- If the shx file is missing, iterate until we reach the end of the .shp file (as given by `shpLength`). Once we have reached the end of the shp file, record the number of shapes by setting the `numShapes` attribute for metadata purposes.

It would also be possible to skip through all shapes to count them and use this to set the `numShapes` attribute when first loading the shapefile. Although we can probably do this pretty fast by only reading the first 8 bytes of each shape to skip to the start of the next shape, this might be an unnecessary amount of processing for a reader that should be as lazy as possible.
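The skip-through counting idea can be sketched with plain `struct`, trusting the 100-byte header's file-length field (stored at offset 24, in 16-bit words) to know where to stop; this is a minimal illustration, not pyshp's implementation:

```python
import io
import struct


def count_shapes(shp_bytes):
    """Count shapes in a .shp file by hopping over record headers.

    Reads only the 8-byte record header (big-endian record number +
    content length in 16-bit words) and seeks past the content,
    without parsing any geometry. Stops at the file length declared
    in the header, so trailing garbage is ignored.
    """
    f = io.BytesIO(shp_bytes)
    f.seek(24)
    file_length = struct.unpack(">i", f.read(4))[0] * 2  # words -> bytes
    f.seek(100)  # the main shapefile header is always 100 bytes
    count = 0
    while f.tell() < file_length:
        header = f.read(8)
        if len(header) < 8:
            break  # truncated file
        rec_num, rec_length = struct.unpack(">2i", header)
        f.seek(2 * rec_length, 1)  # skip content (length is in words)
        count += 1
    return count
```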
This file should be read correctly in v2.3.0, which adds more robust handling of shapefiles with corrupted data based on the discussion in this thread, believing the header metadata rather than reading to the end of the file. See specific implementation details at d1a8fed31d871cbdb9db020bcc2d715036d3b33a and ff51716e7ffe2a09724f979844eb3bad5c27ebb3.
On some shapefiles containing data from OSM (i.e. possibly encoding or special characters in the road name column) I am getting `OSError: [Errno 22] Invalid argument`.
When I debugged it, what I see is that after reading the last feature the reader is still not at the end of the file, so in `iterShapes()` it thinks there are more shapes and tries to read one.
Then in `__shape()` it parses `recLength`, which is garbage (some negative number), computes the next record's location, and seeks somewhere beyond the file's valid range.
Since the file has 600K features I don't really know how to help further. What is strange, though, is that QGIS and GeoPandas are able to parse the same file.