GeospatialPython / pyshp

This library reads and writes ESRI Shapefiles in pure Python.
MIT License

Error 'argument out of range', when .shp got 4.29 Gb. How to fix it? #198

Closed ivan-bulka closed 3 years ago

ivan-bulka commented 4 years ago

Hi guys, can you help me, please? I've hit an error and spent the whole day trying to solve it, without success. I work with big data. I defined a function that reads data from files and writes it to a shapefile, but whenever the .shp file reaches 4.29 GB I get an error. Here is my code:

import shapefile

# `final` is the output base name, defined earlier in the session
w = shapefile.Writer(f'shapefiles/test/{final}', shapeType=5)  # 5 = POLYGON
w.autoBalance = 1
w.field('point_1', 'C', 10)
with open(f'{final}_long_double.txt', 'r') as long:
    for i in long:
        i_1, i_2 = i.split()
        # re-read the short file for every line of the long file
        with open(f'{final}_short_double.txt', 'r') as short:
            for k in short:
                k_1, k_2 = k.split()
                # one closed rectangular ring (5 points) per record
                value = [[[float(i_1), float(k_1)], [float(i_2), float(k_1)], [float(i_2), float(k_2)], [float(i_1), float(k_2)], [float(i_1), float(k_1)]]]
                w.poly(value)
w.close()

Here is the error:

error                                     Traceback (most recent call last)
<ipython-input-23-5c0408f04667> in <module>
     14                     value = [[[float(i_1),float(k_1)],[float(i_2),float(k_1)],[float(i_2),float(k_2)],[float(i_1),float(k_2)],[float(i_1),float(k_1)]]]
     15 #                     row_str = f'{i_1}', f'{k_1}', f'{i_2}', f'{k_2}'
---> 16                     w.poly(value)
     17 #                     w.record()
     18 w.close()

C:\ProgramData\Anaconda3\lib\site-packages\shapefile.py in poly(self, polys)
   1693         If some of the polygons are holes, these must run in a counterclockwise direction."""
   1694         shapeType = POLYGON
-> 1695         self._shapeparts(parts=polys, shapeType=shapeType)
   1696 
   1697     def polym(self, polys):

C:\ProgramData\Anaconda3\lib\site-packages\shapefile.py in _shapeparts(self, parts, shapeType)
   1758                 polyShape.points.append(point)
   1759         # write the shape
-> 1760         self.shape(polyShape)
   1761 
   1762     def field(self, name, fieldType="C", size="50", decimal=0):

C:\ProgramData\Anaconda3\lib\site-packages\shapefile.py in shape(self, s)
   1351         # Write to file
   1352         offset,length = self.__shpRecord(s)
-> 1353         self.__shxRecord(offset, length)
   1354 
   1355     def __shpRecord(self, s):

C:\ProgramData\Anaconda3\lib\site-packages\shapefile.py in __shxRecord(self, offset, length)
   1499          """Writes the shx records."""
   1500          f = self.__getFileObj(self.shx)
-> 1501          f.write(pack(">i", offset // 2))
   1502          f.write(pack(">i", length))
   1503 

error: argument out of range

When I work with a small amount of data, the script works, but with a large amount I get the error. I see the same issue on Ubuntu and Windows. Why does it happen, and how can I solve it? Thank you for your help.

karimbahgat commented 4 years ago

The 4.29 GB file size is indeed what causes the problem, or more specifically, the number of records in the shapefile.

Can you check the length of the shapefile (len(w)) or the number of records added? Most likely it's larger than 2,147,483,647 (2.1 billion) records. Your code throws the error when it's trying to write the index number of the added record in the shx (index) file, but the index of the record you are adding seems to be higher than the maximum value of the signed 32-bit integer (2,147,483,647) used to store the index number. This is per the format specification, and is therefore a limitation of the shapefile file format. So there is nothing really to do here except maybe split up your data into multiple files.

At least it's good to know that 2.1b records is the limit in terms of doing big data with shapefiles.

ivan-bulka commented 4 years ago

Hi Karim, thank you for the response. I checked the length of the file: it has only 31,580,643 records, so we did not exceed the limit on the number of records. We also checked our data. It is correct and has no gaps or None values. Could it be that the logic of my algorithm is incorrect?

Thanks,

karimbahgat commented 4 years ago

After a second look, it seems I misspoke. The line that fails is f.write(pack(">i", offset // 2)), meaning it's not writing the index of the shape being added, but rather its byte offset (the position in the .shp file where the record starts). The // 2 part is there because the shapefile spec says to write the offset and length in "16-bit words", that is, in units of 2 bytes.

This means that, if the code is working correctly and the offset is indeed surpassing the limit of 2,147,483,647, then the true byte offset is double that, 4,294,967,294. That happens to equal 4.29 GB, the file size you say you're reaching. So this indeed looks like a legitimate error, caused by shapefiles storing their offset values as signed 32-bit integers (the ">i" format in pack).
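The failure can be reproduced without writing a 4 GB file. The following is a minimal sketch using only the standard struct module; the helper name pack_shx_offset is made up for illustration and is not part of pyshp, it only mirrors the pack(">i", offset // 2) call from the traceback.

```python
import struct

INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest value ">i" can hold


def pack_shx_offset(byte_offset):
    """Hypothetical helper: pack a byte offset the way the failing line does,
    as a big-endian signed 32-bit count of 16-bit words."""
    return struct.pack(">i", byte_offset // 2)


# A record starting at byte 4,294,967,294 still fits.
packed_ok = pack_shx_offset(2 * INT32_MAX)

# A record starting two bytes later does not.
try:
    pack_shx_offset(2 * INT32_MAX + 2)
    overflowed = False
except struct.error:
    overflowed = True
```

The exact wording of the struct.error message differs between Python versions, so the sketch only checks the exception type.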

So I was wrong to say before that it's a limitation on the number of shapes; rather, it's the combined total byte size of all your shapes, meaning the .shp file cannot exceed 4.29 GB. Unfortunately, this means you have reached the file size limit of the shapefile format and have to break up the data into multiple files.
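As a rough check of that arithmetic, here is a sketch that estimates how many records fit in one .shp file. It assumes every record is a single-ring, five-point polygon like the ones in the posted code, a 100-byte file header, and the polygon record layout from the ESRI shapefile specification. The function names are hypothetical and are not part of pyshp.

```python
HEADER_BYTES = 100     # fixed .shp file header size
INT32_MAX = 2**31 - 1  # largest offset (in 16-bit words) the .shx can store


def polygon_record_bytes(num_parts, num_points):
    """Size in bytes of one polygon record, including its 8-byte record header."""
    record_header = 8  # record number + content length
    # shape type + bounding box + numParts + numPoints + part indices + XY points
    content = 4 + 32 + 4 + 4 + 4 * num_parts + 16 * num_points
    return record_header + content


def max_records_per_file(record_bytes):
    """Number of equal-sized records whose start offsets all fit in the .shx."""
    max_offset = 2 * INT32_MAX + 1  # largest byte offset with offset // 2 <= INT32_MAX
    return (max_offset - HEADER_BYTES) // record_bytes + 1


record_bytes = polygon_record_bytes(num_parts=1, num_points=5)
limit = max_records_per_file(record_bytes)
```

Under these assumptions each record takes 136 bytes and 31,580,642 records fit, so the 31,580,643rd record is the first one whose offset can no longer be written. That lines up with the 31,580,643 records reported earlier in the thread. When splitting the data, starting a new shapefile before reaching this count (for example, one output file per fixed number of lines from the long file) would keep each .shp under the limit.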

karimbahgat commented 4 years ago

Note to self: raise more informative exception when exceeding the max shapefile size

karimbahgat commented 3 years ago

Added explicit exception for this, available in next version.