GeospatialPython / pyshp

This library reads and writes ESRI Shapefiles in pure Python.
MIT License

shapes and records are not picklable #238

Closed pooya-mohammadi closed 2 years ago

pooya-mohammadi commented 2 years ago

Records that are created using shapefile.Reader are not picklable. I want to do a series of processes with multiprocessing, but the shapes do not become pickle objects and the process fails.

import shapefile
import pickle

counties = shapefile.Reader("counties.shp")
records = counties.records()

# round-tripping a single record fails with RecursionError
buf = pickle.dumps(records[0])
record_0 = pickle.loads(buf)

Error message:

[Previous line repeated 995 more times]
RecursionError: maximum recursion depth exceeded
karimbahgat commented 2 years ago

From your code example, the part that fails is trying to pickle a record. Can you clarify if it's records or shapes that fail to pickle?

Either way, the error looks to be an issue of highly recursive class structures, though I'm not sure why the Shape class would be recursive; it simply contains flat list attributes and a few string/int attributes.
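One pattern that can make an otherwise flat class blow the recursion limit under pickling (just an illustrative guess, not a confirmed diagnosis of pyshp's classes) is a custom __getattr__ that reads instance state pickle has not restored yet:

class Fragile:
    def __init__(self):
        self.data = {"a": 1}

    def __getattr__(self, name):
        # only called for attributes that are missing; if self.data is
        # itself missing, this line re-enters __getattr__ forever
        return self.data[name]

# pickle rebuilds instances without running __init__, roughly like this:
ghost = Fragile.__new__(Fragile)
ghost.anything  # RecursionError: maximum recursion depth exceeded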

For your use case, however, I would suggest dumping to a JSON string instead of pickling; it's easier to work with and can easily be turned back into a Shape instance if needed:

import json
import shapefile

reader = shapefile.Reader("counties.shp")  # the reader from the example above
shape = reader.shape(0)
shape_geojson_string = json.dumps(shape.__geo_interface__)
# send to multiprocessing
# ...
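# on the receiving end, json.loads turns the string back into a plain
# GeoJSON-style dict (the inverse of json.dumps above); "Polygon" here
# assumes the file contains polygon shapes
shape_dict = json.loads(shape_geojson_string)
print(shape_dict["type"])  # e.g. "Polygon"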
pooya-mohammadi commented 2 years ago

@karimbahgat Shapes are fine, but the records are not picklable. I don't pickle them directly; the multiprocessing library pickles each task and sends it to a worker.

karimbahgat commented 2 years ago

Not sure why the Record instances don't pickle, but this is a common problem with multiprocessing, since not all Python classes can be pickled. A common solution is to convert your data to a simpler intermediate format before sending it to multiprocessing. So in your case, don't send the Record instances as input tasks; instead, convert them to another data structure such as a dict, e.g. record.as_dict(). Sending a dict to multiprocessing should be unproblematic. Remember to also update the worker function that does the actual work so that it expects the same data structure.
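A minimal sketch of that approach, reusing counties.shp from your example; process_record and the "STATE_NAME" field are hypothetical stand-ins for your real worker and fields:

import shapefile
from multiprocessing import Pool

def process_record(record_dict):
    # the worker receives a plain dict, which pickles without issue;
    # "STATE_NAME" is a hypothetical field name used for illustration
    return record_dict.get("STATE_NAME")

if __name__ == "__main__":
    reader = shapefile.Reader("counties.shp")
    # convert each Record to a plain dict before handing it to the pool
    tasks = [record.as_dict() for record in reader.records()]
    with Pool() as pool:
        results = pool.map(process_record, tasks)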