drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License
425 stars 50 forks source link

Multiple pickle roundtrip serializations cause KeyError #450

Open kujenga opened 3 weeks ago

kujenga commented 3 weeks ago

The following code causes a crash, which can arise when passing cloudpathlib objects to/from subprocesses.

    path1 = CloudPath("s3://bucket/key")
    print(path1.__dict__)
    pkl1 = pickle.dumps(path1)

    path2 = pickle.loads(pkl1)
    print(path2.__dict__)
    pkl2 = pickle.dumps(path2)

    path3 = pickle.loads(pkl2)
    print(path3.__dict__)

The exception raised is:

{'_handle': None, '_client': None, '_str': 's3://bucket/key', '_url': ParseResult(scheme='s3', netloc='bucket', path='/key', params='', query='', fragment=''), '_path': PurePosixPath('/bucket/key'), '_dirty': False}
{'_handle': None, '_str': 's3://bucket/key', '_url': ParseResult(scheme='s3', netloc='bucket', path='/key', params='', query='', fragment=''), '_path': PurePosixPath('/bucket/key'), '_dirty': False}
Traceback (most recent call last):
  File "/path/to/cloudpathlib-pickle-crash.py", line 20, in <module>
    main()
  File "/path/to/cloudpathlib-pickle-crash.py", line 13, in main
    pkl2 = pickle.dumps(path2)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 266, in __getstate__
    del state["_client"]
        ~~~~~^^^^^^^^^^^
KeyError: '_client'

The issue seems to be that these methods are not symmetrical, and the _client field is not restored when the object is unpickled in a way that might mirror this example in the docs https://docs.python.org/3/library/pickle.html#pickle-state at this point in the code: https://github.com/drivendataorg/cloudpathlib/blob/08b018b36f90f89003e0c9e5ebd19030b41c2433/cloudpathlib/cloudpath.py#L262-L272

pjbull commented 3 weeks ago

Thanks @kujenga.

Since _client is created on demand I think the fix here is just to update __getstate__ with:

if "_client" in state:
    del state["_client"]

Would be happy to take a PR that does that and adds a test.

Thanks!

kujenga commented 5 days ago

Sounds good! PR submitted here: https://github.com/drivendataorg/cloudpathlib/pull/454