drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License
471 stars 59 forks source link

Methods that move/remove files on the cloud don't remove the cached version #388

Open parviste-fortum opened 10 months ago

parviste-fortum commented 10 months ago

The following code

import cloudpathlib
s3_path = cloudpathlib.S3Path(f"s3://my-bucket/foo.txt")
s3_path.open("x").close() # Create empty file
s3_path.unlink()
s3_path.open("x").close() # Create empty file

does not work as expected. I would expect it to create a file, remove it, and then create it again. Instead, I get

Traceback (most recent call last):
  File "/workspaces/scoutingtool/backend/../cpltest.py", line 9, in <module>
    s3_path.open("x").close() # Create empty file
    ^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 503, in open
    buffer = self._local.open(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/tmp/tmpg5cu85sn/my-bucket/foo.txt'

This seems to happen because the file is kept in cache, even when it was removed from S3. A simple solution would probably to use the "w" mode, even when "x" was supplied.

pjbull commented 10 months ago

Thanks for the bug report @parviste. Another workaround is to use a different caching mode that more aggressively clears the cache.

I think this is a good bug, and we don't currently update the cache on any of the cloud file move/delete scenarios. It seems worth a fix here to remove files from the cache for the following methods:

parviste-fortum commented 10 months ago

It's probably true that the cache should be cleared after moving/removing files, but at the same time, it also doesn't feel fully correct to necessarily write the cache files using the same mode as the files written to S3.

In particular, it would make sense to me to either:

  1. Always write cache files using mode "x" (even when the "w" mode was specified for opening the remote file), as this would allow catching bugs in cloudpathlib where a cached file was written twice (which shouldn't ever really happen?)
  2. Always write cache files using mode "w" (even when "x" mode was specified for opening the remote file), as this would allow us to be able to survive bugs like the one I encountered by simply overwriting the "stale" file, making the application more robust.

I can't really see a situation where it makes sense to write the cache files using the exact mode specified for the remote file, sometimes "x" sometimes "w".

pjbull commented 10 months ago

It keeps things simpler to not have special cases for different types of writes, so I'd rather have the cache state well managed than make assumptions with the writing mode. For example, if your original example was done with a instead of x and actually added some text to the file, the output would be wrong unless we both mirrored the state of the remote and opened the cache file in the same mode.

We currently do the following on writing to the cloud: (1) check if file is on the remote before writing with x and (2) refresh the cache before doing any writing,

I think the other change we want here is in _refresh_cache to delete the cache file if the remote file does not exist.