drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License

Implement smart_open instead of .open() to allow efficient streaming (saving/loading) of large files to a cloud bucket. #264

Open · hugolytics opened 2 years ago

hugolytics commented 2 years ago

The current implementation of the .open method relies on a local cache, which is then synchronized with the cloud.

That mechanism can be replaced by smart_open, which streams data directly to and from the cloud and is more efficient for large files.

One can take inspiration from AWS' S3PathLib (however, that library handles the boto session in a way that is not thread-safe, which is what made me switch to this library).

Currently, I have subclassed S3Path and adapted S3PathLib's implementation as follows:

from typing import Optional

import smart_open

from cloudpathlib import S3Path as BaseS3Path


# Replace the .open method of S3Path with smart_open so that reads and
# writes stream directly to and from S3 instead of going through the
# local cache.
class S3Path(BaseS3Path):
    def open(
        self,
        mode="r",
        buffering=-1,
        encoding=None,
        errors=None,
        newline=None,
        closefd=True,
        opener=None,
        ignore_ext=False,  # accepted for compatibility; smart_open 6.x replaced it with `compression`
        compression=None,
        api_kwargs: Optional[dict] = None,  # accepted for compatibility; unused here
    ):
        """
        Open S3Path as a file-like object.
        :return: a file-like object.
        See https://github.com/RaRe-Technologies/smart_open for more info.
        """
        kwargs = dict(
            uri=self.as_uri(),
            mode=mode,
            buffering=buffering,
            encoding=encoding,
            errors=errors,
            newline=newline,
            closefd=closefd,
            opener=opener,
            # smart_open expects the underlying boto3 client, which
            # cloudpathlib's S3Client exposes as `.client`.
            transport_params={"client": self.client.client},
        )
        # Only forward compression when set explicitly, so smart_open's
        # extension-based inference remains the default.
        if compression is not None:
            kwargs["compression"] = compression
        return smart_open.open(**kwargs)

    def read_text(self, encoding="utf-8", errors=None) -> str:
        with self.open(mode="r", encoding=encoding, errors=errors) as f:
            return f.read()

    def read_bytes(self) -> bytes:
        with self.open(mode="rb") as f:
            return f.read()

    def write_text(self, data: str, encoding="utf-8", errors=None, newline=None):
        with self.open(mode="w", encoding=encoding, errors=errors, newline=newline) as f:
            f.write(data)

    def write_bytes(self, data: bytes):
        with self.open(mode="wb") as f:
            f.write(data)

However, I could also open a pull request to merge this into the base CloudPath class, since smart_open is cloud-agnostic.
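For illustration, here is a minimal usage sketch of the subclass above (the bucket and key are placeholders); each read streams from S3 on demand rather than populating the local cache first:

# hypothetical bucket/key, for illustration only
path = S3Path("s3://my-bucket/very-large-file.jsonl")

# Stream line by line: smart_open fetches chunks from S3 as needed
# instead of downloading the whole object up front.
with path.open(mode="r") as f:
    for i, line in enumerate(f):
        if i >= 10:
            break
        print(line.rstrip())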

pjbull commented 2 years ago

Thanks @hugolytics for your thoughts here. It's definitely interesting to think about ways to leverage smart_open, especially given the breadth of backends it supports. We have a number of features we are considering that it could help with (#9, #10, #29).

To me, this issue is similar to the discussion in #96 and #109. There are certainly backend packages like this that handle the operational side of things, which we should consider, since our primary purpose is supporting the pathlib API.

That said, we won't merge the PR (#265) as-is for a number of reasons, so I'm going to close it.

Note that it is worth bumping #92, which lists these alternatives.

msmitherdc commented 1 year ago

I'd love to see smart_open added here as well, as an alternative to the cache concept. We have to open large zip files (10s to 100s of GB) just to read some content in place, and this all works well with cloudpathlib and smart_open (in my fork).

pjbull commented 1 year ago

@msmitherdc Thanks for the comment. To better understand your use case, what are the specific things that you want smart_open for? Is it streaming/partial reads/writes or something beyond that case?

msmitherdc commented 1 year ago

We are opening 3D Tiles and I3S (SLPK) mesh files. These are large zip files that we read JSON files out of; those reads give us info about the mesh. To serve them out for Cesium, we read byte ranges and return them to the client. So: streaming and partial reads.
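For reference, the partial-read pattern described above can be sketched with smart_open plus the standard-library zipfile module; the URI and member name below are hypothetical:

import json
import zipfile

import smart_open

# Open the remote archive as a seekable binary stream. zipfile seeks to a
# single member's offset, so roughly only the needed byte ranges are
# fetched rather than the full multi-GB archive.
with smart_open.open("s3://my-bucket/tileset.slpk", "rb") as fileobj:  # hypothetical URI
    with zipfile.ZipFile(fileobj) as archive:
        metadata = json.loads(archive.read("metadata.json"))  # hypothetical member name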