Open hugolytics opened 2 years ago
Thanks @hugolytics for your thoughts here. It's definitely interesting to think about ways to leverage smart_open, especially given the breadth of backends it supports. We have a number of features we are considering that it could help with (#9, #10, #29).
To me, this issue is similar to the discussion in #96 and #109. There are certainly backend packages like this that handle the operational side of things that we should consider, since our primary purpose is supporting the pathlib API.
That said, we won't merge the PR (#265) as is for a number of reasons, so I'm going to close it:
- smart_open is designed explicitly for streaming, but there are workflows that benefit from the local cache architecture instead. Ideally we'd support both.
- smart_open but keep the consistency that our current *Client/*Path APIs support.

Note that it is worth bumping #92, which lists these alternatives.
I’d love to see smart_open added here too, as an alternative to the cache concept. We have to open large zip files (tens to hundreds of GB) just to read some content in place, and this all works well with cloudpathlib and smart_open (in my fork).
@msmitherdc Thanks for the comment. To better understand your use case, what are the specific things that you want smart_open for? Is it streaming/partial reads/writes or something beyond that case?
We are opening 3D Tiles and I3S (SLPK) mesh files. These are large zip files that we read JSON files out of; reading individual files out of the zip gives us info about the mesh. To serve them to Cesium, we read byte ranges and pass them on to the client. So streaming and partial reads.
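To make the partial-read pattern concrete, here is a minimal sketch using only the standard library. It is not the code from the fork: the archive is built in memory, the member names are made up for illustration, and `CountingReader`, `build_demo_archive` and `read_json_member` are hypothetical helpers. The point is that `zipfile` only needs a seekable file object, so pulling one small JSON member out of a large archive touches a small fraction of the bytes.

```python
import io
import json
import zipfile


class CountingReader(io.BytesIO):
    """Seekable in-memory stream that counts how many bytes are read from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_demo_archive():
    """Build a zip with one large stored member and one small JSON member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Placeholder payload standing in for mesh geometry (illustrative name).
        zf.writestr("nodes/0/geometry.bin", bytes(1_000_000))
        zf.writestr("tileset.json", json.dumps({"asset": {"version": "1.0"}}))
    return buf.getvalue()


def read_json_member(fileobj, name):
    """Parse one JSON member; zipfile seeks to it via the central directory."""
    with zipfile.ZipFile(fileobj) as zf:
        with zf.open(name) as member:
            return json.load(member)


archive = build_demo_archive()
reader = CountingReader(archive)
info = read_json_member(reader, "tileset.json")
# Compare the bytes actually read with the full archive size.
print(info["asset"]["version"], reader.bytes_read, len(archive))
```

In principle a seekable remote stream could be passed to `read_json_member` in place of the in-memory reader, in which case each `seek` plus `read` would become a ranged request. That part is an assumption and is not exercised by this sketch.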
The current implementation of the .open methods downloads the file to a local cache, which is then synchronized with the cloud.
This could be replaced by smart_open, which streams data directly and so avoids downloading the whole file before reading from it.
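The difference between the two strategies can be sketched with a toy model. Everything here is hypothetical and stdlib-only: `FakeBucket` is an in-memory stand-in for a cloud bucket, `open_cached` imitates a download-then-open cache, and `RangeReader` imitates a streaming reader that fetches only the ranges it is asked for. It is not how cloudpathlib or smart_open are implemented, only an illustration of the trade-off.

```python
import io
import tempfile
from pathlib import Path


class FakeBucket:
    """Hypothetical in-memory stand-in for a bucket; counts bytes sent."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.bytes_transferred = 0

    def size(self, key):
        return len(self.objects[key])

    def get_range(self, key, start, end):
        """Return bytes [start, end) of an object, like a ranged GET."""
        chunk = self.objects[key][start:end]
        self.bytes_transferred += len(chunk)
        return chunk


def open_cached(bucket, key, cache_dir):
    """Cache-style open: download the whole object, then open the local copy."""
    local = Path(cache_dir) / key
    local.write_bytes(bucket.get_range(key, 0, bucket.size(key)))
    return local.open("rb")


class RangeReader(io.RawIOBase):
    """Streaming-style open: fetch only the byte ranges that are read."""

    def __init__(self, bucket, key):
        super().__init__()
        self._bucket, self._key = bucket, key
        self._pos = 0
        self._size = bucket.size(key)

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def readinto(self, buffer):
        chunk = self._bucket.get_range(self._key, self._pos, self._pos + len(buffer))
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def compare(payload_size=1_000_000, header_size=16):
    """Return (cached_cost, streamed_cost) in bytes for reading a small header."""
    bucket = FakeBucket({"mesh.slpk": bytes(payload_size)})
    with tempfile.TemporaryDirectory() as cache_dir:
        with open_cached(bucket, "mesh.slpk", cache_dir) as f:
            f.read(header_size)
    cached_cost = bucket.bytes_transferred

    bucket.bytes_transferred = 0
    with RangeReader(bucket, "mesh.slpk") as f:
        f.read(header_size)
    return cached_cost, bucket.bytes_transferred
```

In this model, reading a 16-byte header costs the full object size through the cache path and only the requested bytes through the range reader. The flip side is that a cached copy makes repeated reads free, which is why supporting both styles may be worthwhile.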
One can take inspiration from the S3PathLib library for AWS S3 (however, that library handles the boto session in a way that is not thread-safe, which is why I switched to this library).
Currently, I have subclassed S3Path and reimplemented S3PathLib's approach as follows:
However, I could also open a pull request to merge this into the CloudPath definition, since smart_open is cloud-agnostic.