drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License

Implement partial or streaming reads/writes (CloudFile abstraction) #9

Open pjbull opened 4 years ago

pjbull commented 4 years ago

Currently, if you want to read a file, we download the whole file and then open it for reading. Most backends have some way to do streaming reads, so we may be able to improve the experience by doing the same.

There may be some tricky bits with caching here. Can we stream to the user and then cache the streamed portion at the same time? Is there an async way that makes sense to do this?

pjbull commented 3 months ago

From @moradology in #455 originally:


Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.

One thing I'm wondering about is range reads. In boto, ranges can be read like this:

import boto3

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023  # First 1KB

response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)

data = response['Body'].read() # Just the bytes we want

I'm new to the lib and certainly haven't gone through the source in detail but I wonder how well the Path abstraction fits with this. Here's what a Pathlib (stdlib) read looks like for only selected ranges:

from pathlib import Path

file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023  # First 1KB

with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? Like, if the file is remote, would cloudpathlib download the whole thing locally and then seek through the bytes, or would it appropriately attempt to read only the bytes as needed?

pjbull commented 3 months ago

Like you point out, there is not a pathlib API for partial or streaming read/write. That is handled by the File abstractions in io.

Our current caching model is whole-file based. So the code above would execute; it would just download the whole file first, which is probably not what a user wants.
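To illustrate what "whole-file based" means here, a simplified sketch of the model (not cloudpathlib's actual code; the `download_fn` stand-in replaces a real network fetch so the example runs locally):

```python
import tempfile
from pathlib import Path

def cached_open(download_fn, cache_path: Path):
    """Sketch of a whole-file cache: always materialize the entire
    object locally before handing back a file handle for reading."""
    if not cache_path.exists():
        # The full object is fetched even if the caller only reads one byte.
        cache_path.write_bytes(download_fn())
    return cache_path.open("rb")

# Usage: a stand-in "remote object" so the sketch is runnable locally.
remote_bytes = b"x" * 2048
cache = Path(tempfile.mkdtemp()) / "object.bin"
with cached_open(lambda: remote_bytes, cache) as f:
    f.seek(0)
    first_kb = f.read(1024)  # served from the local copy; all 2048 bytes were "downloaded"
```

The seek-and-read pattern from the pathlib example works unchanged against the cached handle, but the transfer cost is the full object, not the requested range.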

We have discussed CloudFile abstractions as a potential scope extension to enable these scenarios. Other folks have reported success using smart_open together with cloudpathlib for these cases. That said, it is a fairly complicated implementation (e.g., take a look at smart_open's s3 version). Given that scope, I think a File abstraction is a longer way out.

We also have discussed a CloudPath-only read_range API. This would be substantially easier to implement, but it breaks code that wants to handle both Path and CloudPath in the same way.
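For a sense of the shape such an API might take (the name `read_range` and its signature are hypothetical, not an existing cloudpathlib method): on the cloud side it could translate to a ranged GET, while against a local Path it reduces to seek-and-read:

```python
import tempfile
from pathlib import Path

def read_range(path: Path, start: int, end: int) -> bytes:
    """Hypothetical read_range: return bytes start..end inclusive,
    without reading the rest of the file. A cloud implementation
    would issue a ranged GET (e.g., Range: bytes=start-end) instead."""
    with path.open("rb") as f:
        f.seek(start)
        return f.read(end - start + 1)

# Usage against a local file.
p = Path(tempfile.mkdtemp()) / "data.bin"
p.write_bytes(bytes(range(256)))
chunk = read_range(p, 10, 19)  # bytes 10..19 inclusive
```

The asymmetry pjbull mentions is visible here: a free function (or CloudPath-only method) like this has no counterpart on stdlib Path, so code generic over both path types cannot call it.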

TomNicholas commented 3 months ago

Thanks so much for the engagement here @pjbull ! We're very interested in this (our entire stack / approach to science basically relies on a step like this).

We have discussed CloudFile abstractions

I like the idea of a CloudFile abstraction as a way to keep using pathlib-like syntax for local and remote files. The idea of importing AnyPath and having everything else just work is extremely enticing.

pjbull commented 3 months ago

Yeah, we also like the idea—just are a little wary of the implementation complexity and maintenance burden.

We'd be happy to consider a PR that does the following:

It may be the case that we decide it should be a separate repo/project that we have as an optional dependency and use if available, or we may decide it should be part of the cloudpathlib core.

msmitherdc commented 1 month ago

I've been doing this direct streaming via the smart_open method in #264.