Open pjbull opened 4 years ago
From @moradology in #455 originally:
Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.
One thing I'm wondering about is range reads. In boto, ranges can be read like this:
import boto3
s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023 # First 1KB
response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)
data = response['Body'].read() # Just the bytes we want
I'm new to the lib and certainly haven't gone through the source in detail, but I wonder how well the Path abstraction fits with this. Here's what a pathlib (stdlib) read looks like for only selected ranges:
from pathlib import Path
file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023 # First 1KB
with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)
It occurs to me that the expected behavior in this instance is a bit ambiguous, right? If the file is remote, would cloudpathlib download the whole thing locally and then seek through the bytes, or would it read only the bytes that are needed?
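For the plain HTTP case, the equivalent of the boto3 call above is a `Range` request header. The following is only a minimal sketch using the standard library; the helper names `build_range_request` and `read_http_range` are made up for illustration and are not part of cloudpathlib, and it assumes the server actually supports range requests.

```python
import urllib.request


def build_range_request(url, start_byte, end_byte):
    """Build a GET request for bytes start_byte..end_byte (inclusive).

    Hypothetical helper for illustration; not a cloudpathlib API.
    """
    if start_byte < 0 or end_byte < start_byte:
        raise ValueError("expected 0 <= start_byte <= end_byte")
    request = urllib.request.Request(url)
    # Same inclusive byte-range syntax as the boto3 Range argument.
    request.add_header("Range", f"bytes={start_byte}-{end_byte}")
    return request


def read_http_range(url, start_byte, end_byte):
    """Fetch only the requested bytes, failing if the server ignores Range."""
    request = build_range_request(url, start_byte, end_byte)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the range was honored; a 200 means the
        # server is sending the whole body instead.
        if response.status != 206:
            raise RuntimeError(f"range not honored, got HTTP {response.status}")
        return response.read()
```

Not every HTTP server honors `Range`, which is one of the ways HTTP is less filesystem-like than the cloud providers, so the status check matters here.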
Like you point out, there is not a pathlib API for partial or streaming read/write. That is handled by the File abstractions in io.
Our current caching model is whole-file based. So the code above would execute, but it would download the whole file first, which is probably not what a user wants.
We have discussed CloudFile abstractions as a potential scope extension to enable these scenarios. Other folks have reported success using smart_open and cloudpathlib together for these scenarios. That said, it is a pretty complicated implementation (e.g., take a look at the smart_open s3 version). Given that scope, I think a File abstraction is a longer way out.
We have also discussed a CloudPath-only read_range API. This would be substantially easier to implement, but it breaks code that wants to handle both Path and CloudPath in the same way.
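To make the trade-off concrete, here is a rough sketch of how calling code could paper over that split with a free function. Everything in it is hypothetical: no `read_range` method exists on CloudPath today, and the `read_range` helper below is invented for illustration. It simply prefers a `read_range` method when the path object has one and otherwise falls back to the stdlib open/seek/read pattern.

```python
from pathlib import Path


def read_range(path, start_byte, end_byte):
    """Return bytes start_byte..end_byte (inclusive) from a path-like object.

    Hypothetical helper: `read_range` as a method is an assumed API,
    not something cloudpathlib provides.
    """
    method = getattr(path, "read_range", None)
    if callable(method):
        # Assumed signature for the proposed CloudPath-only API.
        return method(start_byte, end_byte)
    # Plain pathlib.Path (or anything with a compatible open()).
    with path.open("rb") as file:
        file.seek(start_byte)
        return file.read(end_byte - start_byte + 1)
```

The cost is visible in the sketch: callers have to go through a wrapper rather than writing the same pathlib code for both kinds of path.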
Thanks so much for the engagement here @pjbull ! We're very interested in this (our entire stack / approach to science basically relies on a step like this).
We have discussed CloudFile abstractions
I like the idea of a CloudFile abstraction as a way to still use pathlib-like syntax for local/remote files. The idea of importing AnyPath and having everything else just work is extremely enticing.
Yeah, we also like the idea; we're just a little wary of the implementation complexity and maintenance burden.
We'd be happy to consider a PR that does the following:
open implementation
io abstractions in the way we do for pathlib.Path
It may be the case that we decide it should be a separate repo/project that we have as an optional dependency and use if available, or we may decide it should be part of the cloudpathlib core.
I've been doing this direct streaming via the smart-open method in #264
Currently, if you want to read a file we download the whole file and then open it for reading. Most backends will have some way to do streaming reads. We may be able to improve the experience if we can do the same.
There may be some tricky bits with caching here. Can we stream to the user and then cache the streamed portion at the same time? Is there an async way that makes sense to do this?
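On the first question, one possible shape is a "tee" wrapper that hands bytes to the caller and appends the same bytes to a cache file as they go by. This is only a sketch under simplifying assumptions: `TeeReader` is a made-up name, reads are strictly sequential (no seek), and the cache is only trustworthy once the stream has been read to the end.

```python
import io


class TeeReader:
    """Stream from `source` while copying every byte read into `cache_file`.

    Hypothetical sketch; not how cloudpathlib's cache works today.
    """

    def __init__(self, source, cache_file):
        self._source = source
        self._cache = cache_file
        # True only once the whole stream has passed through the cache.
        self.complete = False

    def read(self, size=-1):
        if size == 0:
            return b""
        chunk = self._source.read(size)
        self._cache.write(chunk)
        # A read-to-end call, or an empty chunk, means the stream is exhausted.
        if size is None or size < 0 or not chunk:
            self.complete = True
        return chunk


source = io.BytesIO(b"abcdefghij")  # stands in for a streaming response body
cache = io.BytesIO()                # stands in for the local cache file
reader = TeeReader(source, cache)
first = reader.read(4)              # caller sees 4 bytes; cache holds the same 4
```

A partially read stream leaves a partial cache file, so a real implementation would need to either discard it or track which byte ranges are present, which is part of what makes this tricky.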