Closed Cheappie closed 2 years ago
I'm not sure about this, the Path
is very specifically just that a path, not a URL. This keeps it simple and ensures consistent behaviour across object storage implementations. Any additional URL parameters, e.g. host, credentials, ports, etc... are to be handled at the ObjectStore level, i.e. passed to it on construction.
To give an example of where ambiguity would arise, if I call the list API with a URL
instead of a prefix, what should the store do, should it ignore the extra arguments, should the returned Path also contain these arguments?
Could you perhaps expand a bit more on your use case?
Sure, let me briefly explain what I am trying to achieve. I have implemented custom TableProvider in order to embed communication with external service that performs data-source indexing and produces binary stream with complete listing over data-source for given parameters that are part of SQL query expression. In addition to that I have implemented custom ObjectStore that provides just get operations.
As you can see, what I would prefer most is to be able to provide binary blob instead of either Path or Url to get operation. But Url is closest thing that shares similarities with Path and is more flexible.
Well in my case with existing ObjectStore trait, I will have to convert my binary listing to text and I will have to attach arguments to every Url unfortunately.
To give an example of where ambiguity would arise, if I call the list API with a URL instead of a prefix, what should the store do, should it ignore the extra arguments, should the returned Path also contain these arguments?
Another layer of abstraction is my implementation of ObjectStore that embeds tiered storage, I don't know where would be better place to put that. Let me get to the point, I do need at least ObjectMeta struct in get operations of ObjectStore instead of just Path because I do need to invalidate tiered storage when last modified time doesn't match. I wouldn't mind if we could in addition to passing ObjectMeta to ObjectStore get ops, create new field within ObjectMeta like custom_attributes: Option<Box<[u8]>> or Option<Arc<[u8]>>
This might be a stupid question, but if you already have a custom TableProvider what is the motivation for using the ObjectStore trait at all? It feels like you're using it to abstract over something that fundamentally is not an object store?
Because I am using parquet format, that allows me to create physical execution plan based on this file format. And in order to provide data I need to communicate with existing implementation through ObjectStore.
What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>>
to ObjectMeta ?
Ok, so if I understand the issue correctly:
This is a very similar problem that IOx has and it historically solved this by not using DataFusion's parquet support and using the parquet crate directly, in particular it would:
Bytes
in memory or a file on diskparquet
crate directly to scan these "files" using a custom ExecutionPlan
A while back I created some tickets related to making DataFusion's abstractions more flexible, but I've not yet had time to finish it up and I would describe the interface currently as rather limiting.
To me the issue here is that DataFusion's ParquetExec
is very tightly coupled with both where the data is located and how it is fetched. There are two solutions to this in my mind:
parquet
so that it can be reused by custom ExecutionPlanWhat do you think?
FYI @crepererum @alamb
What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta ?
I think this is at odds with the objectives of the crate to provide a consistent abstraction across object stores, and so I would be extremely reticent to change the API in this way
I created https://github.com/apache/arrow-rs/issues/2241 to track adding support for request preconditions
I have taken a look into datafusion codebase and from what I have seen baking my own ExecutionPlan to replace ObjectStore would require quite large effort. Mainly rewriting ParquetExec, FileScanConfig, and some other components along the way that are bound to ObjectStore.
I feel like anyway we go, at the end there will be similar or yet another abstraction of ObjectStore
in DataFusion. That will have more relaxed semantics than the one that is being moved into arrow.
I like the idea of request preconditions, but It won't be sufficient for me. I've remembered that I would need to identify who has requested object from ObjectStore. It's good example of why It would be nice to have ability to pass dynamic attributes through something like Box<[u8]>
.
What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta ?
I think this is at odds with the objectives of the crate to provide a consistent abstraction across object stores, and so I would be extremely reticent to change the API in this way
Could you elaborate ?
What is wrong in providing ObjectMeta to get operation ? I mean that is the output of list operation. Or are you worried about custom attributes field ?
Ahh I see now why you do want to uphold existing interface. You just want to keep it simple, because either object stores or file system just require location to get object. Am I right ?
I think that there should be yet another ObjectStore trait in DataFusion, that should use output of list method as an input to get with some additional flexibility for dynamic attributes.
You just want to keep it simple, because either object stores or file system just require location to get object. Am I right ?
Yeah, you most definitely shouldn't need the object meta to request a file, although the request preconditions (#2241) would optionally allow restrictions based on a limited subset of the metadata, constrained by what is generally supported.
I think that there should be yet another ObjectStore trait in DataFusion, that should use output of list method as an input to get with some additional flexibility for dynamic attributes.
I agree that this should be handled at the DataFusion layer, although I'm not entirely sure what the interface should look like. Personally I would be reticent to overload the ObjectStore name, as I think that would get very confusing, but the broad idea of introducing more flexibility into the IO operators sounds like a good thing imo :+1:
One simple option might be to adapt the existing FormatReader
trait to this purpose or something? :thinking: Tbh what you really want is to be able to override the AsyncFileReader
that is passed to ParquetRecordBatchStreamBuilder
by ParquetOpener
. This feels like it should be imminently achievable without major rework. It also occurs to me that having separate ParquetExec, JsonExec, is basically redundant now that we have the FormatReader
abstraction.
If you like I'd be happy to raise an issue on DataFusion proposing something to this effect, and depending on feedback potentially bash it out for you? Also happy to leave this with you, just let me know. I'd very much like to help you get this over the line :smile:
Edit: I wrote up something https://github.com/apache/arrow-datafusion/issues/2992 PTAL and let me know what you think
My 2 cents regarding path vs URL/URI:
The address of an object within a store is relative, i.e. a path (or a path segment if you want) but a URL/URI is absolute. It is semantically kinda weird that a single object store (e.g. an S3 store configured for one account+bucket) answers URL-based requests. I think if you want something that abstracts over multiple stores, we should introduce another layer on top, e.g.:
trait ObjectStoreRegistry {
fn register(&mut self, name: &str, store: Arc<dyn ObjectStore>);
fn map(&self, url: Url) -> Option<(Arc<dyn ObjectStore> Path)>;
}
// usage
let mut registry = make_registry();
registry.register("s3_prod", make_store(...));
registry.register("s3_dev", make_store(...));
registry.register("local", make_store(...));
let (store, path) = registry.map("s3_dev://my/file.parquet".into()).unwrap();
store.get(path)...
Agreed, DataFusion in fact already has such an abstraction - https://docs.rs/datafusion/latest/datafusion/datasource/object_store/struct.ObjectStoreRegistry.html
Closing this as I believe it is now resolved, feel free to re-open if I am mistaken
Is your feature request related to a problem or challenge? Please describe what you are trying to do. I have implemented custom
TableProvider
andObjectStore
in datafusion, what I would like to do is to have more convenient interface that would allow me to pass dynamic parameters to ObjectStore. Currently ObjectStore accepts Path as parameter, and Path impl is a bit restrictive and provides guarantees that are unnecessary(at least for me).Describe the solution you'd like Replace Path internals, precisely exchange raw String for Url and then provide
AsRef<Url>
for Path.Describe alternatives you've considered Alternatively ObjectMeta could be provided to ObjectStore instead of just Path, that would allow me to perform some checks like whether last modified time matches. However It might be still good idea to switch internal Path representation to Url to make it more flexible.
Additional context