wjones127 opened 1 year ago
Is schema negotiation outside the scope of this protocol? If `get_schema()` contains a `Utf8View`, for example, is it the consumer's responsibility to do the cast, or can the consumer pass a schema with `Utf8View` columns as `Utf8` to `get_stream()` (or another method)?
> Is schema negotiation outside the scope of this protocol?

I think we can include that. I'd like to design that as part of the PyCapsule API first, so we match the semantics there.
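For illustration, here is one way a consumer could do such a cast itself, as a minimal sketch assuming a pyarrow build with string-view support; `normalize_views` is a hypothetical helper, not part of the proposal:

```python
import pyarrow as pa

def normalize_views(table: pa.Table) -> pa.Table:
    """Hypothetical consumer-side fallback: cast Utf8View columns to Utf8."""
    target = pa.schema(
        [f.with_type(pa.string()) if f.type == pa.string_view() else f
         for f in table.schema]
    )
    return table.cast(target)
```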
Haven't had time to work on this, but wanted to note a current pain point for users of the dataset API: there are no table statistics the caller can access, which leads to bad join orders. Some mentions of this here:

https://twitter.com/mim_djo/status/1740542585410814393
https://github.com/delta-io/delta-rs/issues/1838
Are we sure a blocking API like this would be palatable for existing execution engines such as Acero, DuckDB...? Of course, at worst the various method/function calls can be offloaded to a dedicated thread pool.
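For example, an async engine could wrap each blocking call along these lines (a sketch using only the standard library; `scanner` is assumed to expose the proposed methods):

```python
import asyncio

async def get_schema_async(scanner):
    loop = asyncio.get_running_loop()
    # Offload the blocking protocol call to the default thread pool.
    return await loop.run_in_executor(None, scanner.get_schema)
```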
> Are we sure a blocking API like this would be palatable?

Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?

Ideally all these methods are brief. Though I haven't discussed this in depth with implementors of query engines; I'd be curious for their thoughts.
> Are we sure a blocking API like this would be palatable?
>
> Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?

No, to the fact that these functions are synchronous.

> Ideally all these methods are brief.

I'm not sure. `get_partitions` will typically have to walk a filesystem, which can be long-ish, especially on large datasets or remote filesystems.
Perhaps `get_partitions(...) -> Iterable[AbstractArrowScanner]` would do it? Not sure if anybody is interested in asyncio for this, but an async iterator might work.
An `Iterable` would probably be better indeed. It would not solve the async use case directly, but we would at least allow producing results without blocking on the entire filesystem walk.
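As a sketch of that idea, a producer could yield scanners while the walk is still in progress (`list_partition_dirs` and `make_scanner` are hypothetical helpers):

```python
from typing import Iterable

def get_partitions(self) -> Iterable["AbstractArrowScanner"]:
    # Yield a scanner as soon as each partition is discovered, so the
    # consumer can start scanning before the filesystem walk finishes.
    for path in self.list_partition_dirs():  # hypothetical helper
        yield self.make_scanner(path)        # hypothetical helper
```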
### Describe the enhancement requested
Based on discussion in the 2023-08-30 Arrow community meeting. This is a continuation of https://github.com/apache/arrow/pull/35568 and https://github.com/apache/arrow/issues/33986.
We'd like to have a protocol for sharing unmaterialized datasets: one that exposes the schema, supports column selection and filter pushdown, and streams the resulting data over the Arrow C Data Interface.
This would provide an extensible connection between scanners and query engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow datasets (Parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end users employ their preferred query engine to load any supported dataset. From their perspective, usage might look like:
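For instance, a sketch relying on DuckDB's ability to query Python objects in the local scope by name (`load_dataset` is a hypothetical loader standing in for any producer implementing the protocol):

```python
import duckdb

# `table` could be an Iceberg, Delta Lake, Lance, or PyArrow dataset.
table = load_dataset("path/to/dataset")  # hypothetical loader

result = duckdb.sql("SELECT y FROM table WHERE x > 3").arrow()
```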
The protocol is largely invisible to the user. Behind the scenes, `duckdb` would call `__arrow_scanner__()` on `table` to get a scannable object. It would then pass down the column selection `['y']` and the filter `x > 3` to the scanner, and get the resulting data stream as input to the query.

### Shape of the protocol
The overall shape would look roughly like:
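A minimal sketch, assuming the method names discussed above (`__arrow_scanner__`, `get_schema`, `get_partitions`, `get_stream`); the exact signatures were still open:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional

class AbstractArrowScanner(ABC):
    """A scannable, unmaterialized dataset (illustrative, not final)."""

    @abstractmethod
    def get_schema(self):
        """Return the schema as an Arrow C Data Interface schema."""

    @abstractmethod
    def get_partitions(self, filter=None) -> Iterable["AbstractArrowScanner"]:
        """Lazily yield one scanner per partition matching the filter."""

    @abstractmethod
    def get_stream(self, columns: Optional[list[str]] = None, filter=None):
        """Return an Arrow C stream, applying the column selection and
        the filter (a Substrait extended expression)."""


class SupportsArrowScanner(ABC):
    """A dataset object exposing a scanner to consumers."""

    @abstractmethod
    def __arrow_scanner__(self) -> AbstractArrowScanner:
        ...
```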
Data and schema are returned as C Data Interface objects (see: #35531). Predicates are passed as Substrait extended expressions.
### Component(s)
Python