osopardo1 opened 1 year ago
In DataSource V2 there is also the possibility to build your own scan of the table, with more options than DataSource V1 (which we are currently using).
Maybe it's worth exploring the DSv2 API.
IMHO we need to explore the DataSource V2 API; we may end up dropping V1 entirely, since supporting both could mean too much conditional logic.
Yes, I agree. Do you think this can be done in the same PR #167, or is it better to do a workaround for Sampling and Limit Pushdown first and migrate everything to V2 in a separate issue?
Well, I prefer to separate the migration to the new versions of Spark/Delta from reworking the QbeastTable on top of DataSource V2. Migration would mean that everything compiles and runs without new problems. The rework is a complex task, because the DataSource SPI changed a lot even though it is still called V2. A good overview of the Spark 3.0 SPI can be found here: https://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/
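For context, the Spark 3 DSv2 read path is split into a chain of small interfaces (roughly `TableProvider` → `Table` → `ScanBuilder` → `Scan`). Below is a minimal, self-contained sketch of that shape using simplified stand-in traits, not the real `org.apache.spark.sql.connector` interfaces; all names here are illustrative:

```scala
// Simplified stand-ins for the DSv2 read-path interfaces.
// The real ones live in org.apache.spark.sql.connector.catalog/read.
trait Table { def name: String; def newScanBuilder(): ScanBuilder }
trait ScanBuilder { def pushFilters(filters: Seq[String]): ScanBuilder; def build(): Scan }
trait Scan { def description: String }

// A toy table whose ScanBuilder accumulates pushed-down filters, the way a
// Qbeast scan could receive sampling/limit information before building.
class ToyTable extends Table {
  val name = "toy"
  def newScanBuilder(): ScanBuilder = new ToyScanBuilder(Nil)
}

class ToyScanBuilder(pushed: Seq[String]) extends ScanBuilder {
  def pushFilters(filters: Seq[String]): ScanBuilder =
    new ToyScanBuilder(pushed ++ filters)
  def build(): Scan = new Scan {
    def description: String = s"scan(${pushed.mkString(",")})"
  }
}

val scan = new ToyTable().newScanBuilder().pushFilters(Seq("a > 1")).build()
```

The point is that each stage is a separate object, so pushdown state travels through the builder instead of a monolithic relation as in V1.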
I would like to share some thoughts on the Spark 3.x.x DataSource API V2.
@osopardo1, @cugni, @Jiaweihu08 Could it make sense to create a temporary DataFrame to copy the data being written, and then apply the algorithm we use now?
Thank you for the overview!
Very nice, a lot of code can be reused from `OTreeIndex` once the filters and everything else are pushed down.
One solution for the Writer API is to keep a fallback to Version 1, which is what we have implemented for the moment: the WriteBuilder returns a `V1Write`, which creates an `InsertableRelation` that calls our methods in `IndexedTable` for indexing and writing the DataFrame. I think we can migrate just the read features for now and consider moving everything else in the future.
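The V1 fallback described here can be sketched with simplified stand-in types (the real interfaces are Spark's `WriteBuilder`, `V1Write`, and `InsertableRelation`; the names and shapes below are illustrative, not the actual qbeast-spark code):

```scala
// Stand-in for Spark's Row and InsertableRelation (V1 write interface).
case class Row(values: Map[String, Any])
trait InsertableRelation { def insert(data: Seq[Row], overwrite: Boolean): Unit }

// Stand-in for IndexedTable: records what was written so we can inspect it.
class IndexedTable {
  var written: Seq[Row] = Nil
  def save(data: Seq[Row], append: Boolean): Unit =
    written = if (append) written ++ data else data
}

// The write builder falls back to a V1-style relation that delegates the
// actual indexing/writing to IndexedTable, mirroring the fallback above.
class QbeastWriteBuilder(table: IndexedTable) {
  def buildForV1Write(): InsertableRelation = new InsertableRelation {
    def insert(data: Seq[Row], overwrite: Boolean): Unit =
      table.save(data, append = !overwrite)
  }
}
```

Usage: `new QbeastWriteBuilder(t).buildForV1Write().insert(rows, overwrite = false)` appends `rows` through the table's own save path, which is what lets the V2 writer reuse the existing V1 machinery.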
Noted. We will merge #167 first and then migrate to V2. We can also split the migration work in two:
Technically I prefer 4 steps:
Small changes will probably be easier to review and to demonstrate.
Plan looks good to me. 👍
I'll keep this issue open for future development plans. We need to rethink the design, the utility, and the properties involved.
Related to #166 .
Qbeast-Spark should stay compatible with the latest versions of Delta Lake and Apache Spark to benefit from new features and major upgrades. The upgrade to Delta 2.1.0 and Spark 3.3.0 reveals a set of interesting pushdown operations that could be empowered by the Qbeast metadata.
We should:
- Review `QbeastTableImpl`.
- Review `QbeastSparkSessionExtension`.

The deletion of the Sample Optimisation would also affect #68; this requires some more insight.

*Thoughts on #68*:
The reading from `OTreeIndex` (or from any other class that is involved, such as `ParquetFileFormat`) is correct. Given a block's `minWeight` and `maxWeight`, we can determine how many rows we can read from it. The only thing we need to find out is where those records are filtered.
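The row-count reasoning above can be sketched as a small function. This is a hypothetical illustration, assuming record weights are normalized to [0, 1] and spread uniformly between a block's `minWeight` and `maxWeight`; `rowsRead` and `elementCount` are made-up names, not qbeast-spark API:

```scala
// Estimate how many of a block's rows a sample of fraction f would read,
// under the (assumed) uniform-weight model: a sample of fraction f keeps
// every record whose weight is <= f.
def rowsRead(minWeight: Double, maxWeight: Double, elementCount: Long, f: Double): Long =
  if (f <= minWeight) 0L                 // sample cutoff below the block: skip it entirely
  else if (f >= maxWeight) elementCount  // cutoff above the block: read it whole
  else math.round(elementCount * (f - minWeight) / (maxWeight - minWeight))
```

For example, a block with `minWeight = 0.2`, `maxWeight = 0.4`, and 100 rows contributes nothing to a 10% sample, all 100 rows to a 50% sample, and roughly half its rows to a 30% sample. The open question in the comment is exactly where in the read path this per-block filtering should happen.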