apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.31k stars 1.19k forks source link

Make all SchemaProvider trait APIs async #10339

Open metesynnada opened 6 months ago

metesynnada commented 6 months ago

Is your feature request related to a problem or challenge?

Currently, only table API is async, but register_table, deregister_table, table_names, and table_exist can be also async. Also, most of the usage is under async methods, thus there shouldn't be a big issue implementing it.

However, this can be a huge change for the downstream usage, this is why I need some insides before me or someone else implements it.

cc @Dandandan @alamb

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

matthewmturner commented 6 months ago

This one is actually quite interesting. I hadnt gotten around to it yet but I was going to propose having both sync and async apis for SchemaProvider (I think theres valid use cases for both). I just recently ran into an issue where I wanted the ctx.table_provider and therefore table method on SchemaProvider to be sync (as part of logical plan serialization which is sync). How would you feel about having both sync and async APIs?

liurenjie1024 commented 6 months ago

+1 for this proposal. We ran into similar problems when implementing datafusion/iceberg integrations: https://github.com/apache/iceberg-rust/pull/324

It's more nature to allow catalog/schema operations to be async.

alamb commented 6 months ago

Yes, I agree allowing catalog APIs to be async makes sense to me

One rationale against async, as I recall from @tustvold , was to encourage people to implement efficient catalog implementations (e.g. for many cases calling a network call for table_exists might be slow)

However, my personal opinion is that such encouragement can be done via documentation and if people want to implement RPC network calls during planning then the APIs shouldn't stop them

I think the biggest challenge is, as @metesynnada hints at above, the viral nature of async -- if we make such APIs async then everywhere they are called must also be be async -- I haven't looked at how far down the stack that is but it could be substantail.

An alternate approach might be to implement, via some hackery and tokio channels, an struct that implements the SchemaProvider without changes (sync) but can call async methods (though that would block the runtime thread 🤔 )

tustvold commented 6 months ago

I seem to remember omitting the methods other than table due to the sheer size of the resulting change, I have no objection to making them async