apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Arrow-to-JDBC #20730

Open asfimport opened 5 years ago

asfimport commented 5 years ago

ARROW-1780 reads a query from a JDBC data source and converts the ResultSet to an Arrow VectorSchemaRoot.  However, there is no built-in adapter for writing an Arrow VectorSchemaRoot back to the database.

ARROW-3966 adds JDBC field metadata:

Reporter: Michael Pigott / @mikepigott

Note: This issue was originally created as ARROW-4144. Please see the migration documentation for further details.

asfimport commented 4 years ago

Chen / @dota17: Is this necessary? Arrow is designed for in-memory data analysis, where the data won't be modified.

Recently I have been paying a lot of attention to Arrow, and I want to make a contribution to the project.

If this issue is necessary, I am happy to resolve it.

asfimport commented 4 years ago

Michael Pigott / @mikepigott: Hi, and thanks for taking a look at this! I saw Arrow as a tool that could be used to transfer data between systems, and not just for in-memory analysis. However, I don't need this today, and if nobody else has asked for it, then there might be more valuable JIRA tickets than mine for you to dedicate your time to?

Thanks again! Mike

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: @xhochy have you come across a use-case for writing to JDBC sources?

asfimport commented 4 years ago

Uwe Korn / @xhochy: Yes, the use case would be writing large pandas.DataFrames to a database layer that only has performant JDBC drivers. Personally, all my JDBC sources are read-only, so I never wrote a write-to-JDBC function, but other people will use these technologies with more access rights. I have used the "pyarrow -> Arrow Java -> JDBC" path successfully with Apache Drill and Denodo. I have also heard that some people use it together with Amazon Athena, where a performant INSERT might be interesting (https://docs.aws.amazon.com/athena/latest/ug/insert-into.html), as the JDBC driver currently seems to be the most performant option.

asfimport commented 4 years ago

Gleb: Our team is also in need of this feature.

We are currently implementing cross-language data processing inside Kubernetes, where a DB component is responsible for pushing/pulling database data. Arrow was chosen as the data format for its efficiency and out-of-the-box gRPC support. The only thing we lack when working with JDBC sources is the ability to convert data from Arrow back to JDBC.

asfimport commented 1 year ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

wgtmac commented 1 year ago

I am looking for a similar feature and came across this issue. I know we have supported JDBC to Arrow but not in the inverse direction. Do we have any plan to support it? @lidavidm

lidavidm commented 1 year ago

https://github.com/apache/arrow-adbc/tree/main/java/driver/jdbc can do it but needs assistance generating the SQL

https://github.com/apache/arrow/blob/main/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcParameterBinder.java requires you to set up the PreparedStatement and everything and solely automates the work of converting Arrow data to JDBC statement parameters

If one of those works, I think we should document the possibilities and close this issue
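For reference, driving `JdbcParameterBinder` looks roughly like the sketch below. The table name, column names, and connection are hypothetical; as noted above, the binder only fills in the `?` parameters of a `PreparedStatement` row by row, while the SQL text and batching remain the caller's job. (A sketch based on the class's public builder API, not a vetted implementation.)

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

import org.apache.arrow.adapter.jdbc.JdbcParameterBinder;
import org.apache.arrow.vector.VectorSchemaRoot;

public class ArrowToJdbcSketch {
  // Writes one VectorSchemaRoot batch into a hypothetical two-column table.
  static void writeBatch(Connection conn, VectorSchemaRoot root) throws Exception {
    // The caller supplies the INSERT statement; the binder only maps
    // Arrow column values onto the "?" parameters.
    try (PreparedStatement stmt =
        conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")) {
      JdbcParameterBinder binder =
          JdbcParameterBinder.builder(stmt, root)
              .bindAll() // bind parameters 1..n to columns 0..n-1 in order
              .build();
      while (binder.next()) { // advance one row, setting every parameter
        stmt.addBatch();
      }
      stmt.executeBatch();
    }
  }
}
```

Note that this requires the `arrow-jdbc` and `arrow-vector` artifacts on the classpath and a live JDBC connection, so it is illustrative rather than directly runnable here.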

wgtmac commented 1 year ago

> https://github.com/apache/arrow-adbc/tree/main/java/driver/jdbc can do it but needs assistance generating the SQL
>
> https://github.com/apache/arrow/blob/main/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcParameterBinder.java requires you to set up the PreparedStatement and everything and solely automates the work of converting Arrow data to JDBC statement parameters
>
> If one of those works, I think we should document the possibilities and close this issue

Neither of those works for my case. My goal is to wrap a set of VectorSchemaRoots into a JDBC ResultSet on the read path.

lidavidm commented 1 year ago

Ah. Interesting. Sorry, I misunderstood.

There is stuff like that in the JDBC driver but not exposed for public use. What is the use case?

wgtmac commented 1 year ago

To provide standard JDBC access to a database or engine that speaks Arrow vectors natively.

lidavidm commented 1 year ago

Ok, cool. (So this is like an embeddable engine? I assume otherwise you could use Flight SQL.)

Your best bet is to copy or refactor the code that is part of the JDBC driver. There isn't a separate adapter for this but it would make sense to refactor things into arrow-jdbc if they are useful more widely.

lidavidm commented 1 year ago

This is also different from this original issue, since this issue was about writing Arrow data to a database via JDBC

wgtmac commented 1 year ago

Did you mean this? https://github.com/apache/arrow/blob/main/java/flight/flight-sql-jdbc-driver/src/main/java/org/apache/arrow/driver/jdbc/ArrowFlightJdbcVectorSchemaRootResultSet.java

lidavidm commented 1 year ago

Yes, code like that. (I suppose the Avatica dependency means we may not want to add it to arrow-jdbc, but it's already public - does it sort of work? We can talk about refactoring it if it meets your needs.)

wgtmac commented 1 year ago

Yes, that seems like what I am searching for. I need some time to read through it before reaching a concrete conclusion. Thanks for your help!

wgtmac commented 1 year ago

@lidavidm Can we add a separate flight-sql-jdbc-core module and move the common code there?

The issue is that flight-sql-jdbc-driver has relocated a lot of classes including VectorSchemaRoot: https://github.com/apache/arrow/blob/main/java/flight/flight-sql-jdbc-driver/pom.xml#L193. This makes it difficult to directly develop other JDBC drivers based on it.
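For context, the relocation in question is configured via the Maven Shade Plugin in the driver's pom.xml; an abbreviated, illustrative fragment is shown below (the actual file relocates many more packages, and the exact shaded package name here is an assumption). The practical consequence is that a relocated `org.apache.arrow.vector.VectorSchemaRoot` is a different class from the caller's unshaded copy, so external code cannot pass its own roots into the driver's internals.

```xml
<!-- Abbreviated fragment in the style of flight-sql-jdbc-driver/pom.xml;
     the shadedPattern shown here is illustrative, not the literal value. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- After relocation, classes such as VectorSchemaRoot live under a
           new package and are incompatible with the unshaded originals. -->
      <relocation>
        <pattern>org.apache.arrow</pattern>
        <shadedPattern>org.apache.arrow.driver.jdbc.shaded.org.apache.arrow</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

A separate, unshaded "core" module would sidestep this, since downstream JDBC drivers could then depend on the ordinary Arrow classes directly.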

lidavidm commented 1 year ago

If the ResultSet implementation (and any other functionality?) can be cleanly separated out and is useful more broadly, then that sounds reasonable to me.

wgtmac commented 1 year ago

I will try it once I have some time.