apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add `ObjectStore` support via SQL #1930

Open matthewmturner opened 2 years ago

matthewmturner commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am working towards making datafusion-cli a powerful local tool for ad-hoc data analysis. The first step was #1875, which enables defining a local "database" that runs on startup via a .datafusionrc file. As a second step, I would like to be able to connect to object stores, such as S3, directly from SQL. That will of course require adding s3 as a feature to datafusion-cli, but that feature is useless unless ObjectStores can be registered. Below is the current behaviour:

❯ CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 's3://bucket/t.csv';
Internal("No suitable object store found for s3")
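The error suggests that the store is looked up by the scheme of the location and that nothing is registered for `s3`. A minimal sketch of that kind of lookup, assuming a registry keyed by URI scheme (`SchemeRegistry` and its methods are hypothetical names used for illustration, not DataFusion's actual types):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object store registry keyed by URI scheme.
/// The value is just a description string; a real registry would hold store objects.
#[derive(Default)]
struct SchemeRegistry {
    stores: HashMap<String, String>,
}

impl SchemeRegistry {
    /// Register (or replace) the store used for a given scheme, e.g. "s3".
    fn register(&mut self, scheme: &str, store: &str) {
        self.stores.insert(scheme.to_string(), store.to_string());
    }

    /// Resolve the store for a location such as `s3://bucket/t.csv`.
    fn resolve(&self, location: &str) -> Result<&str, String> {
        // Assumption: locations without a scheme are treated as local files.
        let scheme = match location.split_once("://") {
            Some((scheme, _rest)) => scheme,
            None => "file",
        };
        self.stores
            .get(scheme)
            .map(|store| store.as_str())
            .ok_or_else(|| format!("No suitable object store found for {}", scheme))
    }
}
```

With only a `file` entry registered, resolving `s3://bucket/t.csv` fails with the message shown above; once something is registered under `s3`, the same lookup succeeds. That registration step is what this issue asks to expose.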

Describe the solution you'd like

I would like to be able to register an ObjectStore just from SQL. Given that ObjectStore is a DataFusion concept, I was thinking that we could add a function such as register_object_store rather than a new SQL statement.

So it would look something like

Default credentials

❯   register_object_store('s3');

Minio

❯   register_object_store('s3', ACCESS_KEY, SECRET_KEY, PROVIDER, ENDPOINT);

matthewmturner commented 2 years ago

@seddonm1 @yjshen @houqp FYI - in case you have thoughts on this.

matthewmturner commented 2 years ago

Actually, I'm not sure how well those parameters in register_object_store will generalize to ObjectStore implementations other than S3, so now I'm not sure whether a general function like that could be used.

matthewmturner commented 2 years ago

Maybe my objective could be achieved with some command line options instead. For example:

Default credentials

$ datafusion-cli --object-store s3

Minio

$ datafusion-cli --object-store s3 --access-key KEY --secret-key ABC --provider PROVIDER --endpoint ENDPOINT

@houqp @yjshen @seddonm1 do you have a view on whether ObjectStore registration can be done via SQL or if this should be part of datafusion-cli?
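To make the command line variant concrete, here is a rough sketch of collecting those flags into a struct. It assumes every flag takes exactly one value and uses a hand-rolled parser over the standard library only. `ObjectStoreArgs` and `parse_object_store_args` are hypothetical names, and none of these flags exist in datafusion-cli today.

```rust
/// Hypothetical container for the flags proposed above.
#[derive(Debug, Default, PartialEq)]
struct ObjectStoreArgs {
    object_store: Option<String>,
    access_key: Option<String>,
    secret_key: Option<String>,
    provider: Option<String>,
    endpoint: Option<String>,
}

/// Parse `--flag value` pairs; unknown flags and missing values are errors.
fn parse_object_store_args<I>(args: I) -> Result<ObjectStoreArgs, String>
where
    I: IntoIterator<Item = String>,
{
    let mut parsed = ObjectStoreArgs::default();
    let mut iter = args.into_iter();
    while let Some(flag) = iter.next() {
        // Pick the field that this flag fills in.
        let slot = match flag.as_str() {
            "--object-store" => &mut parsed.object_store,
            "--access-key" => &mut parsed.access_key,
            "--secret-key" => &mut parsed.secret_key,
            "--provider" => &mut parsed.provider,
            "--endpoint" => &mut parsed.endpoint,
            other => return Err(format!("unknown option: {}", other)),
        };
        // Each flag must be followed by its value.
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        *slot = Some(value);
    }
    Ok(parsed)
}
```

In a binary this would be fed with `std::env::args().skip(1)`. Leaving the credential fields as `Option` keeps the "default credentials" invocation (`--object-store s3` alone) valid.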

houqp commented 2 years ago

I think it can be done through both, because the secret key credentials and endpoint can be provided through environment variables as well. In that case, the user will only need to provide the s3 path in the SQL query.
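A small sketch of that fallback, assuming explicit values (for example from CLI flags) take precedence over the environment. `S3Settings` and `resolve_s3_settings` are hypothetical names. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names, while `AWS_ENDPOINT` is an assumed name for the endpoint variable. The environment is passed in as a closure so the logic can be exercised without touching the real process environment.

```rust
/// Hypothetical resolved S3 settings.
#[derive(Debug, PartialEq)]
struct S3Settings {
    access_key: String,
    secret_key: String,
    endpoint: Option<String>,
}

/// Prefer explicitly supplied values and fall back to an environment lookup.
fn resolve_s3_settings<F>(
    access_key: Option<String>,
    secret_key: Option<String>,
    endpoint: Option<String>,
    env: F,
) -> Result<S3Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    let access_key = access_key
        .or_else(|| env("AWS_ACCESS_KEY_ID"))
        .ok_or_else(|| "missing access key".to_string())?;
    let secret_key = secret_key
        .or_else(|| env("AWS_SECRET_ACCESS_KEY"))
        .ok_or_else(|| "missing secret key".to_string())?;
    // Assumed variable name; the endpoint stays optional (e.g. only set for Minio).
    let endpoint = endpoint.or_else(|| env("AWS_ENDPOINT"));
    Ok(S3Settings {
        access_key,
        secret_key,
        endpoint,
    })
}
```

In a real program the last argument would be something like `|key| std::env::var(key).ok()`. With the variables set, the SQL side needs nothing beyond the `s3://` location.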

turbo1912 commented 1 year ago

@matthewmturner any progress on this one? If you are not working on it still, I would like to take a stab at it

seddonm1 commented 1 year ago

I think this repo is largely deprecated in favour of https://github.com/apache/arrow-rs/tree/master/object_store

matthewmturner commented 1 year ago

> @matthewmturner any progress on this one? If you are not working on it still, I would like to take a stab at it

@turbo1912 Haven't been able to work on this, go for it!