alamb opened this issue 1 year ago
I'm happy to take on this issue if no one else is already working on it!
Enhancement: instead of relying only on the file extension (as the current implementation does), we could take some inspiration from how DuckDB loads/imports data and let the user indicate the format to use (https://duckdb.org/docs/data/csv).

E.g. `select ... from read_csv('filename', options)` instead of plain `select .. from 'filename'`.

This would be helpful in cases where files are named like "hitdata.tsv" (https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-overview.html?lang=en)
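For reference, something close to this already exists in DataFusion's Rust API; a minimal sketch (assuming the `datafusion` and `tokio` crates and a local `hitdata.tsv`) of the programmatic equivalent:

```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // unlike plain `select ... from 'filename'`, the reader options let us
    // say this .tsv file is really tab-separated CSV
    let df = ctx
        .read_csv(
            "hitdata.tsv",
            CsvReadOptions::new()
                .delimiter(b'\t')
                .file_extension(".tsv"),
        )
        .await?;
    df.show().await?;
    Ok(())
}
```

The proposal is essentially to expose this same capability from SQL.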
I would love to see a function like `read_csv` or maybe `read_file('filename')`
👍
IMHO, these are table functions. I wonder whether datafusion supports table functions now? ref. https://github.com/apache/arrow-datafusion/issues/3773
I agree with @unconsolable

> I would love to see a function like `read_csv` or maybe `read_file('filename')` 👍
@alamb Is there any progress here? I'd like to try this.
Hi @holicc -- no, there is no progress on this issue that I know of.

Given the conversation above, it seems like there are two options:

1. Add a table function like `read_file(...)`
2. Resolve file names directly as table names (like `datafusion-cli` does)

Have you given any thought to the approach to take?
@alamb Maybe both options are good. Like DuckDB (see documentation: http://duckdb.org/docs/archive/0.8.1/data/csv/overview): with `read_file`, you might need to specify file parse options, e.g. `read_csv('simple.csv', delim='|', header=True)`. Resolving file names directly like `datafusion-cli` does also makes it convenient for us.

Having both makes sense to me.
I would say making a table function is likely a large piece of design work, as there is no pre-existing model to follow for table functions (we probably need some sort of registry, and a pass that expands them out to some sort of `TableProvider`).
I found out that we are using the `sqlparser` crate to parse SQL. Therefore, we can modify the parsed statements with the `VisitorMut` trait, collect all tables, and replace them with a new table name. Finally, we can use the `register_csv` API to register those tables. This way may be a little tricky.
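A minimal sketch of that idea, assuming sqlparser's `visitor` feature is enabled; the "contains a slash" heuristic and the `t0` replacement name are made up for illustration:

```rust
use std::ops::ControlFlow;

use sqlparser::ast::{visit_relations_mut, Ident, ObjectName};
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = r#"SELECT * FROM "data/hitdata.tsv""#;
    let mut statements = Parser::parse_sql(&GenericDialect, sql).unwrap();

    let mut files = Vec::new();
    for stmt in &mut statements {
        let _ = visit_relations_mut(stmt, |table: &mut ObjectName| {
            let name = table.to_string();
            // crude heuristic: treat relation names containing '/' as file paths
            if name.contains('/') {
                // remember the path so it can later be registered under the
                // new name, e.g. via SessionContext::register_csv
                files.push(name);
                *table = ObjectName(vec![Ident::new("t0")]);
            }
            ControlFlow::<()>::Continue(())
        });
    }
    println!("rewritten: {statements:?}");
    println!("files to register: {files:?}");
}
```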
Yes, we could. I would recommend thinking a little more holistically about table functions: specifically, adding them to the registry (https://docs.rs/datafusion/latest/datafusion/execution/registry/trait.FunctionRegistry.html) and then teaching our SQL planner how to look up those functions.
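For the planner side, here is roughly where such a call shows up when parsing with the `sqlparser` crate (a sketch, not DataFusion code): a function-style relation in the FROM clause arrives as a `TableFactor::Table` carrying `args`, which is the natural hook for a registry lookup.

```rust
use sqlparser::ast::{SetExpr, Statement, TableFactor};
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT * FROM read_parquet('test.parquet')";
    let stmts = Parser::parse_sql(&GenericDialect, sql).unwrap();
    if let Statement::Query(query) = &stmts[0] {
        if let SetExpr::Select(select) = query.body.as_ref() {
            if let TableFactor::Table { name, args: Some(args), .. } =
                &select.from[0].relation
            {
                // a planner could use `name` to look up a registered
                // table function and hand it `args`
                println!("table function {name}, args: {args:?}");
            }
        }
    }
}
```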
IMHO, a `TableFunction` is essentially a `TableProvider`. Therefore, we can make a new trait that returns a `TableProvider`, execute the table function in `create_relation`, and use `TableScan` to build a `LogicalPlan`.
Add a new trait:

```rust
#[async_trait]
pub trait TableFunction: Sync + Send {
    async fn execute(
        &self,
        table: impl Into<OwnedTableReference>,
        state: &SessionState,
    ) -> Result<Arc<dyn TableProvider>>;
}
```
and add a new method to `FunctionRegistry`:

```rust
pub trait FunctionRegistry {
    // ...

    /// Returns a reference to the table function named `name`.
    fn table_function(&self, name: &str) -> Result<Arc<dyn TableFunction>>;
}
```
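A self-contained toy of how such a registry lookup might work (the `SimpleRegistry` type and the `read_csv` entry are invented for illustration; the real trait would be async and return `Result<Arc<dyn TableProvider>>`):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// stand-in for the proposed trait
trait TableFunction: Send + Sync {
    fn name(&self) -> &str;
}

struct ReadCsv;
impl TableFunction for ReadCsv {
    fn name(&self) -> &str {
        "read_csv"
    }
}

struct SimpleRegistry {
    table_functions: HashMap<String, Arc<dyn TableFunction>>,
}

impl SimpleRegistry {
    fn table_function(&self, name: &str) -> Option<Arc<dyn TableFunction>> {
        self.table_functions.get(name).cloned()
    }
}

fn main() {
    let mut table_functions: HashMap<String, Arc<dyn TableFunction>> = HashMap::new();
    table_functions.insert("read_csv".to_string(), Arc::new(ReadCsv));
    let registry = SimpleRegistry { table_functions };

    // the SQL planner would perform this lookup when it encounters
    // `SELECT * FROM read_csv('foo.csv')`
    let f = registry.table_function("read_csv").expect("not registered");
    println!("resolved table function: {}", f.name());
}
```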
I think one big difference is that a table function can take some number of parameters:

```sql
SELECT * FROM read_parquet('test.parquet');
```

The table function is `read_parquet(..)`, which gets a string argument `test.parquet`. The SQL parser will give the argument as an `Expr`.

https://docs.rs/sqlparser/latest/sqlparser/ast/enum.TableFactor.html#variant.TableFunction
So perhaps we could make the `TableFunction` trait like:

```rust
#[async_trait]
pub trait TableFunction: Sync + Send {
    async fn execute(
        &self,
        arg: Expr,
        state: &SessionState,
    ) -> Result<Arc<dyn TableProvider>>;
}
```

to take that arbitrary argument in.
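For what it's worth, unpacking such an `Expr` argument might look roughly like this (a sketch assuming the `datafusion-expr`/`datafusion-common` crates; `path_from_arg` is a made-up helper, and real code would handle more literal types):

```rust
use datafusion_common::{DataFusionError, Result, ScalarValue};
use datafusion_expr::{lit, Expr};

// made-up helper: extract the file path from the single argument of
// e.g. `read_parquet('test.parquet')`
fn path_from_arg(arg: &Expr) -> Result<String> {
    match arg {
        Expr::Literal(ScalarValue::Utf8(Some(path))) => Ok(path.clone()),
        other => Err(DataFusionError::Plan(format!(
            "expected a string literal argument, got {other}"
        ))),
    }
}

fn main() -> Result<()> {
    let arg = lit("test.parquet");
    assert_eq!(path_from_arg(&arg)?, "test.parquet");
    Ok(())
}
```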
And what about multiple arguments? E.g. `read_csv('blah.csv', delimiter = ';', ... )`
Good point @timvw -- we would probably want to allow that in the API. I tried it out, and it does appear that the sql parser supports that kind of syntax 👍

```
❯ select foo from read_parquet('foo.parquet', 'bar');
Error during planning: table 'datafusion.public.read_parquet' not found
```
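Presumably the trait sketched above would then accept a list of expressions rather than a single one; a hedged variant (same caveats as before -- this is a proposal sketch, not shipped DataFusion API):

```rust
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::datasource::TableProvider;
use datafusion::execution::context::SessionState;
use datafusion_common::Result;
use datafusion_expr::Expr;

// variant of the proposed trait taking any number of arguments
#[async_trait]
pub trait TableFunction: Sync + Send {
    async fn execute(
        &self,
        args: Vec<Expr>,
        state: &SessionState,
    ) -> Result<Arc<dyn TableProvider>>;
}
```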
@alamb I'm having trouble inferring the schema because I can't get a `SessionState` from the `ContextProvider`. Can you help me?

```rust
let listing_options = CsvReadOptions::default().to_listing_options(&self.config);
let url = ListingTableUrl::parse(file_path)?;
let cfg = ListingTableConfig::new(url)
    .with_listing_options(listing_options)
    .with_schema(Arc::new(arrow_schema::Schema::empty()));
// FIXME How to get a SessionState?
cfg.infer_schema(state);
let table = ListingTable::try_new(cfg)?;
let source = Arc::new(DefaultTableSource::new(Arc::new(table)));
```
Hi @holicc -- I don't think it is possible to do this at the SQL parser level (as the `datafusion` crate depends on the `datafusion-sql` crate, not the reverse).

Thus I think the resolution from a table name to a `ListingTable` may have to happen later in the process. In the Suggested Solution Sketch above I tried to suggest taking inspiration from how the dynamic lookup works today in datafusion-cli. Perhaps we can follow the same model here.
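For context, that datafusion-cli model is (roughly) a catalog whose schema provider treats any requested table name as a file path. A simplified sketch of that shape -- schema inference and state access are deliberately elided; see `datafusion-cli/src/catalog.rs` for the real implementation:

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::catalog::schema::SchemaProvider;
use datafusion::datasource::TableProvider;

struct DynamicFileSchemaProvider;

#[async_trait]
impl SchemaProvider for DynamicFileSchemaProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn table_names(&self) -> Vec<String> {
        // file-backed "tables" cannot be enumerated up front
        vec![]
    }

    async fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        // the real implementation parses `name` as a ListingTableUrl and
        // builds a ListingTable with an inferred schema here
        let _ = name;
        None
    }

    fn table_exist(&self, _name: &str) -> bool {
        // optimistically claim existence; resolution happens in `table`
        true
    }
}

fn main() {
    // would be installed on a catalog so unknown names fall through to it
    let _provider: Arc<dyn SchemaProvider> = Arc::new(DynamicFileSchemaProvider);
}
```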
cc @goldmedal here is an issue that describes supporting reading `s3://` style urls as tables:

```sql
select * from 's3://foo/bar.parquet'
```
Probably a good first step would be to move the `DynamicFileCatalogProvider` (https://github.com/apache/datafusion/blob/088ad010a6ceaa6a2e810d418a2370e45acf3d54/datafusion-cli/src/catalog.rs#L79) into the core somewhere (but not registering it with `SessionContext`).

Then a second step would be to add an option (like the information_schema tables) that would enable installing the `DynamicFileCatalogProvider` during construction.
> Probably a good first step would be to move the `DynamicFileCatalogProvider` into the core somewhere (but not registering it with `SessionContext`). Then a second step would be to add an option (like the information_schema tables) that would enable installing the `DynamicFileCatalogProvider` during construction.
Hi @alamb, I'm working on it. I think I'll create PRs for each step.
After a rough survey, I found I need to move not only `DynamicFileCatalogProvider` but also some object_store-related code. I plan to place `DynamicFileCatalogProvider` in `datafusion/core/src/catalog` and the object_store-related parts in `datafusion/core/src/datasource/file_format`. As for the config option, I haven't surveyed it yet; it seems I can refer to how information_schema is implemented.
@alamb Spark SQL syntax works like so:

```sql
select * from parquet.`s3://foo-bar`
```

what do you think?
This also works when `.parquet` files are in nested folders, and the prefix could extend to other formats: ``select * from iceberg.`s3://foo-bar` ``

@edmondop I think this would be great to add as an example / thing to implement as an extension.
I wanted to confirm I understood correctly which options we are picking. It seems to me the following are viable:

We are taking option 1 right now, is that right @alamb?
I am not quite sure what to do here to be honest
Update: @goldmedal made a PR for this issue: https://github.com/apache/datafusion/pull/10745

However, the initial PR brings many dependencies (like aws crates) into datafusion core, which is likely not great. I had some suggestions on how we could split up the code to keep the dynamic file provider in the core while keeping aws etc. out: https://github.com/apache/datafusion/pull/10745#issuecomment-2175817937
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Similarly to https://github.com/apache/arrow-datafusion/issues/4580, I think systems built with datafusion would like to allow their users to quickly explore a parquet file with minimal typing.
Today you have to type a verbose `CREATE EXTERNAL TABLE ...` command.

It is critical that this feature can be enabled/disabled so that DataFusion can provide read-only access (rather than access to the file system, as that would be a security hole).
I am marking this as a good first issue because I think all the code needed exists and there is a solution sketch below -- it should be a matter of coding that doesn't require huge existing knowledge of the datafusion codebase, and it would be a good exercise in getting familiar.
@unconsolable added this ability into `datafusion-cli` as part of https://github.com/apache/arrow-datafusion/pull/4838 (❤️)

**Describe the solution you'd like**
I would like to be able to select directly from files (parquet, or other) from any datafusion session context, controlled by a setting. For example, `SELECT * FROM 'foo.parquet'`.
**Suggested Solution Sketch**

1. Add a new config setting `files_as_tables`, similar to `information_schema`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/common/src/config.rs#L167-L169
2. Add code to make a `ListingTable` in `resolve_table_ref`: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/src/execution/context.rs#L1551-L1560 (follow the model in https://github.com/apache/arrow-datafusion/pull/4838/files#diff-6353c2268d4d11abf8c1b8804a263db74a3b765a7302fc61caea3924256b52c7R142-R155)
3. Move the implementation from datafusion-cli: remove the provider added in https://github.com/apache/arrow-datafusion/pull/4838 and use the new setting instead: https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion-cli/src/main.rs#L100
4. Add slt tests, similar to existing ones (should be able to refer to existing .parquet / .csv files in testing directories): https://github.com/apache/arrow-datafusion/blob/f9b72f4230687b884a92f79d21762578d3d56281/datafusion/core/tests/sqllogictests/test_files/information_schema.slt#L46
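If the steps above land, the end state would presumably allow something like the following from any embedding application (a sketch: the file path is illustrative, and the opt-in setting from step 1 does not exist yet):

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    // today this works in datafusion-cli (via its dynamic catalog provider)
    // but fails on a vanilla SessionContext; the sketch above would make it
    // opt-in for any session
    let ctx = SessionContext::new();
    let df = ctx.sql("SELECT * FROM 'foo.parquet' LIMIT 10").await?;
    df.show().await?;
    Ok(())
}
```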
**Describe alternatives you've considered**
**Additional context**

Here is how information schema works, for reference.
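A minimal demonstration of that model (this is existing DataFusion behavior; the opt-in config flag pattern is what `files_as_tables` above proposes to mirror):

```rust
use datafusion::error::Result;
use datafusion::prelude::{SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // information_schema is disabled by default and enabled via config --
    // the same shape proposed for files_as_tables above
    let config = SessionConfig::new().with_information_schema(true);
    let ctx = SessionContext::new_with_config(config);
    let df = ctx.sql("SELECT * FROM information_schema.tables").await?;
    df.show().await?;
    Ok(())
}
```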