Open findepi opened 3 months ago
I think we should support table function similar to https://duckdb.org/docs/sql/functions/overview.html
We could not only get the function name but also function_type
, return_type
, parameters
... and so on
select * from datafusion_functions()
We could not only get the function name but also
function_type
,return_type
,parameters
... and so on
absolutely, more info than just function name is a good thing to have
I think we should support table function similar to https://duckdb.org/docs/sql/functions/overview.html
The SQL standard way of presenting function information seems to be via information_schema.routines
view. This is similar to how tables' information is queryable. But then we still have SHOW TABLES on top of information_schema.tables
for easier access.
We can certainly present this information as a table function (instead of, or additionally to). What would be the main advantage of doing so?
We can certainly present this information as a table function (instead of, or additionally to). What would be the main advantage of doing so?
We could easily apply any query on the table function like
SELECT DISTINCT ON(function_name)
function_name,
function_type,
return_type,
parameters,
parameter_types,
description
FROM duckdb_functions()
WHERE function_type = 'scalar'
AND function_name LIKE 'b%'
ORDER BY function_name;
and we can get a subset of information with select
Unlike the syntax here, it is completely different so we need additional syntax support. It is also not easy to select the subset of information
https://docs.snowflake.com/en/sql-reference/sql/show-functions
SHOW FUNCTIONS [ LIKE '<pattern>' ]
[ IN
{
ACCOUNT |
CLASS <class_name> |
DATABASE |
DATABASE <database_name> |
SCHEMA |
SCHEMA <schema_name> |
<schema_name>
}
]
We could easily apply any query on the table function like
This is a good point.
Would we get similar benefits if we implemented information_schema.routines
view (which SQL spec happens to define)?
In any case, writing a full query to information_schema.*
or to a table function, is a few key strokes. That's probably why we have SHOW TABLES
statement even though the information is also available via information_schema.tabes
and is more queryable there (one can select desired columns or apply filters).
Summing up we have discussed these implementation options
information_schema.routines
(as defined by the SQL spec)SHOW FUNCTIONS
(similar to SHOW TABLES
and similar (but not identical) to some query engines like Snowflake, Trino/Presto or MySQL)datafusion_functions()
table function (similar to DuckDB's duckdb_functions()
)Which one or which ones do we want to implement? How do arrive at this decision to make sure it's not questioned later by some project maintainer(s) that I failed to involve in the discussion?
I like the idea of datafusion_functions()
as a table
. I also prefer existing system (Postgres, DuckDB is my preference) than spec.
How do arrive at this decision to make sure it's not questioned later by some project maintainer(s) that I failed to involve in the discussion?
We could wait for a few days for more opionion. Don't worry about being questioned later on. Datafusion is a continuously evolving and breaking change friendly project, if there comes out a better idea we could review it again
It is my opinion we should follow the same pattern as implemented for SHOW TABLES
as it seems to work well and allows for easier integration into third-party tools (like BI tools) that may query such tables for information.
Specifically, I think that means:
information_schema
schema (potentially with some additional custom columns for DataFusion like help_text
), etcSHOW FUNCTIONS
as a query on top of the information schemaWe could add a datafusion_functions
table function in datafusion-cli or other downstream implementation perhaps
(maybe this is wishful thinking) automatically create function documentation by scraping the code (as we do for configuration values)
I filed https://github.com/apache/datafusion/issues/12432 to track this idea
Is there any update on this issue? In my downstream project, Wren AI, we would register various custom functions (UDF, UDAF, etc.) dynamically. This feature would be useful in our scenario for getting the available functions in the current session.
Currently, I can only access functions through SessionState::scalar_functions()
or similar methods. However, obtaining parameter type or return type information is challenging.
As @alamb suggested, I think we can begin by implementing information_schema
.
We are pretty close to completing the project to have the documentation programatically available
We are pretty close to completing the project to have the documentation programatically available
- [EPIC] Automatically generate all function documentation from code #12740 I think this wuold be straightforward to add now
I checked the documents. It's very good for human reading but I think it doesn't have the type information (parameters and return_type). They're important for the third-party tool to integrate with DataFusion-based applications.
Following the information_schema
way, I think we need routines and parameters (They're the Postgres tables but I think we don't need to implement all the column of theirs).
@findepi are you still working on this? I would like to push this feature. If you don't have bandwidth, I could help to implement the information_schema parts.
I paused this for now. @goldmedal go for it.
I created https://github.com/apache/datafusion/pull/13255 for the routines
table. Just organize the related works here:
Is your feature request related to a problem or challenge?
I as a user would want to see a list of available functions.
Describe the solution you'd like
Describe alternatives you've considered
No response
Additional context
No response