apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.39k stars 1.21k forks source link

List available functions (`SHOW FUNCTIONS`) #12144

Open findepi opened 3 months ago

findepi commented 3 months ago

Is your feature request related to a problem or challenge?

I as a user would want to see a list of available functions.

Describe the solution you'd like

SHOW FUNCTIONS [ LIKE <pattern> ];

Describe alternatives you've considered

No response

Additional context

No response

jayzhan211 commented 3 months ago

I think we should support table function similar to https://duckdb.org/docs/sql/functions/overview.html

We could not only get the function name but also function_type, return_type, parameters ... and so on

select * from datafusion_functions()

findepi commented 3 months ago

We could not only get the function name but also function_type, return_type, parameters ... and so on

absolutely, more info than just function name is a good thing to have

I think we should support table function similar to https://duckdb.org/docs/sql/functions/overview.html

The SQL standard way of presenting function information seems to be via information_schema.routines view. This is similar to how tables' information is queryable. But then we still have SHOW TABLES on top of information_schema.tables for easier access.

We can certainly present this information as a table function (instead of, or additionally to). What would be the main advantage of doing so?

jayzhan211 commented 2 months ago

We can certainly present this information as a table function (instead of, or additionally to). What would be the main advantage of doing so?

We could easily apply any query on the table function like

SELECT DISTINCT ON(function_name)
    function_name,
    function_type,
    return_type,
    parameters,
    parameter_types,
    description
FROM duckdb_functions()
WHERE function_type = 'scalar'
  AND function_name LIKE 'b%'
ORDER BY function_name;

and we can get a subset of information with select

Unlike the syntax here, it is completely different so we need additional syntax support. It is also not easy to select the subset of information

https://docs.snowflake.com/en/sql-reference/sql/show-functions

SHOW FUNCTIONS [ LIKE '<pattern>' ]
               [ IN
                    {
                      ACCOUNT                       |

                      CLASS <class_name>            |

                      DATABASE                      |
                      DATABASE <database_name>      |

                      SCHEMA                        |
                      SCHEMA <schema_name>          |
                      <schema_name>
                    }
               ]
findepi commented 2 months ago

We could easily apply any query on the table function like

This is a good point. Would we get similar benefits if we implemented information_schema.routines view (which SQL spec happens to define)?

In any case, writing a full query to information_schema.* or to a table function, is a few key strokes. That's probably why we have SHOW TABLES statement even though the information is also available via information_schema.tabes and is more queryable there (one can select desired columns or apply filters).

Summing up we have discussed these implementation options

  1. information_schema.routines (as defined by the SQL spec)
  2. SHOW FUNCTIONS (similar to SHOW TABLES and similar (but not identical) to some query engines like Snowflake, Trino/Presto or MySQL)
  3. datafusion_functions() table function (similar to DuckDB's duckdb_functions())

Which one or which ones do we want to implement? How do arrive at this decision to make sure it's not questioned later by some project maintainer(s) that I failed to involve in the discussion?

jayzhan211 commented 2 months ago

I like the idea of datafusion_functions() as a table. I also prefer existing system (Postgres, DuckDB is my preference) than spec.

How do arrive at this decision to make sure it's not questioned later by some project maintainer(s) that I failed to involve in the discussion?

We could wait for a few days for more opionion. Don't worry about being questioned later on. Datafusion is a continuously evolving and breaking change friendly project, if there comes out a better idea we could review it again

alamb commented 2 months ago

It is my opinion we should follow the same pattern as implemented for SHOW TABLES as it seems to work well and allows for easier integration into third-party tools (like BI tools) that may query such tables for information.

Specifically, I think that means:

  1. implement the information_schema schema (potentially with some additional custom columns for DataFusion like help_text), etc
  2. Implement SHOW FUNCTIONS as a query on top of the information schema
  3. (maybe this is wishful thinking) automatically create function documentation by scraping the code (as we do for configuration values)
alamb commented 2 months ago

We could add a datafusion_functions table function in datafusion-cli or other downstream implementation perhaps

alamb commented 2 months ago

(maybe this is wishful thinking) automatically create function documentation by scraping the code (as we do for configuration values)

I filed https://github.com/apache/datafusion/issues/12432 to track this idea

goldmedal commented 1 month ago

Is there any update on this issue? In my downstream project, Wren AI, we would register various custom functions (UDF, UDAF, etc.) dynamically. This feature would be useful in our scenario for getting the available functions in the current session.

Currently, I can only access functions through SessionState::scalar_functions() or similar methods. However, obtaining parameter type or return type information is challenging.

As @alamb suggested, I think we can begin by implementing information_schema.

alamb commented 1 month ago

We are pretty close to completing the project to have the documentation programatically available

goldmedal commented 1 month ago

We are pretty close to completing the project to have the documentation programatically available

I checked the documents. It's very good for human reading but I think it doesn't have the type information (parameters and return_type). They're important for the third-party tool to integrate with DataFusion-based applications.

Following the information_schema way, I think we need routines and parameters (They're the Postgres tables but I think we don't need to implement all the column of theirs).

@findepi are you still working on this? I would like to push this feature. If you don't have bandwidth, I could help to implement the information_schema parts.

findepi commented 1 month ago

I paused this for now. @goldmedal go for it.

goldmedal commented 3 weeks ago

I created https://github.com/apache/datafusion/pull/13255 for the routines table. Just organize the related works here: