apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Set-returning UDFs #1604

Open lightjacket opened 2 years ago

lightjacket commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. I'm looking at add an inverted index on my data and use it as the first step in more complicated data transformations. There's particular optimization that I think might be better done outside of the DataFusion engine.

Describe the solution you'd like I'm imagining something like the following, where inverted_index_search returns an arbitrary number of rows.

SELECT inverted_index_search('some query');
+------+
| id   |
+------+
|  1   |
|  2   |
+------+

I'd like to be able to write a udf (or udaf, or something new) that can return as many rows as it needs.

Describe alternatives you've considered I looked at having my UDF return an array, but once I had the array I could not figure out how to unnest it. Issue https://github.com/apache/arrow-datafusion/issues/212 looks to cover that, but I think it might be better to just have the function return multiple rows directly.

Additional context Apologies if I missed something that is already available for this.

jimexist commented 2 years ago

@lightjacket in my opinion a better idea is to return a list in one row?

lightjacket commented 2 years ago

@Jimexist thanks for the suggestion! I did consider that as an alternative but would need to unnest the result in order to be able to join against additional tables like so: Set-returning

SELECT * FROM 
(SELECT inverted_index_search('some query') as item_id) t
LEFT JOIN additional_data ON t.item_id = additional_data.item_id;

Single row w/ a list

SELECT * FROM 
(SELECT unnest(inverted_index_search('some query')) as item_id) t
LEFT JOIN additional_data ON t.item_id = additional_data.item_id;

So if the unnest function existed, I would be unblocked. I think if I could create a set-returning UDF I could also implement unnest myself as a user and wouldn't need to wait on a built-in unnest.

If there is an alternative to unnest to get that join I am looking for though, that would absolutely work. I just haven't found it yet if it exists.

jimexist commented 2 years ago

let's add unnest as it is more general:

lightjacket commented 2 years ago

I'd argue that unnest is less general (I'd think unnest would be a specific implementation of a set-returning UDF), but the unnest function certainly solves my use-case so if that's the preference, it certainly works for me. Thank you!