Open wj-stack opened 2 months ago
I agree that an example of merging dataframes from different table providers and then finding the distinct values would be a nice addition. Currently there is union, but the schemas must be identical for that to work, iirc. (See https://github.com/apache/datafusion/issues/12650 for an enhancement request to address that.)
There is also distinct_on in the DataFrame API, which is just a wrapper for the LogicalPlanBuilder::distinct_on fn.
Is your feature request related to a problem or challenge?
I currently have two data sources, one stored in Parquet format and the other in memory, and I need to implement a scan function that reads from both. I tried using UnionExec, but it does not give the results I expect, especially with aggregation functions like count. Maybe I should use SortPreservingMergeExec instead, but there are too few examples of how to use it. I would appreciate an example that reads from multiple data sources. Because these sources may contain duplicate rows, I would also be happy to see an example of deduplication based on multiple fields.
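To make the expected behaviour concrete, here is a minimal sketch in plain Rust of what I mean by "union, then deduplicate on multiple fields". It deliberately uses only the standard library and is not DataFusion API: the Row type, the (id, region) key, and the function names are all hypothetical, and keeping the first row seen per key is an assumption about the desired semantics.

```rust
use std::collections::HashSet;

/// Hypothetical row type standing in for a record from either source.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: i64,
    pub region: String,
    pub value: f64,
}

/// Concatenate both sources (a plain union that keeps duplicates), then keep
/// only the first row seen for each (id, region) key. Rows from the first
/// slice win when the same key appears in both sources.
pub fn union_distinct_on_key(parquet_rows: &[Row], memory_rows: &[Row]) -> Vec<Row> {
    let mut seen: HashSet<(i64, String)> = HashSet::new();
    let mut out = Vec::new();
    for row in parquet_rows.iter().chain(memory_rows.iter()) {
        // `insert` returns true only the first time a key is seen.
        if seen.insert((row.id, row.region.clone())) {
            out.push(row.clone());
        }
    }
    out
}

/// Made-up sample data: the key (1, "eu") appears in both sources.
pub fn sample_sources() -> (Vec<Row>, Vec<Row>) {
    let parquet_rows = vec![
        Row { id: 1, region: "eu".to_string(), value: 10.0 },
        Row { id: 2, region: "eu".to_string(), value: 20.0 },
    ];
    let memory_rows = vec![
        Row { id: 1, region: "eu".to_string(), value: 11.0 },
        Row { id: 1, region: "us".to_string(), value: 30.0 },
    ];
    (parquet_rows, memory_rows)
}
```

With the sample data, a plain union has four rows, so a count over it would be 4, while the deduplicated result has three rows. That second count is the one I would like to get from a scan over both sources.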
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response