Open: alamb opened this issue 2 months ago
It would be really cool if you could inspect parquet metadata with this
This seems like a really fun thing to work on. If no one else is working on it I'd love to take it on. I really like the metadata idea.
Contributions very welcome :)
IMO it would be best to first integrate the DynamicFileProvider in one PR and then do a follow-on for metadata.
I also just realized that there may be some existing metadata functionality for datafusion, although it's not clear to me from the docs whether it is only in datafusion-cli or a built-in datafusion function that we could also use. If it is specific to datafusion-cli then it would be great if we could add that.
Also, you could potentially get some inspiration for additional metadata capabilities from duckdb.
BTW we can take inspiration from / copy outright the parquet_metadata table function for parquet from datafusion-cli (also modeled on duckdb): https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata
I would like to suggest creating those functions in their own crate (perhaps datafusion-functions-parquet?) -- it could be in the datafusion-dft repo initially for convenience, but I think eventually the goal should be that dft just be focused on integration rather than actually implementing such features.
In fact maybe once dft gets good enough we could remove the parquet_metadata function from datafusion-cli entirely 🤔
Sorry I missed this -- it is only in datafusion-cli
Implementation is here: https://github.com/apache/datafusion/blob/257e1409eca81cfff024ecc5e2567e9f67e6b5a3/datafusion-cli/src/functions.rs#L317-L459
I suggest we file a second ticket for implementing parquet_metadata and other duckdb metadata functions
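For context on what a function like parquet_metadata ultimately has to read: a plain Parquet file ends with the Thrift-encoded footer metadata, then a 4-byte little-endian footer length, then the magic bytes PAR1. Below is a minimal, stdlib-only sketch of locating that footer. The function name footer_metadata_len is made up for illustration and is not part of datafusion or the parquet crate; it does not decode the metadata itself, and it ignores files with an encrypted footer.

```rust
/// Magic bytes found at both the start and the end of a plain Parquet file.
const PARQUET_MAGIC: &[u8] = b"PAR1";

/// Hypothetical helper (not a datafusion or parquet-crate API).
///
/// Given the full bytes of a file, returns the length in bytes of the
/// Thrift-encoded footer metadata, or `None` if the bytes do not look
/// like a plain Parquet file.
pub fn footer_metadata_len(bytes: &[u8]) -> Option<u32> {
    // Smallest possible layout: leading magic (4) + footer length (4)
    // + trailing magic (4).
    if bytes.len() < 12 {
        return None;
    }

    // Check the magic bytes at both ends of the file.
    let trailing_magic = &bytes[bytes.len() - 4..];
    if &bytes[..4] != PARQUET_MAGIC || trailing_magic != PARQUET_MAGIC {
        return None;
    }

    // The 4 bytes just before the trailing magic hold the footer length
    // as a little-endian u32.
    let t = &bytes[bytes.len() - 8..bytes.len() - 4];
    let len = u32::from_le_bytes([t[0], t[1], t[2], t[3]]);

    // The footer has to fit between the leading magic and the 8-byte trailer.
    if (len as usize) > bytes.len() - 12 {
        return None;
    }
    Some(len)
}
```

A real implementation would then decode those bytes into row group and column chunk statistics, which is the part worth reusing from the datafusion-cli implementation linked above rather than rewriting.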
I agree with putting it in its own crate. Like @alamb said, I also think that dft could be used as an incubator of sorts. For example, I have taken that approach in my WASM function factory PR. I have no intention of keeping that in this repo, but it's quite convenient while it matures.
Sounds good -- I'll go ahead and get these both assigned to myself and then start cracking on it in the next few days :)
take
edit: looks like github actions is not set up to auto-assign like datafusion 😅
How about you take https://github.com/datafusion-contrib/datafusion-dft/issues/148 and I'll try this one? DataFusion 42 was just released and I very much want this particular feature in dft
I want queries like this to work:
This works great in datafusion-cli:
It currently doesn't in dft
Once datafusion 42.0.0 is released, we can likely use the DynamicFileProvider that @goldmedal added in https://github.com/apache/datafusion/pull/11035
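To make the idea concrete, here is a toy, stdlib-only sketch of the fallback behaviour being asked for: a table reference that is not registered is treated as a file path, and the format is inferred from its extension. Everything in it (ToyCatalog, Resolved, infer_format, the set of extensions) is hypothetical and only models the concept. It is not the DynamicFileProvider API from the PR above.

```rust
use std::collections::HashMap;

/// What a table reference resolves to in this toy model.
#[derive(Debug, PartialEq)]
pub enum Resolved {
    /// A table registered explicitly under this name (value is its location).
    Registered(String),
    /// An unregistered name treated as a file path to read directly.
    File { path: String, format: &'static str },
    /// Neither a registered table nor a recognizable file path.
    NotFound,
}

/// Hypothetical catalog with a dynamic-file fallback (illustration only).
pub struct ToyCatalog {
    tables: HashMap<String, String>,
}

impl ToyCatalog {
    pub fn new() -> Self {
        ToyCatalog { tables: HashMap::new() }
    }

    /// Register a table name that points at some location.
    pub fn register(&mut self, name: &str, location: &str) {
        self.tables.insert(name.to_string(), location.to_string());
    }

    /// Registered tables win; otherwise fall back to treating the
    /// reference as a file path if its extension is recognized.
    pub fn resolve(&self, name: &str) -> Resolved {
        if let Some(location) = self.tables.get(name) {
            return Resolved::Registered(location.clone());
        }
        match infer_format(name) {
            Some(format) => Resolved::File { path: name.to_string(), format },
            None => Resolved::NotFound,
        }
    }
}

/// Infer a file format from the extension; the list here is an assumption.
fn infer_format(path: &str) -> Option<&'static str> {
    let dot = path.rfind('.')?;
    match path[dot + 1..].to_ascii_lowercase().as_str() {
        "parquet" => Some("parquet"),
        "csv" => Some("csv"),
        "json" => Some("json"),
        _ => None,
    }
}
```

In this model, resolving "data/hits.parquet" with nothing registered yields a File entry with format "parquet", while a plain unknown name such as "unknown_table" yields NotFound. Checking registered tables first keeps existing behaviour unchanged for anything the user set up explicitly.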