datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0
28 stars 8 forks

CLI tools #62

Open Jefffrey opened 3 months ago

Jefffrey commented 3 months ago

Implement some CLI binaries for working with ORC files such as reading schema, getting stats, etc.

Tools to have:

Also need to ensure these are tested as part of CI

Some references:

klangner commented 2 months ago

Is this issue taken? I think it would be nice to have some examples of how to use this library (or the CLI tools), since right now it is not obvious how to use it. Maybe we could create a list of tools/examples here (or as separate issues?) so people could work on them. For my use case, I'm interested in tools for:

I would also be interested in reading them from S3, but that is probably out of scope for this project?

I'm also willing to help with those implementations

waynexia commented 2 months ago

A similar tool is parquet-tools; I have used it several times when debugging Parquet files.

klangner commented 2 months ago

Yes, it looks nice. I can create this CLI tool (I will probably start with a simpler version first) if it fits this project and nobody is working on it yet.

Jefffrey commented 2 months ago

Hey @klangner, I assigned it to myself initially because I made an initial commit as mentioned in the issue, but it isn't one of my priorities right now. Feel free to enhance the existing tool or add a separate one if you have a different use case :+1: