domoritz / arrow-tools

A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet
Apache License 2.0

Add example to documentation #57

Closed felix-hh closed 11 months ago

felix-hh commented 11 months ago

Hi there!

I was streaming very large files with curl, converting them to Parquet, and sending them back to S3 through Linux pipes (without storing anything on disk), and I had some trouble figuring out how to pipe input through these tools.

For example, I tried cat file | json2parquet >> out and got an error, then tried cat file | json2parquet - myout.parquet and got an error as well.

Eventually I figured out that what you need to do is cat file | json2parquet /dev/stdin /dev/stdout | gzip -c >> myparquet.parquet.gz. I worked this out by digging through previous commits and issues like #3. An example would have been a great time saver!
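A minimal sketch of that working invocation wrapped in a reusable shell function. The function name `json_to_parquet_stream` and the `JSON2PARQUET` override variable are hypothetical and not part of arrow-tools. Only the `/dev/stdin /dev/stdout` argument form comes from the command above. The commented usage lines use placeholder file names, and the `aws s3 cp -` variant assumes the AWS CLI is installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of arrow-tools): read JSON on stdin and
# write Parquet on stdout by passing the device paths explicitly.
# JSON2PARQUET is an assumed override for the binary name; by default the
# json2parquet found on PATH is used.
json_to_parquet_stream() {
    "${JSON2PARQUET:-json2parquet}" /dev/stdin /dev/stdout
}

# Example usage with placeholder names, mirroring the pipeline above:
#   cat file | json_to_parquet_stream | gzip -c >> myparquet.parquet.gz
#
# Streaming from a URL to S3 without touching disk (SOURCE_URL and
# DEST_S3_URI are placeholders you would set yourself):
#   curl -sS "$SOURCE_URL" | json_to_parquet_stream | aws s3 cp - "$DEST_S3_URI"
```

Keeping the device paths in one place means the rest of the pipeline reads like any other filter, which is roughly what the failed `json2parquet >> out` attempt was expecting.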

In any case, I think adding a small use case example to the documentation would be useful for people like me who are not so familiar with piping in Linux. It took me 30 minutes to figure out, although looking back it is just elementary Linux knowledge.

Happy to contribute the changes myself if the maintainers are on board!

Cheers, Felix

domoritz commented 11 months ago

Absolutely. There are already some examples at https://github.com/domoritz/arrow-tools/tree/main/crates/csv2parquet#examples but we should link to that from the main readme and add examples that use stdin. I think it would also be good to improve the API around using stdin so you don't have to say it explicitly. Contributions are very welcome.
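To illustrate what not having to name stdin explicitly could feel like, here is a rough sketch as a shell wrapper. It is only an assumption about a possible interface, not how json2parquet behaves: the wrapper name `j2p`, the `-` convention, and the `JSON2PARQUET` override variable are all hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of arrow-tools).
# Usage: j2p [INPUT|-] [OUTPUT|-]
# A missing argument or "-" is mapped to /dev/stdin or /dev/stdout, and the
# real tool is then called with explicit paths.
j2p() {
    input=${1:--}    # default to "-" when no input is given
    output=${2:--}   # default to "-" when no output is given
    if [ "$input" = "-" ]; then
        input=/dev/stdin
    fi
    if [ "$output" = "-" ]; then
        output=/dev/stdout
    fi
    # JSON2PARQUET is an assumed override for the binary name.
    "${JSON2PARQUET:-json2parquet}" "$input" "$output"
}

# Example usage with placeholder file names:
#   cat file | j2p > out.parquet
#   cat file | j2p - myout.parquet
```

A wrapper like this only papers over the interface. Handling the defaults inside the tools themselves would avoid relying on /dev/stdin and /dev/stdout being available.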

felix-hh commented 11 months ago

Just submitted a PR, feel free to give feedback or modify as desired. I chose to reference the examples section of csv2parquet from every other tool as the interfaces are very similar, and most people like me won't think of checking csv2parquet to learn how to use json2parquet.

Improving the API sounds like a good idea, if people have the bandwidth! I have a use case that is more critical for me at this point and I'll point it out in a separate issue.