Closed felix-hh closed 11 months ago
Absolutely. There are already some examples at https://github.com/domoritz/arrow-tools/tree/main/crates/csv2parquet#examples but we should link to that from the main readme and add examples that use stdin. I think it would also be good to improve the API around using stdin so you don't have to say it explicitly. Contributions are very welcome.
Just submitted a PR, feel free to give feedback or modify as desired. I chose to reference the examples section of csv2parquet from every other tool, as the interfaces are very similar and most people, like me, won't think of checking csv2parquet to learn how to use json2parquet.
Improving the API sounds like a good idea, if people have the bandwidth! I have a use case that is more critical for me at this point and I'll point it out in a separate issue.
Hi there!
I was streaming very large files from curl, converting them to parquet and sending them back to s3 via Linux piping (without storing anything on disk) and I had some trouble figuring out how to pipe input through these tools.
For example, I first tried
cat file | json2parquet >> out
and got an error. Then I tried
cat file | json2parquet - myout.parquet
and got an error as well. Eventually I figured out that what you need to do is
cat file | json2parquet /dev/stdin /dev/stdout | gzip -c >> myparquet.parquet.gz
I figured this out by digging through previous commits and issues like #3. An example would have been a great time saver!
In any case, I think adding a small use case example to the documentation would be useful for people like me who are not so familiar with piping in Linux. It took me 30 minutes to figure out, although looking back it is just elementary Linux knowledge.
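To illustrate why the /dev/stdin and /dev/stdout trick works, here is a minimal sketch that runs without any of the arrow-tools installed. The convert function is a hypothetical stand-in (not part of this repo) for any tool that only accepts an input path and an output path. It assumes a Linux-like system where /dev/stdin and /dev/stdout exist.

```shell
#!/bin/sh
# Hypothetical stand-in for a converter that only takes file paths:
# it reads the file named by $1 and writes the result to the file named by $2.
convert() {
  tr '[:lower:]' '[:upper:]' < "$1" > "$2"
}

# Passing /dev/stdin and /dev/stdout as the "paths" lets the tool sit in the
# middle of a pipeline, so nothing is written to disk along the way.
# gunzip is only here to make the gzip-compressed result readable again.
result=$(printf 'hello\n' | convert /dev/stdin /dev/stdout | gzip -c | gunzip -c)
echo "$result"
```

Swapping the stand-in for json2parquet gives the same shape as the command I ended up with above.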
Happy to contribute the changes myself if the maintainers are on board!
Cheers, Felix