Open JohnPaton opened 2 years ago
Hi @JohnPaton
At work we plan to extend the CLI to handle parquet files. Are you interested in a PR?
It would be a post-processing CLI script, something like airbase-to-parquet data-path parquet-path, with a command-line option to define a partition for the dataset.
We'll base our work on #38, as Poetry allows console scripts that depend on an extra. That way, pip install airbase[parquet] will install the required dependencies and the new CLI.
We could further integrate with the existing CLI and download utilities, but we can discuss the details in the eventual PR.
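To make the proposal concrete, here is a rough sketch of what such a script could look like. It is only an illustration: it assumes pyarrow as the parquet backend (nothing in this thread fixes that choice), and the function names and the --partition option are hypothetical. The pyarrow import is deferred so the module can still be imported when the extra is not installed.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the hypothetical airbase-to-parquet script."""
    parser = argparse.ArgumentParser(
        prog="airbase-to-parquet",
        description="Convert downloaded CSV files into a parquet dataset.",
    )
    parser.add_argument("data_path", type=Path, help="directory with CSV files")
    parser.add_argument("parquet_path", type=Path, help="output dataset directory")
    parser.add_argument(
        "--partition",
        action="append",
        metavar="COLUMN",
        help="column to partition by (hypothetical option, may be repeated)",
    )
    return parser


def csv_to_parquet(data_path: Path, parquet_path: Path, partition_cols: List[str]) -> None:
    """Rewrite the CSV files under data_path as a parquet dataset."""
    # Assumption: pyarrow comes from the optional extra, so import it lazily.
    import pyarrow.dataset as ds

    dataset = ds.dataset(str(data_path), format="csv")
    ds.write_dataset(
        dataset,
        str(parquet_path),
        format="parquet",
        partitioning=partition_cols or None,  # None writes an unpartitioned dataset
        partitioning_flavor="hive",  # column=value directory names
    )


def main(argv: Optional[List[str]] = None) -> int:
    """Entry point that a console script could point to."""
    args = build_parser().parse_args(argv)
    csv_to_parquet(args.data_path, args.parquet_path, args.partition or [])
    return 0
```

Repeating --partition would create nested partition directories in the order given. Which columns make sensible partitions is still an open question.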
Hey, I think more output formats would be great and parquet is an obvious choice, though I guess we'll need to make some smart choices about partitioning.
Maybe we could start a separate module for post-processing, and add it to the CLI in a follow-up PR?
Sure. This can be done with a plugin architecture that lets a separate package add post-processing formats by declaring entry points. I have done something like this on two projects before. I'll prepare a draft PR to illustrate the approach.
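For readers unfamiliar with entry points, a minimal sketch of the discovery side using the standard library's importlib.metadata follows. The group name airbase.postprocess and the converter signature are assumptions made up for this illustration, not something the project defines.

```python
import sys
from importlib import metadata
from typing import Callable, Dict

# Hypothetical entry point group that plugin packages would register under.
ENTRY_POINT_GROUP = "airbase.postprocess"

# Assumed converter signature: (input_path, output_path) -> None.
Converter = Callable[[str, str], None]


def discover_converters(group: str = ENTRY_POINT_GROUP) -> Dict[str, Converter]:
    """Load every converter that installed packages declare under `group`."""
    if sys.version_info >= (3, 10):
        entry_points = metadata.entry_points(group=group)
    else:
        # On Python 3.8 and 3.9, entry_points() returns a dict keyed by group.
        entry_points = metadata.entry_points().get(group, [])
    return {ep.name: ep.load() for ep in entry_points}


def get_converter(name: str) -> Converter:
    """Look up one converter by format name, e.g. "parquet"."""
    converters = discover_converters()
    if name not in converters:
        available = ", ".join(sorted(converters)) or "none installed"
        raise KeyError(f"no converter named {name!r} (available: {available})")
    return converters[name]
```

A plugin package would then declare an entry point such as parquet = "some_package:convert" under that group in its own packaging metadata, and the core package would pick it up without importing the plugin's dependencies itself.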
Alright, I have no experience in this direction so I'm happy to see what you come up with!
Right now we only support CSV, which is what the portal provides. We could convert to other file formats (parquet, avro) on the fly for easier processing later.