JohnPaton / airbase

🌬 An easy downloader for the AirBase air quality data.
https://airbase.readthedocs.io
MIT License

Support additional output types #31

Open JohnPaton opened 2 years ago

JohnPaton commented 2 years ago

Right now we only support CSV, which is what the portal provides. We could convert to other file formats (parquet, avro) on the fly for easier processing later.

avaldebe commented 2 years ago

Hi @JohnPaton

At work we plan to extend the CLI to handle parquet files. Are you interested in a PR?

It would be a post-processing CLI script, something like airbase-to-parquet data-path parquet-path, with a command-line option to define the partitioning of the dataset.

We'll base our work on #38, as Poetry allows console scripts that depend on an extra. That way, pip install airbase[parquet] will install the required dependencies and the new CLI.
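To make the proposal concrete, here is a minimal sketch of what such a script could look like. It is only an illustration under a few assumptions: the --partition option name, the function names and the use of pyarrow.dataset are placeholders rather than anything decided in this issue, and it assumes pyarrow can read the downloaded CSV files directly. pyarrow is imported lazily, so the module itself still imports when the parquet extra is not installed.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for the hypothetical airbase-to-parquet script."""
    parser = argparse.ArgumentParser(
        prog="airbase-to-parquet",
        description="Convert downloaded AirBase CSV files into a parquet dataset.",
    )
    parser.add_argument("data_path", type=Path, help="directory with the CSV files")
    parser.add_argument("parquet_path", type=Path, help="output dataset directory")
    # Hypothetical option name: repeat it to partition by several columns.
    parser.add_argument(
        "--partition",
        action="append",
        metavar="COLUMN",
        help="column to partition the dataset by (may be given more than once)",
    )
    return parser


def convert(data_path: Path, parquet_path: Path, partition: Optional[List[str]]) -> None:
    """Rewrite every CSV file under data_path as a parquet dataset."""
    # Lazy import: pyarrow is assumed to come from the optional "parquet" extra.
    import pyarrow.dataset as ds

    dataset = ds.dataset(data_path, format="csv")
    ds.write_dataset(
        dataset,
        parquet_path,
        format="parquet",
        partitioning=partition or None,  # None writes an unpartitioned dataset
        partitioning_flavor="hive",      # column=value directory names
    )


def main(argv: Optional[List[str]] = None) -> None:
    """Entry point that a console script could point to."""
    args = build_parser().parse_args(argv)
    convert(args.data_path, args.parquet_path, args.partition)
```

With this sketch, calling main(["data", "parquet", "--partition", "country"]) would write one directory per distinct value of a column named country, which is a placeholder column name here.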

We could integrate further with the existing CLI and download utilities, but we can discuss the details in the eventual PR...

JohnPaton commented 2 years ago

Hey, I think more output formats would be great, and parquet is an obvious choice, though I guess we'll need to make some smart choices about partitioning.

JohnPaton commented 2 years ago

Maybe we could start a separate module for post-processing, and add to the CLI in a follow-up PR?

avaldebe commented 2 years ago

Sure. This can be done with a plugin architecture, which allows post-processing formats to be added from a separate package by declaring entry points. I have done something like this on two projects before. I will prepare a draft PR to illustrate the approach.
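As a rough illustration of the entry-point idea (not the eventual PR), the core package could discover writers registered by other installed packages along these lines. The group name airbase.postprocess and the function names are hypothetical. A plugin package would declare its writer under that group in its own packaging metadata, for example in a [tool.poetry.plugins."airbase.postprocess"] table when using Poetry.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict, Optional

# Hypothetical entry-point group name; nothing in airbase defines it today.
GROUP = "airbase.postprocess"


def discover_writers(group: str = GROUP) -> Dict[str, Callable]:
    """Load every writer that installed packages registered under the group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a mapping keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def available_formats(
    builtin: Optional[Dict[str, Callable]] = None, group: str = GROUP
) -> Dict[str, Callable]:
    """Combine built-in writers with plugin writers; plugins win on a name clash."""
    writers = dict(builtin or {})
    writers.update(discover_writers(group))
    return writers
```

The CLI could then offer the keys of available_formats() as choices for an output-format option, so installing a plugin package would be enough to make a new format appear.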

JohnPaton commented 2 years ago

Alright, I have no experience in this direction so I'm happy to see what you come up with!