Add Parquet data export

earthobservations / wetterdienst

Open weather data for humans.

https://wetterdienst.readthedocs.io/

MIT License

Add Parquet data export #211

Closed amotl closed 3 years ago

amotl commented 3 years ago

About

I would like to be able to download historical or recent data for all stations and store it in a database and/or in HDF5 files. Another option would be to produce Parquet, Arrow and Feather files.

With #158, Wetterdienst already gained a backplane for different postprocessing tasks in order to export data into different databases.

amotl commented 3 years ago

With 568824d, things like this will be possible:

# Export to Parquet file
wetterdienst dwd readings --station=all --parameter=kl --resolution=daily --period=recent --persist --target=file://recent-all.parquet

# Export to Feather file
wetterdienst dwd readings --station=all --parameter=kl --resolution=daily --period=recent --persist --target=file://recent-all.feather
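
In both commands the output format follows from the file suffix of the `--target` URL. As a rough illustration of that idea, here is a minimal sketch in Python. It is not Wetterdienst's actual implementation: the function name `resolve_file_target` and the `SUFFIX_TO_FORMAT` mapping are hypothetical, and only the standard library is used.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical mapping for illustration; not taken from Wetterdienst itself.
SUFFIX_TO_FORMAT = {
    ".parquet": "parquet",
    ".feather": "feather",
}


def resolve_file_target(target: str) -> Tuple[str, str]:
    """Split a ``file://`` target into (path, format), inferring the format from the suffix."""
    url = urlparse(target)
    if url.scheme != "file":
        raise ValueError(f"Unsupported scheme: {url.scheme!r}")
    # For "file://recent-all.parquet" the file name ends up in `netloc`,
    # for "file:///tmp/recent-all.parquet" it ends up in `path`; join both.
    path = url.netloc + url.path
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"Cannot infer export format from suffix {suffix!r}")
    return path, SUFFIX_TO_FORMAT[suffix]


path, fmt = resolve_file_target("file://recent-all.parquet")
print(path, fmt)
```

An unknown suffix raises a `ValueError` instead of silently falling back to a default format, so a typo in the target name would not produce a file in an unexpected format.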

amotl commented 3 years ago

For writing to Parquet files, we might want to use fastparquet instead of pyarrow. Would you also recommend that, @martindurant?

martindurant commented 3 years ago

fastparquet is less actively developed these days (mostly just fixes), but it is in some ways simpler and usually easier and smaller to install than arrow. Whatever works for you, I think. I would not use feather, which is not a standard for archiving, more of a convenience format.

amotl commented 3 years ago

Thanks for your feedback on this, Martin.

> fastparquet is less actively developed these days.

Sorry to hear that. I believe the current implementation based on pyarrow is simple enough to follow and also integrates reasonably with Pandas.

https://github.com/earthobservations/wetterdienst/blob/568824dc50fb6d666090047cdda845bb79db362d/wetterdienst/util/pandas.py#L127-L133

I was just thinking about raw processing speed and wanted to ask whether fastparquet would offer something on this end.

I also noticed from the fastparquet documentation that the designated filename suffix for Parquet files should be .parq instead?

> I would not use feather, which is not a standard for archiving, more of a convenience format.

Thanks. I believe Feather was popular within the R community? Do you believe these times are gone and everyone is (should) just (be) using Parquet these days?

martindurant commented 3 years ago

> I was just thinking about raw processing speed and wanted to ask whether fastparquet would offer something on this end.

In some specific situations, fastparquet may be slightly faster, but on average, arrow performs better.

> filename suffix for Parquet files should be .parq

There is no standard. .parquet is also popular.

> Thanks. I believe Feather was popular within the R community? Do you believe these times are gone and everyone is (should) just (be) using Parquet these days?

Parquet actually predates Feather, and was designed with long-term storage in mind. It was designed for cloud/cluster storage and parallel access: originally for the likes of Hadoop, but taken up by Spark and a whole range of big data tools as the de facto standard.

amotl commented 3 years ago

Dear Martin,

All right, thanks again for sharing these insights. We will drop support for Feather then and will happily continue to name the Parquet files *.parquet.

With kind regards, Andreas.