earthobservations / luftdatenpumpe

Acquire and process live and historical air quality data without effort. Filter by station-id, sensor-id and sensor-type, apply reverse geocoding, store into time-series and RDBMS databases, publish to MQTT, output as JSON, or visualize in Grafana. Data sources: Sensor.Community (luftdaten.info), IRCELINE, and OpenAQ.
https://luftdatenpumpe.readthedocs.io/
GNU Affero General Public License v3.0

[LDI] How to acquire and process historical data for further analysis in R? #9

Open amotl opened 5 years ago

amotl commented 5 years ago

One of our colleagues would like to download the data (since roughly the beginning of February) from 8 LDI sensors located next to official stations, for further analysis in, e.g., R.

Making a dashboard by using the specific LDI station identifiers is easy, but is there actually some download functionality?

amotl commented 5 years ago

Introduction

For acquiring observations from specific stations, you can use luftdatenpumpe to generate a JSON file, which can then be processed by other tools downstream in the analysis pipeline.

Ad hoc example

We want to outline a basic example here. The output of the command below is available at LDI_BE_7013_10725_13585_2019-09-27T210801Z.json, to give you an idea of what this can do for you.

luftdatenpumpe readings --network=ldi --reverse-geocode --station=7013,10725,13585 > 'LDI_BE_7013_10725_13585_2019-09-27T210801Z.json'
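Once you have such a JSON file, slicing it for analysis only needs the standard library. A minimal sketch, assuming the output is a list of reading objects with `station` and `data` sections -- the field names below are illustrative, so inspect your own file first:

```python
import json

# Hypothetical sample mimicking luftdatenpumpe's JSON output: a list of
# readings, each with "station" and "data" sections. The exact field
# names in real output may differ -- check your own file.
sample = """
[
  {"station": {"station_id": 7013, "position": {"latitude": 50.8, "longitude": 4.3}},
   "data": {"time": "2019-09-27T21:08:01Z", "P1": 12.5, "P2": 8.1}},
  {"station": {"station_id": 10725, "position": {"latitude": 50.9, "longitude": 4.4}},
   "data": {"time": "2019-09-27T21:08:01Z", "P1": 22.0, "P2": 15.3}}
]
"""

readings = json.loads(sample)

# Collect one PM10 ("P1") value per station for further analysis.
pm10_by_station = {
    reading["station"]["station_id"]: reading["data"]["P1"]
    for reading in readings
}
print(pm10_by_station)
```

From here, the dictionary (or the raw list of readings) can be handed over to pandas, exported as CSV, or consumed from R.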

Historical data example

There's a section named LDI CSV archive data examples (InfluxDB) within luftdatenpumpe --help.

In short, you will have to download the historical data first by invoking

wget --mirror --continue --no-host-directories --directory-prefix=/var/spool/archive.luftdaten.info --accept-regex='2019-0[2-9]' http://archive.luftdaten.info/

and then process this data by invoking

luftdatenpumpe readings --network=ldi --station=7013,10725,13585 --source=file:///var/spool/archive.luftdaten.info
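The mirrored archive contains daily per-sensor CSV files. If you want to slice them yourself, without luftdatenpumpe, here is a minimal sketch -- it assumes the semicolon-separated layout of the SDS011 archive files with a `location` column holding the station identifier, which you should verify against your own downloads:

```python
import csv
import io

# Hypothetical excerpt of a daily archive CSV (semicolon-separated).
# Column names are assumptions modelled on the SDS011 files -- verify
# against the real files in /var/spool/archive.luftdaten.info.
raw = """sensor_id;sensor_type;location;lat;lon;timestamp;P1;P2
14123;SDS011;7013;50.800;4.300;2019-02-01T00:02:37;12.50;8.10
14124;SDS011;9999;50.900;4.400;2019-02-01T00:02:41;22.00;15.30
"""

# Keep only the rows belonging to the stations we care about.
wanted_locations = {"7013"}

rows = [
    row for row in csv.DictReader(io.StringIO(raw), delimiter=";")
    if row["location"] in wanted_locations
]
print(rows[0]["timestamp"], rows[0]["P1"])
```

For the real archive you would loop over the daily files with `pathlib.Path.glob` and open each one instead of the inline string.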
amotl commented 5 years ago

Reading Parquet files from R

Please also note that there are by-sensor Parquet files available at http://archive.luftdaten.info/parquet/. While luftdatenpumpe does not have an option for ingesting them yet, we would definitely like to add that as an improvement.

Nevertheless, reading those files directly in R, without using luftdatenpumpe at all, might already be the better option for wrangling the data. You could either use the Arrow R package to access the data files, or one of the R packages for the Spark analytics engine to read the files through its machinery.

The discussion at [1] outlines different ways of accessing Parquet files from R.

[1] https://stackoverflow.com/questions/30402253/how-do-i-read-a-parquet-in-r-and-convert-it-to-an-r-dataframe

amotl commented 4 years ago

We just found this module, which could fill the gap between Python and R.

rpy2 is an interface to R running embedded in a Python process.

-- https://rpy2.bitbucket.io/

See also https://code.likeagirl.io/walking-the-python-r-bridge-66b63bab0fbd.