earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License
358 stars 55 forks source link

InfluxDB ingest fails when using the non-tidy data shape #230

Closed wetterfrosch closed 3 years ago

wetterfrosch commented 3 years ago

Describe the bug Can't push CDC data to InfluxDB. Tested with hourly and 10_minutes resolutions of air_temperature. JSON export looks fine.

To Reproduce Using InfluxDB v1.7.9.

$ wetterdienst --version
wetterdienst 0.10.1

installiert mittels

$ sudo pip install git+https://github.com/earthobservations/wetterdienst
$ wetterdienst dwd readings --parameter=air_temperature --resolution=10_minutes --period=recent --station=1048,4411 --date=2020-10/2020-11 --target="influxdb://localhost:8086/?database=dwd&table=weather"
2020-11-20 16:43:37,700 [wetterdienst.dwd.observations.api] INFO   : Acquiring observations data for air_temperature/10_minutes/recent/station_id_1048.
2020-11-20 16:43:38,160 [wetterdienst.dwd.observations.api] INFO   : Acquiring observations data for air_temperature/10_minutes/recent/station_id_4411.
2020-11-20 16:43:38,650 [wetterdienst.cli              ] INFO   : Writing data to target influxdb://localhost:8086/?database=dwd&table=weather
2020-11-20 16:43:38,651 [wetterdienst.util.pandas      ] INFO   : Writing to InfluxDB ('dwd', 'weather')
Traceback (most recent call last):
  File "/usr/bin/wetterdienst", line 10, in <module>
    sys.exit(run())
  File "/usr/lib/python3.8/site-packages/wetterdienst/cli.py", line 281, in run
    df.io.export(options.target)
  File "/usr/lib/python3.8/site-packages/wetterdienst/util/pandas.py", line 172, in export
    c.write_points(
  File "/usr/lib/python3.8/site-packages/influxdb/_dataframe_client.py", line 118, in write_points
    points = self._convert_dataframe_to_lines(
  File "/usr/lib/python3.8/site-packages/influxdb/_dataframe_client.py", line 390, in _convert_dataframe_to_lines
    tag_df = dataframe[tag_columns]
  File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 2912, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/usr/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/usr/lib/python3.8/site-packages/pandas/core/indexing.py", line 1304, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['element', 'parameter'] not in index"

Expected behavior A happy push of the data to InfluxDB.

Desktop (please complete the following information):

$ cat /proc/version 
Linux version 5.3.7-arch1-1-ARCH (builduser@heftig-2611250) (gcc version 9.2.0 (GCC)) #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019
amotl commented 3 years ago

Dear @wetterfrosch,

thanks for reporting this. We will have a look into the issue you are observing.

With kind regards, Andreas.

amotl commented 3 years ago

Just some notes while working on it and following your method of installing directly from the Git repository.

The full command to install the wetterdienst package including the optional influxdb "extra" is:

pip install git+https://github.com/earthobservations/wetterdienst#egg=wetterdienst[influxdb]

Otherwise, one would have to install this optional "extra package" manually.


Some packages have optional extras. You can tell pip to install these by specifying the extra in brackets.

-- https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#installing-extras

wetterfrosch commented 3 years ago

The full command to install the wetterdienst package including the optional influxdb "extra" is:

pip install git+https://github.com/earthobservations/wetterdienst#egg=wetterdienst[influxdb]

Thank, I pip uninstalled wetterdienst an reinstalled it this way; error keeps on an is the same.

amotl commented 3 years ago

Investigation

I have been able to trace the error back to the optional "tidying" of the data introduced by @gutzbenj with https://github.com/earthobservations/wetterdienst/pull/118. While I introduced the command line option to toggle this data reshaping on/off with https://github.com/earthobservations/wetterdienst/pull/210 and also supported it with appropriate tests for both cases,

https://github.com/earthobservations/wetterdienst/blob/335952dce749be08a1ae6bc6ef9b9a7d2f30245f/tests/test_cli.py#L98-L111

I didn't test both options together with the InfluxDB export feature and unfortunately, we don't ride the unicorn of having full integration tests for this matter yet. So, this might well be an opportunity to introduce that into the mix in order to improve on the QA side.

Mitigation

When explicitly using the --tidy option, things start working again.

wetterdienst dwd readings --parameter=air_temperature --resolution=10_minutes --period=recent --station=1048,4411 --date=2020-10/2020-11 --target="influxdb://localhost:8086/?database=dwd&table=weather" --tidy
amotl commented 3 years ago

More details about the "tidy" data point shape

When explicitly using the --tidy option, things start working again.

Now, you might actually ask about the difference, right?

You can investigate what you will get from that by invoking the command with and without the --tidy option and look at the output to stdout.

wetterdienst dwd readings \
    --parameter=air_temperature --resolution=10_minutes --period=recent \
    --station=1048,4411 --date=2020-10/2020-11

vs.

wetterdienst dwd readings \
    --parameter=air_temperature --resolution=10_minutes --period=recent \
    --station=1048,4411 --date=2020-10/2020-11 \
    --tidy

In fact, this is a topic I already wanted to discuss with you. While the former invocation without --tidy will output the data in the accustomed tabular format, @gutzbenj had very good intentions on reshaping the data into the other format:

I added a "tidy_data" option that reshapes the returned DataFrame properly so that one line has now a datetime, value and corresponding quality flag. (https://github.com/earthobservations/wetterdienst/pull/118)

You might even appreciate this format, because it will yield the respective quality field attached to each single datapoint. You might know already that some datasets of DWD have multiple fields binding to different quality designations within the very same dataset. Providing data in this format clearly resolves that issue, so that each data point has the very same shape.

{
    "station_id":1048,
    "date":"2020-11-11T14:10:00.000Z",
    "parameter":"temperature_air",
    "element":"temperature_air_005",
    "value":6.1,
    "quality":3
}
wetterfrosch commented 3 years ago

When explicitly using the --tidy option, things start working again.

Wow, that just works this way!

amotl commented 3 years ago

You might even appreciate this format [because] each data point has the very same shape.

Now, we will be excited to hear if you also appreciate this format and if you will be able to make the fine Grafana dashboards work with that as good as before, or maybe even better.

If you still prefer the tabular format, please let me know. I will see whether I can provide a fix to be able to ingest data into InfluxDB using the non-tidy data shape as it worked before introducing that.

Let's keep this issue open until this is resolved and the data export feature will be able to handle both shapes again. Thanks!