earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License

Exporting a reasonable amount of data to InfluxDB fails #235

Closed amotl closed 3 years ago

amotl commented 3 years ago

Describe the bug

When trying to export a larger amount of data to InfluxDB, Wetterdienst croaks. Thanks for reporting this, @wetterfrosch!

To reproduce

wetterdienst dwd readings \
    --parameter=air_temperature --resolution=hourly --period=recent \
    --latitude=52.5 --longitude=13.4 --distance=200 \
    --target="influxdb://localhost:8086/?database=dwd&table=weather" \
    --tidy

Full traceback

Traceback (most recent call last):
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/bin/wetterdienst", line 33, in <module>
    sys.exit(load_entry_point('wetterdienst', 'console_scripts', 'wetterdienst')())
  File "/Users/amo/dev/earthobservations/wetterdienst/wetterdienst/cli.py", line 281, in run
    df.io.export(options.target)
  File "/Users/amo/dev/earthobservations/wetterdienst/wetterdienst/util/pandas.py", line 172, in export
    c.write_points(
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/lib/python3.8/site-packages/influxdb/_dataframe_client.py", line 133, in write_points
    super(DataFrameClient, self).write_points(
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/lib/python3.8/site-packages/influxdb/client.py", line 594, in write_points
    return self._write_points(points=points,
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/lib/python3.8/site-packages/influxdb/client.py", line 672, in _write_points
    self.write(
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/lib/python3.8/site-packages/influxdb/client.py", line 404, in write
    self.request(
  File "/Users/amo/Library/Caches/pypoetry/virtualenvs/wetterdienst-EkOFQaO8-py3.8/lib/python3.8/site-packages/influxdb/client.py", line 369, in request
    raise InfluxDBClientError(err_msg, response.status_code)
influxdb.exceptions.InfluxDBClientError: 413: {"error":"Request Entity Too Large"}

amotl commented 3 years ago

@wetterfrosch suggested:

From my own work on ingesting DWD data into InfluxDB the other day, I remember that it doesn't like to receive more than 20k lines of line protocol at once. At least, that's what the documentation said last year.

Back then, I saved all lines into one file and split it into chunks of 20k lines each, using the awesome Unix command split. Then I submitted the chunks to InfluxDB one after another.
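
The same chunking idea can be sketched in Python. This is only an illustration of the approach described above, not code from Wetterdienst: the helper name `chunked` and the sample lines are made up, and the 20k figure is the recollection quoted above rather than a verified limit.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(lines: Iterable[str], size: int = 20_000) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines (hypothetical helper)."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Made-up sample data standing in for a real file of line protocol records.
lines = [f"weather,station=demo value={i} {i}" for i in range(45_000)]

# Each batch would then be submitted to InfluxDB in its own request.
batch_sizes = [len(batch) for batch in chunked(lines)]
print(batch_sizes)
```

With 45,000 sample lines this produces two full batches and one partial batch, much like running `split` on a file and uploading the resulting pieces one by one.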

amotl commented 3 years ago

I can confirm the command outlined above yields more than 20k data points.

wetterdienst dwd readings \
    --parameter=air_temperature --resolution=hourly --period=recent \
    --latitude=52.5 --longitude=13.4 --distance=200 --tidy | jq length

1742052

Even without --tidy, the command still yields 871026 data points, and exporting them to InfluxDB also fails with the Request Entity Too Large error. Reproducing this requires the fix from #237.

amotl commented 3 years ago

I can't confirm that the limit is based on the number of data points. At the very least, the limit is not 20k: when truncating the Pandas DataFrame using df = df[:270000], the write operation still succeeds.

When going beyond that by truncating to 272k data points using df = df[:272000], the write operation croaks again with the Request Entity Too Large error. So maybe this is actually caused by a size limit on the HTTP request body?
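
One rough way to check the body-size hypothesis is to estimate how many bytes a single data point occupies in line protocol and compare that against an assumed limit. The sketch below is back-of-the-envelope only: the measurement, tag, and field names are invented, and the 25,000,000 byte figure is an assumption based on the `max-body-size` HTTP setting documented for InfluxDB 1.x, which should be verified against the actual server configuration.

```python
def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Render one simplified line protocol record (no escaping, numeric fields only)."""
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


# Assumed request body limit in bytes; not confirmed by this thread.
ASSUMED_LIMIT = 25_000_000

# Hypothetical record; real tag and field names depend on the exported DataFrame.
sample = to_line(
    "weather",
    {"station_id": "00433", "parameter": "temperature_air_200"},
    {"value": 12.3},
    1_600_000_000_000_000_000,
)

# One extra byte per record for the newline separating records in the body.
bytes_per_point = len(sample.encode("utf-8")) + 1
max_points = ASSUMED_LIMIT // bytes_per_point
print(f"{bytes_per_point} bytes per point, roughly {max_points} points per request")
```

If the estimate lands in the same order of magnitude as the observed threshold between 270k and 272k rows, that supports a byte-based limit rather than a limit on the number of points.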

amotl commented 3 years ago

Fortunately, the DataFrameClient's write_points() method offers a batch_size parameter [1]. With batch_size=20000, the whole operation of writing 1.7 million data points succeeds within ~50 seconds. Using batch_size=100000 takes roughly the same amount of time.
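
A minimal sketch of passing that parameter through is shown here. The wrapper `export_dataframe` and the `RecordingClient` stand-in are hypothetical and exist only so the snippet runs without a database; the actual change in Wetterdienst may look different. With the real library, `client` would be an `influxdb.DataFrameClient` and `df` a Pandas DataFrame with a datetime index.

```python
class RecordingClient:
    """Hypothetical stand-in for influxdb.DataFrameClient that only records calls."""

    def __init__(self):
        self.calls = []

    def write_points(self, dataframe, measurement, batch_size=None):
        self.calls.append((measurement, batch_size))
        return True


def export_dataframe(client, df, measurement, batch_size=20_000):
    """Write `df` through `client`, letting the client split it into batches."""
    return client.write_points(df, measurement, batch_size=batch_size)


# With a running InfluxDB, the client would instead be created like this:
#   from influxdb import DataFrameClient
#   client = DataFrameClient(host="localhost", port=8086, database="dwd")
client = RecordingClient()
ok = export_dataframe(client, df=list(range(45_000)), measurement="weather")
print(ok, client.calls)
```

Since the timings reported above were about the same for 20000 and 100000, the exact batch size seems to matter less than keeping each request below the server's limit.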

#245 addressed this issue. Thanks again for reporting this, @wetterfrosch!

[1] https://influxdb-python.readthedocs.io/en/latest/api-documentation.html#influxdb.DataFrameClient.write_points