Deltares / ddlpy

API to Dutch Rijkswaterstaat archive (DDL, waterinfo.rws.nl) of monitoring water data
https://deltares.github.io/ddlpy/
GNU General Public License v3.0

add `max_retries` for `requests` #101

Open veenstrajelmer opened 6 months ago

veenstrajelmer commented 6 months ago

Description

Sometimes, in the middle of data retrieval, the connection is aborted from the server side. I have not been able to reproduce the error (and forgot to copy the traceback), but it is very inconvenient since it interrupts the download process.

Suggestion

Add a `max_retries` parameter for `requests` to improve the robustness of ddlpy.

import logging
import requests

from requests.adapters import HTTPAdapter, Retry

logging.basicConfig(level=logging.DEBUG)

s = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('http://', HTTPAdapter(max_retries=retries))

s.get("http://httpstat.us/503")
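
The snippet above mounts the adapter for `http://` only and issues a GET. The traceback further down shows ddlpy calling `requests.post` against an HTTPS endpoint, so a variant closer to that case might look like the sketch below. `make_retry_session` is a hypothetical helper, not part of ddlpy. It assumes urllib3 1.26 or newer (for `allowed_methods`) and that the request is safe to repeat, since urllib3 does not retry POST on status codes by default.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retry_session(total=3, backoff_factor=1.0):
    """Return a requests.Session that retries transient failures (hypothetical helper)."""
    retries = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[502, 503, 504],
        # POST is excluded from status-based retries by default; opt in explicitly.
        allowed_methods=["GET", "POST"],
    )
    adapter = HTTPAdapter(max_retries=retries)
    session = requests.Session()
    # Mount for both schemes so HTTPS endpoints are covered as well.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A call would then go through the session, for example `make_retry_session().post(url, json=request, timeout=(10, 60))`. The timeout values are an assumption for illustration; without a `timeout` argument, requests waits as long as the operating system allows.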
Weidav commented 5 months ago

I think I'm getting the same error here, still on 0.4.0 though. Here's my traceback:

Traceback (most recent call last):
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connectionpool.py", line 492, in _make_request
    raise new_e
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    self._validate_conn(conn)
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1097, in _validate_conn
    conn.connect()
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connection.py", line 212, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7ffa8455ab10>, 'Connection to waterwebservices.rijkswaterstaat.nl timed out. (connect timeout=None)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='waterwebservices.rijkswaterstaat.nl', port=443): Max retries exceeded with url: /ONLINEWAARNEMINGENSERVICES_DBO/OphalenWaarnemingen (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ffa8455ab10>, 'Connection to waterwebservices.rijkswaterstaat.nl timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/rws/rwsload.py", line 454, in <module>
    dsn=sentry_dsn,
^^^^^^
  File "~/rws/rwsload.py", line 436, in main
    insertion_status = ReportsInsertionService.process_report(session=session, reports_data=result)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "~/rws/rwsload.py", line 115, in fetch_data
    except JSONDecodeError:
                       ^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/ddlpy/ddlpy.py", line 357, in measurements
    measurement = _measurements_slice(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/ddlpy/ddlpy.py", line 301, in _measurements_slice
    resp = requests.post(endpoint["url"], json=request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/virtualenvs/weatherdata/lib/python3.11/site-packages/requests/adapters.py", line 507, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='waterwebservices.rijkswaterstaat.nl', port=443): Max retries exceeded with url: /ONLINEWAARNEMINGENSERVICES_DBO/OphalenWaarnemingen (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ffa8455ab10>, 'Connection to waterwebservices.rijkswaterstaat.nl timed out. (connect timeout=None)'))
veenstrajelmer commented 5 months ago

@Weidav that could be the case. Fixing this issue would prevent your process from being interrupted by a single timeout. There could of course also be an outage of the Rijkswaterstaat server, in which case the process will fail either way. However, this problem is difficult to fix: we have no way to simulate a single timeout on the server side, which makes it hard to debug. This is also a nice-to-have feature, not as essential as the recently implemented developments. If you run into this issue again, please include minimal example code to reproduce it, if it can be reproduced at all.
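
A single timeout can perhaps be approximated on the client side instead, by patching `requests.post` in a test. The sketch below uses `unittest.mock` from the standard library; `fetch` is a hypothetical stand-in for the `requests.post` call shown in the traceback, and the URL is a placeholder. It only demonstrates the simulation itself, not a fix.

```python
from unittest import mock

import requests


def fetch(url, payload):
    """Hypothetical stand-in for the requests.post call inside ddlpy."""
    return requests.post(url, json=payload)


def simulate_single_timeout():
    """Make the first requests.post call time out and the second one succeed."""
    ok_response = mock.Mock(status_code=200)
    # Exception instances in side_effect are raised, other values are returned.
    side_effects = [requests.exceptions.ConnectTimeout("simulated"), ok_response]
    outcomes = []
    with mock.patch("requests.post", side_effect=side_effects):
        for _ in range(2):
            try:
                outcomes.append(fetch("https://example.invalid/", {}).status_code)
            except requests.exceptions.ConnectTimeout:
                outcomes.append("timeout")
    return outcomes
```

Because the patch replaces the `post` attribute on the `requests` module, it should also affect any code that calls `requests.post(...)` through `import requests`, which is what the traceback suggests ddlpy does.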

Weidav commented 5 months ago

This keeps happening on a regular basis.

I use selected_stations.csv to store the result from ddlpy.locations() because that endpoint causes issues on a regular basis. This approach is more stable and allows me to fetch the measurements directly. I tried updating the CSV file, thinking that the stations and their available parameters might have changed, but that didn't help.

Here's a little snippet from my code:

EDIT: updated the CSV again and I think that leads to fewer exceptions with measurements. I'll keep you updated.

    selected_stations = pandas.read_csv("selected_stations.csv", index_col=0)

    # measurements-timezone is always in utc+1
    one_h_ago = datetime.utcnow() - timedelta(hours=2.1)
    tomorrow = datetime.utcnow() + timedelta(days=1, hours=1)

    # iterate over my known spots
    for rws_id, spot_id in spots_dict.items():
        try:
            station = selected_stations.loc[rws_id]
        except KeyError:
            logger.info(f"spot-id: {spot_id} source_station-id: {rws_id} has no measurements")
            continue

        # when a station has only one entry, it is usually incomplete and stored as a series
        if isinstance(station, pandas.Series):
            logger.debug(f"{spot_id} measurements are incomplete and will be ignored")
            # a Series has no iterrows(), so skip this station as the log message says
            continue

        i = 0
        # iterate over the different measurement types (wind, waves...) from this station
        for index, station_data in station.iterrows():
            try:
                measurements = ddlpy.measurements(
                    station_data, start_date=one_h_ago, end_date=tomorrow
                )
            except JSONDecodeError:
                continue
    [...]
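
Note that the `except JSONDecodeError` above does not catch the `ConnectTimeout` from the traceback. Until retries exist in ddlpy itself, a caller-side workaround could be a small wrapper like the sketch below. `call_with_retries` is a hypothetical helper, and the attempt count and waiting time are arbitrary assumptions.

```python
import time

import requests


def call_with_retries(func, *args, attempts=3, wait_seconds=1.0, **kwargs):
    """Call func, retrying on connection errors and timeouts (hypothetical helper)."""
    transient = (requests.exceptions.ConnectionError, requests.exceptions.Timeout)
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except transient:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            time.sleep(wait_seconds * attempt)  # wait a little longer each time
```

In the loop above this would be used as `call_with_retries(ddlpy.measurements, station_data, start_date=one_h_ago, end_date=tomorrow)`. It will not help during a longer outage of the server, in which case the last exception is re-raised.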
Weidav commented 5 months ago

Update: I keep running into the same issues, even with up-to-date locations / CSV file.

veenstrajelmer commented 5 months ago

Could you provide example code that reproduces the issue without any of your own files or local code? That is, a minimal example requiring only ddlpy and its dependencies.