Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

fix job trackers in Nifi #719

Closed (bossie closed 2 months ago)

bossie commented 3 months ago

Switching to the EJR API endpoint that is not rate-limited by F5 broke the job trackers in Nifi:

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib64/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib64/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/lib64/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1108: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 407, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='jobregistry.vgt.vito.be', port=443): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/opt/venv/lib64/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='jobregistry.vgt.vito.be', port=443): Max retries exceeded with url: /health (Caused by ReadTimeoutError("HTTPSConnectionPool(host='jobregistry.vgt.vito.be', port=443): Read timed out. (read timeout=20)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 629, in main
    elastic_job_registry = get_elastic_job_registry(requests_session) if config.ejr_api else None
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1738, in get_elastic_job_registry
    job_registry.health_check(log=True)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 346, in health_check
    response = self._do_request("GET", "/health", use_auth=use_auth)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 320, in _do_request
    response = self._session.request(
  File "/opt/venv/lib64/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/venv/lib64/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/venv/lib64/python3.8/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='jobregistry.vgt.vito.be', port=443): Max retries exceeded with url: /health (Caused by ReadTimeoutError("HTTPSConnectionPool(host='jobregistry.vgt.vito.be', port=443): Read timed out. (read timeout=20)"))
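
For reproducing this from a Nifi host or container, here is a minimal stand-alone probe, sketched with the standard library only. It is not the driver's `health_check` implementation: the function name `probe_health` and the injectable `opener` argument are hypothetical, and only the `/health` path and the 20 second timeout are taken from the traceback above.

```python
import urllib.request
from typing import Callable, Optional


def probe_health(
    base_url: str,
    timeout: float = 20.0,
    opener: Optional[Callable] = None,
) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200, else False.

    `opener` is a hypothetical hook (defaults to urllib.request.urlopen)
    so the probe can be exercised without network access.
    """
    open_url = opener or urllib.request.urlopen
    url = base_url.rstrip("/") + "/health"
    try:
        with open_url(url, timeout=timeout) as response:
            return response.status == 200
    except OSError as exc:
        # urllib.error.URLError, HTTPError and socket.timeout are all
        # subclasses of OSError, so handshake/read timeouts and HTTP
        # error statuses end up here instead of crashing the caller.
        print(f"health check failed for {url}: {exc!r}")
        return False
```

Calling `probe_health("https://jobregistry.vgt.vito.be")` from the affected environment should then report a failure instead of raising, which makes it easier to compare endpoints side by side.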

bossie commented 3 months ago

Quick-fixed by pointing them to an internal endpoint instead.

This endpoint is not load-balanced, however, so I'll have to file a WRT ticket to make the public endpoint accessible from (Docker containers on) the Nifi hosts.

bossie commented 3 months ago

Internal ref: WRT-4369 ~GDD-3189~

bossie commented 3 months ago

The JobTracker processors in Nifi use the parameter openeo_ejr_api, which should make the transition easy.

bossie commented 2 months ago

Seems to work, but the internal ticket status is unclear ("temporary measure", "to verify"); awaiting a ticket update.

bossie commented 2 months ago

Seems to work on the nifi-prod-0x host machines with the temporary measure in place (= adapting /etc/hosts), so I'm not sure how temporary this is.

The same host mapping is also required in a Docker container:

[vdboschj@nifi-prod-02 ~]$ sudo docker run --rm --entrypoint curl --add-host jobregistry.vgt.vito.be:192.168.201.16 vito-docker-private.artifactory.vgt.vito.be/openeo-yarn:20240327-2502 https://jobregistry.vgt.vito.be/health
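
To see which address a host name resolves to from inside a given host or container (for example, to tell whether an /etc/hosts override or regular DNS is in effect), something like the following sketch could be used. The helper names `resolve_ipv4` and `resolves_to` and the injectable `resolver` argument are hypothetical; the only library call involved is the standard `socket.getaddrinfo`.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str, resolver: Callable = socket.getaddrinfo) -> Set[str]:
    """Return the set of IPv4 addresses that `hostname` resolves to.

    Resolution goes through the system resolver, so /etc/hosts entries
    and --add-host mappings are taken into account.
    """
    infos = resolver(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) tuple.
    return {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos}


def resolves_to(
    hostname: str, expected_ip: str, resolver: Callable = socket.getaddrinfo
) -> bool:
    """Return True if `expected_ip` is among the addresses for `hostname`."""
    try:
        return expected_ip in resolve_ipv4(hostname, resolver)
    except socket.gaierror:
        # The name does not resolve at all in this environment.
        return False
```

For instance, `resolves_to("jobregistry.vgt.vito.be", "192.168.201.16")` would indicate whether the mapping from the command above is visible where the code runs.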

bossie commented 2 months ago

DNS was updated; I will switch to https://jobregistry.vgt.vito.be once the TTL (24h) has passed, provided everything still works.

bossie commented 2 months ago

Changed the EJR endpoint on Nifi from http://docker-services-prod-01.vgt.vito.be:3015 to https://jobregistry.vgt.vito.be; still good. :+1: