canonical / charmed-kubeflow-uats

Automated UATs for Charmed Kubeflow
Apache License 2.0

`kserve-integration` notebook fails on self-hosted runners with "Name or service not known" #47

Closed: orfeas-k closed this issue 10 months ago

orfeas-k commented 10 months ago

Running the kserve-integration UAT notebook from the main branch on a self-hosted runner fails with the following chain of HTTPConnection errors:

```
491 E           NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known
...
525 E           MaxRetryError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
...
581 E           ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8da4e75ac0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
```

The `Name or service not known` error probably means that DNS resolution of the InferenceService hostname failed before any connection was even attempted.
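The `[Errno -2] Name or service not known` comes from `socket.getaddrinfo`, so it can be reproduced and classified in isolation. A minimal diagnostic sketch (the `diagnose` helper is hypothetical, not part of the UAT code; the hostname is taken from the failing cell):

```python
import socket

def diagnose(host: str, port: int = 80) -> str:
    """Classify a connection problem: 'dns-failure' reproduces the
    'Name or service not known' case, which happens during name
    resolution, before any TCP connect is attempted."""
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"
    return "resolves"

# The notebook's target host resolves only from inside a pod that uses the
# cluster DNS (CoreDNS), so on a plain runner this typically prints 'dns-failure'.
print(diagnose("sklearn-iris.test-kubeflow.svc.cluster.local"))
```

If `diagnose` returns `resolves` but the request still fails, the problem is at the TCP/HTTP layer instead, which would point away from DNS.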

Environment

Logs

```bash
______________________ test_notebook[kserve-integration] _______________________

test_notebook = '/tests/notebooks/kserve/kserve-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
>           output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})

/tests/test_notebooks.py:45:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:100: in preprocess
    self.preprocess_cell(cell, resources, index)
/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:121: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py:166: in wrapped
    return loop.run_until_complete(inner)
/opt/conda/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/opt/conda/lib/python3.8/site-packages/nbclient/client.py:1021: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
cell = {'cell_type': 'code', 'execution_count': 9, 'id': '4ef27af2-9ae0-4adf-9058-ecc5ac84ef24', 'metadata': {'execution': {'...}\nresponse = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict", json=inference_input)\nprint(response.text)'}
cell_index = 18
exec_reply = {'buffers': [], 'content': {'ename': 'ConnectionError', 'engine_info': {'engine_id': -1, 'engine_uuid': 'a9889439-8e38...e, 'engine': 'a9889439-8e38-4cd4-91d1-fed3131c0170', 'started': '2023-11-14T11:40:36.221455Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
        if exec_reply is None:
            return None
        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None
        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           inference_input = {
E               "instances": [
E                   [6.8, 2.8, 4.8, 1.4],
E                   [6.0, 3.4, 4.5, 1.6]
E               ]
E           }
E           response = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict", json=inference_input)
E           print(response.text)
E           ------------------
E
E           ---------------------------------------------------------------------------
E           gaierror                                  Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
E               173 try:
E           --> 174     conn = connection.create_connection(
E               175         (self._dns_host, self.port), self.timeout, **extra_kw
E               176     )
E               178 except SocketTimeout:
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py:72, in create_connection(address, timeout, source_address, socket_options)
E                68     return six.raise_from(
E                69         LocationParseError(u"'%s', label empty or too long" % host), None
E                70     )
E           ---> 72 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
E                73     af, socktype, proto, canonname, sa = res
E
E           File /opt/conda/lib/python3.8/socket.py:918, in getaddrinfo(host, port, family, type, proto, flags)
E               917 addrlist = []
E           --> 918 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
E               919     af, socktype, proto, canonname, sa = res
E
E           gaierror: [Errno -2] Name or service not known
E
E           During handling of the above exception, another exception occurred:
E
E           NewConnectionError                        Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:714, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
E               713 # Make the request on the httplib connection object.
E           --> 714 httplib_response = self._make_request(
E               715     conn,
E               716     method,
E               717     url,
E               718     timeout=timeout_obj,
E               719     body=body,
E               720     headers=headers,
E               721     chunked=chunked,
E               722 )
E               724 # If we're going to release the connection in ``finally:``, then
E               725 # the response doesn't need to know about the connection. Otherwise
E               726 # it will also try to release it and we'll have a double-release
E               727 # mess.
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:415, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
E               414 else:
E           --> 415     conn.request(method, url, **httplib_request_kw)
E               417 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
E               418 # legitimately able to close the connection after sending a valid response.
E               419 # With this behaviour, the received response is still readable.
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:244, in HTTPConnection.request(self, method, url, body, headers)
E               243     headers["User-Agent"] = _get_default_user_agent()
E           --> 244 super(HTTPConnection, self).request(method, url, body=body, headers=headers)
E
E           File /opt/conda/lib/python3.8/http/client.py:1252, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
E              1251 """Send a complete request to the server."""
E           -> 1252 self._send_request(method, url, body, headers, encode_chunked)
E
E           File /opt/conda/lib/python3.8/http/client.py:1298, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
E              1297     body = _encode(body, 'body')
E           -> 1298 self.endheaders(body, encode_chunked=encode_chunked)
E
E           File /opt/conda/lib/python3.8/http/client.py:1247, in HTTPConnection.endheaders(self, message_body, encode_chunked)
E              1246     raise CannotSendHeader()
E           -> 1247 self._send_output(message_body, encode_chunked=encode_chunked)
E
E           File /opt/conda/lib/python3.8/http/client.py:1007, in HTTPConnection._send_output(self, message_body, encode_chunked)
E              1006 del self._buffer[:]
E           -> 1007 self.send(msg)
E              1009 if message_body is not None:
E              1010
E              1011 # create a consistent interface to message_body
E
E           File /opt/conda/lib/python3.8/http/client.py:947, in HTTPConnection.send(self, data)
E               946 if self.auto_open:
E           --> 947     self.connect()
E               948 else:
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:205, in HTTPConnection.connect(self)
E               204 def connect(self):
E           --> 205     conn = self._new_conn()
E               206     self._prepare_conn(conn)
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connection.py:186, in HTTPConnection._new_conn(self)
E               185 except SocketError as e:
E           --> 186     raise NewConnectionError(
E               187         self, "Failed to establish a new connection: %s" % e
E               188     )
E               190 return conn
E
E           NewConnectionError: : Failed to establish a new connection: [Errno -2] Name or service not known
E
E           During handling of the above exception, another exception occurred:
E
E           MaxRetryError                             Traceback (most recent call last)
E           File /opt/conda/lib/python3.8/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
E               485 try:
E           --> 486     resp = conn.urlopen(
E               487         method=request.method,
E               488         url=url,
E               489         body=request.body,
E               490         headers=request.headers,
E               491         redirect=False,
E               492         assert_same_host=False,
E               493         preload_content=False,
E               494         decode_content=False,
E               495         retries=self.max_retries,
E               496         timeout=timeout,
E               497         chunked=chunked,
E               498     )
E               500 except (ProtocolError, OSError) as err:
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py:798, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
E               796     e = ProtocolError("Connection aborted.", e)
E           --> 798 retries = retries.increment(
E               799     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
E               800 )
E               801 retries.sleep()
E
E           File /opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
E               591 if new_retry.is_exhausted():
E           --> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
E               594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
E
E           MaxRetryError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known'))
E
E           During handling of the above exception, another exception occurred:
E
E           ConnectionError                           Traceback (most recent call last)
E           Cell In[9], line 7
E               1 inference_input = {
E               2     "instances": [
E               3         [6.8, 2.8, 4.8, 1.4],
E               4         [6.0, 3.4, 4.5, 1.6]
E               5     ]
E               6 }
E           ----> 7 response = requests.post(f"{isvc_url}/v1/models/sklearn-iris:predict", json=inference_input)
E               8 print(response.text)
E
E           File /opt/conda/lib/python3.8/site-packages/requests/api.py:115, in post(url, data, json, **kwargs)
E               103 def post(url, data=None, json=None, **kwargs):
E               104     r"""Sends a POST request.
E               105
E               106     :param url: URL for the new :class:`Request` object.
E               (...)
E               112     :rtype: requests.Response
E               113     """
E           --> 115 return request("post", url, data=data, json=json, **kwargs)
E
E           File /opt/conda/lib/python3.8/site-packages/requests/api.py:59, in request(method, url, **kwargs)
E                55 # By using the 'with' statement we are sure the session is closed, thus we
E                56 # avoid leaving sockets open which can trigger a ResourceWarning in some
E                57 # cases, and look like a memory leak in others.
E                58 with sessions.Session() as session:
E           ---> 59     return session.request(method=method, url=url, **kwargs)
E
E           File /opt/conda/lib/python3.8/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
E               584 send_kwargs = {
E               585     "timeout": timeout,
E               586     "allow_redirects": allow_redirects,
E               587 }
E               588 send_kwargs.update(settings)
E           --> 589 resp = self.send(prep, **send_kwargs)
E               591 return resp
E
E           File /opt/conda/lib/python3.8/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
E               700 start = preferred_clock()
E               702 # Send the request
E           --> 703 r = adapter.send(request, **kwargs)
E               705 # Total elapsed time of the request (approximately)
E               706 elapsed = preferred_clock() - start
E
E           File /opt/conda/lib/python3.8/site-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
E               515 if isinstance(e.reason, _SSLError):
E               516     # This branch is for urllib3 v1.22 and later.
E               517     raise SSLError(e, request=request)
E           --> 519 raise ConnectionError(e, request=request)
E               521 except ClosedPoolError as e:
E               522     raise ConnectionError(e, request=request)
E
E           ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known'))
E           ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known'))

/opt/conda/lib/python3.8/site-packages/nbclient/client.py:915: CellExecutionError

During handling of the above exception, another exception occurred:

test_notebook = '/tests/notebooks/kserve/kserve-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
>           pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")
E           Failed: Notebook execution failed with ConnectionError: HTTPConnectionPool(host='sklearn-iris.test-kubeflow.svc.cluster.local', port=80): Max retries exceeded with url: /v1/models/sklearn-iris:predict (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known'))

/tests/test_notebooks.py:50: Failed
...
----------------------------- Captured stderr call -----------------------------
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.0 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.24.4 which is incompatible.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 1.8.22 requires kubernetes<26,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
jupyter-server 1.23.6 requires anyio<4,>=3.1.0, but you have anyio 4.0.0 which is incompatible.
```
orfeas-k commented 10 months ago

An interesting finding is that the kserve-integration UAT from the main branch PASSED when deployed against the latest/edge bundle.yaml file. I will rerun the UATs from track/1.7 on the 1.7/stable bundle. The only difference between the UAT branches is this bugfix PR, but when Kserve was upgraded from 0.10 to 0.11, we also switched from RawDeployment mode to Serverless, which could affect the K8s Services created by Kserve.

orfeas-k commented 10 months ago

As noted in https://github.com/canonical/kserve-operators/issues/148 and in the notebook's PR https://github.com/canonical/charmed-kubeflow-uats/pull/10:

> This only works with Serverless deployment mode at the moment

Thus, the above behavior is expected, since we ran this with CKF 1.7, which deploys Kserve in RawDeployment mode. We will close this issue, but we need to investigate https://github.com/canonical/kserve-operators/issues/148 further in order to understand:

  1. Why a K8s Service is not created in the first place in RawDeployment mode
  2. Whether we should stick with RawDeployment in 1.7 or switch to Serverless there as well, to get KServe working
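For point 2, the mode a KServe install is configured for can be read from the `deploy` entry of its `inferenceservice-config` ConfigMap (e.g. `kubectl get configmap inferenceservice-config -n kserve -o jsonpath='{.data.deploy}'`; the namespace and key layout here are assumptions based on upstream KServe, not verified against the charm). A minimal sketch of parsing that value:

```python
import json

def default_deployment_mode(deploy_json: str) -> str:
    """Parse the 'deploy' value of KServe's inferenceservice-config
    ConfigMap and return the configured default deployment mode."""
    return json.loads(deploy_json)["defaultDeploymentMode"]

# Example values in the assumed upstream ConfigMap layout:
print(default_deployment_mode('{"defaultDeploymentMode": "RawDeployment"}'))  # RawDeployment
print(default_deployment_mode('{"defaultDeploymentMode": "Serverless"}'))     # Serverless
```

In Serverless mode, Knative creates the `<isvc>.<namespace>.svc.cluster.local` Service the notebook targets; in RawDeployment mode that name may simply not exist, matching the DNS failure above.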
orfeas-k commented 10 months ago

Reopening. We will keep this issue open until we update the kserve-integration UAT notebook with a note that it works only with the Serverless deployment mode of Kserve.
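Until the notebook carries that note, a guard cell along these lines (hypothetical, not part of the current notebook) could fail fast with an explicit hint instead of surfacing the raw `ConnectionError`:

```python
import socket

# Hostname the notebook builds for the InferenceService (from the failing cell).
ISVC_HOST = "sklearn-iris.test-kubeflow.svc.cluster.local"

def check_isvc_dns(host: str) -> str:
    """Return 'ok' if the hostname resolves, otherwise a hint pointing at
    the Serverless-only limitation of this UAT."""
    try:
        socket.getaddrinfo(host, 80)
        return "ok"
    except socket.gaierror:
        return (f"{host} does not resolve: this UAT currently requires KServe "
                "in Serverless deployment mode (RawDeployment creates no such Service)")

print(check_isvc_dns(ISVC_HOST))
```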