flopach / digits-recognizer-kubeflow

Sample MLOps Workflow: Recognizing Digits with Kubeflow
MIT License
156 stars, 101 forks

Unable to finish "get data batch" step in a run #4

Open jsitu777 opened 1 year ago

jsitu777 commented 1 year ago

Hi,

I was following your tutorial at https://www.youtube.com/watch?v=6wWdNg0GMV4. I have Kubeflow set up on an EKS cluster, version 1.23 (with the ebs-csi driver set up as instructed). Kubeflow itself seems to be working: I tried the Demo XGBoost pipeline and it completed.

I set up the notebook with "allow access to Kubeflow Pipelines" checked, and applied access_kfp_from_jupyter_notebook.yaml and set-minio-kserve-secret.yaml.

I am also able to access minio and see some artifacts generated etc.

When I run digits_recognizer_pipeline.ipynb, the "get latest data" step finishes quickly, but the "get data batch" step gets stuck and times out.

Here's the log:

time="2023-02-21T23:47:25.234Z" level=info msg="capturing logs" argo=true
getting data
2023-02-21 23:47:25.569490: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-21 23:47:25.569521: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/opt/conda/lib/python3.8/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/opt/conda/lib/python3.8/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1007, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.8/http/client.py", line 947, in send
    self.connect()
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa9c03eff10>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/tmp.oceiY470Q3", line 76, in <module>
    _outputs = get_data_batch(**_parsed_args)
  File "/tmp/tmp.oceiY470Q3", line 19, in get_data_batch
    minio_client.fget_object(minio_bucket,"mnist.npz","/tmp/mnist.npz")
  File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 787, in fget_object
    stat = self.stat_object(
  File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 1195, in stat_object
    response = self._url_open(
  File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2189, in _url_open
    region = self._get_bucket_region(bucket_name)
  File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2067, in _get_bucket_region
    region = self._get_bucket_location(bucket_name)
  File "/opt/conda/lib/python3.8/site-packages/minio/api.py", line 2100, in _get_bucket_location
    response = self._http.urlopen(method, url,
  File "/opt/conda/lib/python3.8/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 783, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='100.65.11.110', port=9000): Max retries exceeded with url: /mlpipeline?location= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa9c03eff10>: Failed to establish a new connection: [Errno 110] Connection timed out'))
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/datapoints_test/data" argo=true error="stat /tmp/outputs/datapoints_test/data: no such file or directory"
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/datapoints_training/data" argo=true error="stat /tmp/outputs/datapoints_training/data: no such file or directory"
time="2023-02-22T00:00:37.229Z" level=error msg="cannot save artifact /tmp/outputs/dataset_version/data" argo=true error="stat /tmp/outputs/dataset_version/data: no such file or directory"
Error: exit status 1

What might be wrong? The Kubeflow version is 1.6.1.

jsitu777 commented 1 year ago

From pip list:

kfp                          1.8.13
kfp-pipeline-spec            0.1.16
kserve                       0.9.0

lantgabor commented 1 year ago

Same error for me even after setting up the serviceAccount. Did you figure it out?

reveever commented 1 year ago

Same error for me even after setting up the serviceAccount. Did you figure it out?

Download the mnist.npz file from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz and place it in the root directory of MinIO.
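
For anyone who wants to script this, here is a minimal sketch rather than a confirmed fix. It first checks that the MinIO port is reachable at all, since the traceback above ends in a connection timeout on port 9000, and then uploads mnist.npz to the bucket root. The function names, the MINIO_* environment variables, and the /tmp/mnist.npz path are placeholders I made up for illustration. The bucket name mlpipeline is taken from the URL in the log. The only third-party dependency is the minio package, which the pipeline step already uses.

```python
import os
import socket
import urllib.request

MNIST_URL = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz"


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False


def upload_mnist(client, bucket, local_path, object_name="mnist.npz"):
    """Upload local_path to the root of bucket via a minio.Minio-style client."""
    if not client.bucket_exists(bucket_name=bucket):
        raise RuntimeError(f"bucket {bucket!r} does not exist")
    client.fput_object(
        bucket_name=bucket, object_name=object_name, file_path=local_path
    )
    return f"{bucket}/{object_name}"


def main():
    # Imported here so the helpers above work without the minio package.
    from minio import Minio

    # Hypothetical environment variables; set them to match your cluster.
    host = os.environ["MINIO_HOST"]
    port = int(os.environ.get("MINIO_PORT", "9000"))

    if not can_connect(host, port):
        raise SystemExit(f"cannot reach {host}:{port}; check the endpoint first")

    local_path = "/tmp/mnist.npz"  # placeholder download location
    urllib.request.urlretrieve(MNIST_URL, local_path)

    client = Minio(
        endpoint=f"{host}:{port}",
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
        secure=False,  # the log shows a plain HTTP connection
    )
    print("uploaded", upload_mnist(client, "mlpipeline", local_path))
```

Call main() from a notebook cell or pod that sits on the same network as the pipeline steps. If can_connect returns False there, the upload will time out in the same way, so the endpoint address is worth checking before anything else.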