PacktPublishing / Machine-Learning-on-Kubernetes

Machine Learning on Kubernetes, published by Packt
MIT License

Chapter07: subprocess.CalledProcessError: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1. #14

Closed aquynh1682 closed 1 year ago

aquynh1682 commented 1 year ago

Platform: Kubespray, Kubernetes v1.26.6

Good day to you all. I am following along with your documentation and am up to page 210.

I'm hitting an error in the Airflow pipeline, specifically in the build_push_image step. From what I can see, the step finishes installing requirements.txt and then fails with: subprocess.CalledProcessError: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1. (I don't understand why Airflow logs this as INFO, while the pod logs show it as an error, lol :))). Please help me investigate. Below are the full log and the structure inside the build_push_image.py file. Thank you very much.
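For context, the bootstrapper traceback below shows the step is executed as `subprocess.run(['python3', python_script], stdout=log_file, stderr=subprocess.STDOUT, check=True)`: the script's own output, including the real error message, is redirected into `build_push_image.log`, so the parent process can only report the exit status. A minimal sketch of that pattern (the failing script body is hypothetical, and `sys.executable` stands in for the pod's `python3`):

```python
import pathlib
import subprocess
import sys
import tempfile

# Reproduce the bootstrapper's pattern (bootstrapper.py line 261 in the
# traceback below): the child's stdout/stderr go into a log file, so the
# parent sees only the exit status, never the actual error message.
workdir = pathlib.Path(tempfile.mkdtemp())
script = workdir / "build_push_image.py"
# Hypothetical failing body, standing in for the real script:
script.write_text("raise SystemExit('push failed: registry credentials missing')\n")

log_path = workdir / "build_push_image.log"
error = None
with open(log_path, "w") as log_file:
    try:
        subprocess.run([sys.executable, str(script)],
                       stdout=log_file, stderr=subprocess.STDOUT, check=True)
    except subprocess.CalledProcessError as ex:
        error = ex  # all the operator can report: "returned non-zero exit status 1"

print(f"parent saw: exit status {error.returncode}")
print(f"real reason (only in the log file): {log_path.read_text().strip()}")
```

So the CalledProcessError in the Airflow log is only the symptom; the actual failure reason ends up in build_push_image.log, which the bootstrapper uploads to MinIO.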

Log in pod:

[root@k8s-master airflow-dags]# kubectl logs -f -n ml-workshop build-push-image.2bc94f1f645145b2a24d8577fa5c32de
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18688  100 18688    0     0  50644      0 --:--:-- --:--:-- --:--:-- 50644
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1282  100  1282    0     0  11446      0 --:--:-- --:--:-- --:--:-- 11446
Requirement already satisfied: packaging in /usr/local/lib/python3.7/site-packages (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from packaging) (3.0.6)
WARNING: You are using pip version 20.3.1; however, version 23.1.2 is available.
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
[I 01:58:35.771] 'model_deploy-0628015808':'build_push_image' - starting operation
[I 01:58:35.771] 'model_deploy-0628015808':'build_push_image' - Installing packages
[I 01:58:35.771] Package not found. Installing ipykernel package with version 5.3.0...
[I 01:58:35.772] Package not found. Installing ipython package with version 7.15.0...
[I 01:58:35.772] Package not found. Installing ipython-genutils package with version 0.2.0...
[I 01:58:35.772] Package not found. Installing jupyter-client package with version 6.1.6...
[I 01:58:35.772] Package not found. Installing jupyter-core package with version 4.6.3...
[I 01:58:35.772] Newer minio package with version 6.0.2 already installed. Skipping...
[I 01:58:35.772] Package not found. Installing nbclient package with version 0.4.1...
[I 01:58:35.772] Package not found. Installing nbconvert package with version 5.6.1...
[I 01:58:35.772] Package not found. Installing nbformat package with version 5.0.7...
[I 01:58:35.772] Package not found. Installing papermill package with version 2.1.2...
[I 01:58:35.772] Package not found. Installing pyzmq package with version 19.0.1...
[I 01:58:35.772] Package not found. Installing prompt-toolkit package with version 3.0.5...
[I 01:58:35.772] Newer requests package with version 2.25.0 already installed. Skipping...
[I 01:58:35.773] Newer tornado package with version 6.1 already installed. Skipping...
[I 01:58:35.773] Package not found. Installing traitlets package with version 4.3.3...
[I 01:58:35.773] Newer urllib3 package with version 1.26.2 already installed. Skipping...
......
WARNING: You are using pip version 20.3.1; however, version 23.1.2 is available.
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
....
[I 01:58:52.394] 'model_deploy-0628015808':'build_push_image' - Packages installed (16.623 secs)
[I 01:58:52.494] 'model_deploy-0628015808':'build_push_image' - processing dependencies
[I 01:58:52.779] 'model_deploy-0628015808':'build_push_image' - downloaded build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz from bucket: airflow, object: model_deploy-0628015808/build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz (0.285 secs)
/
Dockerfile
tar: Removing leading `/' from member names
Predictor.py
base_requirements.txt
build_push_image.py
[I 01:58:52.786] 'model_deploy-0628015808':'build_push_image' - dependencies processed (0.292 secs)
[I 01:58:52.786] 'model_deploy-0628015808':'build_push_image' - executing python script using 'python3 build_push_image.py' to 'build_push_image.log'
[E 01:58:57.482] Unexpected error: <class 'subprocess.CalledProcessError'>
[E 01:58:57.483] Error details: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1.
[I 01:58:57.538] 'model_deploy-0628015808':'build_push_image' - uploaded build_push_image.log to bucket: airflow object: model_deploy-0628015808/build_push_image.log (0.056 secs)
Traceback (most recent call last):
  File "bootstrapper.py", line 430, in <module>
    main()
  File "bootstrapper.py", line 423, in main
    file_op.execute()
  File "bootstrapper.py", line 274, in execute
    raise ex
  File "bootstrapper.py", line 261, in execute
    subprocess.run(['python3', python_script], stdout=log_file, stderr=subprocess.STDOUT, check=True)
  File "/usr/local/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1.
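The upload line just before the traceback is the key to debugging: the real error message from build_push_image.py is inside build_push_image.log in the `airflow` MinIO bucket. A hedged sketch for retrieving it with the `minio` Python client; the endpoint and credentials are the ones visible in the pod spec later in this issue, and will likely need adjusting (e.g. a port-forward) for access from outside the cluster:

```python
def fetch_build_log(local_path="build_push_image.log"):
    """Download the step's log from the MinIO bucket named in the pod log above."""
    # Lazy import: requires `pip install minio`.
    from minio import Minio

    # Endpoint and credentials as they appear in this issue's pod spec;
    # adjust for your cluster (e.g. kubectl port-forward svc/minio-ml-workshop 9000).
    client = Minio("minio-ml-workshop:9000",
                   access_key="minio", secret_key="minio123", secure=False)
    client.fget_object(
        "airflow",
        "model_deploy-0628015808/build_push_image.log",
        local_path,
    )
    return local_path
```

Reading that file should show the actual reason `build_push_image.py` exited with status 1, which the Airflow task log cannot show.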

Log in Airflow:

[2023-06-28 01:58:28,165] {taskinstance.py:903} INFO - Dependencies all met for <TaskInstance: model_deploy-0628015808.build_push_image 2023-06-27T00:00:00+00:00 [queued]>
[2023-06-28 01:58:28,407] {taskinstance.py:903} INFO - Dependencies all met for <TaskInstance: model_deploy-0628015808.build_push_image 2023-06-27T00:00:00+00:00 [queued]>

[2023-06-28 01:58:28,407] {taskinstance.py:1094} INFO -

[2023-06-28 01:58:28,407] {taskinstance.py:1095} INFO - Starting attempt 1 of 1
[2023-06-28 01:58:28,407] {taskinstance.py:1096} INFO -

[2023-06-28 01:58:28,839] {taskinstance.py:1114} INFO - Executing <Task(NotebookOp): build_push_image> on 2023-06-27T00:00:00+00:00
[2023-06-28 01:58:28,843] {standard_task_runner.py:52} INFO - Started process 412 to run task
[2023-06-28 01:58:28,849] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'model_deploy-0628015808', 'build_push_image', '2023-06-27T00:00:00+00:00', '--job-id', '27', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/gitdags/model_deploy-0628015808.py', '--cfg-path', '/tmp/tmpzr3gsi1w', '--error-file', '/tmp/tmpml3n40xm']
[2023-06-28 01:58:28,850] {standard_task_runner.py:77} INFO - Job 27: Subtask build_push_image
[2023-06-28 01:58:30,367] {logging_mixin.py:109} INFO - Running <TaskInstance: model_deploy-0628015808.build_push_image 2023-06-27T00:00:00+00:00 [running]> on host app-aflow-airflow-worker-0.app-aflow-airflow-worker-headless.ml-workshop.svc.cluster.local
[2023-06-28 01:58:32,245] {taskinstance.py:1251} INFO - Exporting the following env vars: AIRFLOW_CTX_DAG_OWNER=airflow AIRFLOW_CTX_DAG_ID=model_deploy-0628015808 AIRFLOW_CTX_TASK_ID=build_push_image AIRFLOW_CTX_EXECUTION_DATE=2023-06-27T00:00:00+00:00 AIRFLOW_CTX_DAG_RUN_ID=scheduled__2023-06-27T00:00:00+00:00
[2023-06-28 01:58:32,266] {kubernetes_pod.py:368} INFO - creating pod with labels {'dag_id': 'model_deploy-0628015808', 'task_id': 'build_push_image', 'execution_date': '2023-06-27T0000000000-63dd9c5d6', 'try_number': '1'} and launcher <airflow.providers.cncf.kubernetes.utils.pod_launcher.PodLauncher object at 0x7f4f9cd36e50>
[2023-06-28 01:58:32,302] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Pending
[2023-06-28 01:58:32,302] {pod_launcher.py:128} WARNING - Pod not yet started: build-push-image.2bc94f1f645145b2a24d8577fa5c32de
[2023-06-28 01:58:33,310] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Pending
[2023-06-28 01:58:33,310] {pod_launcher.py:128} WARNING - Pod not yet started: build-push-image.2bc94f1f645145b2a24d8577fa5c32de
[2023-06-28 01:58:34,320] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Running
[2023-06-28 01:58:34,335] {pod_launcher.py:149} INFO - % Total % Received % Xferd Average Speed Time Time Time Current
[2023-06-28 01:58:34,335] {pod_launcher.py:149} INFO - Dload Upload Total Spent Left Speed
[2023-06-28 01:58:34,335] {pod_launcher.py:149} INFO - 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 18688 100 18688 0 0 50644 0 --:--:-- --:--:-- --:--:-- 50644
[2023-06-28 01:58:34,335] {pod_launcher.py:149} INFO - % Total % Received % Xferd Average Speed Time Time Time Current
[2023-06-28 01:58:34,335] {pod_launcher.py:149} INFO - Dload Upload Total Spent Left Speed
[2023-06-28 01:58:34,336] {pod_launcher.py:149} INFO - 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 1282 100 1282 0 0 11446 0 --:--:-- --:--:-- --:--:-- 11446
[2023-06-28 01:58:34,336] {pod_launcher.py:149} INFO - Requirement already satisfied: packaging in /usr/local/lib/python3.7/site-packages (21.3)
[2023-06-28 01:58:34,336] {pod_launcher.py:149} INFO - Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from packaging) (3.0.6)
[2023-06-28 01:58:35,213] {pod_launcher.py:149} INFO - WARNING: You are using pip version 20.3.1; however, version 23.1.2 is available.
[2023-06-28 01:58:35,214] {pod_launcher.py:149} INFO - You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.771] 'model_deploy-0628015808':'build_push_image' - starting operation
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.771] 'model_deploy-0628015808':'build_push_image' - Installing packages
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.771] Package not found. Installing ipykernel package with version 5.3.0...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing ipython package with version 7.15.0...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing ipython-genutils package with version 0.2.0...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing jupyter-client package with version 6.1.6...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing jupyter-core package with version 4.6.3...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Newer minio package with version 6.0.2 already installed. Skipping...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing nbclient package with version 0.4.1...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing nbconvert package with version 5.6.1...
[2023-06-28 01:58:35,774] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing nbformat package with version 5.0.7...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing papermill package with version 2.1.2...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing pyzmq package with version 19.0.1...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.772] Package not found. Installing prompt-toolkit package with version 3.0.5...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.772] Newer requests package with version 2.25.0 already installed. Skipping...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.773] Newer tornado package with version 6.1 already installed. Skipping...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.773] Package not found. Installing traitlets package with version 4.3.3...
[2023-06-28 01:58:35,775] {pod_launcher.py:149} INFO - [I 01:58:35.773] Newer urllib3 package with version 1.26.2 already installed. Skipping...
.......
[2023-06-28 01:58:51,789] {pod_launcher.py:149} INFO - Successfully installed ansiwrap-0.8.4 async-generator-1.10 backcall-0.2.0 black-23.3.0 bleach-6.0.0 click-8.1.3 decorator-5.1.1 defusedxml-0.7.1 ipykernel-5.3.0 ipython-7.15.0 ipython-genutils-0.2.0 jedi-0.18.2 jupyter-client-6.1.6 jupyter-core-4.6.3 mistune-0.8.4 mypy-extensions-1.0.0 nbclient-0.4.1 nbconvert-5.6.1 nbformat-5.0.7 nest-asyncio-1.5.6 packaging-23.1 pandocfilters-1.5.0 papermill-2.1.2 parso-0.8.3 pathspec-0.11.1 pexpect-4.8.0 pickleshare-0.7.5 platformdirs-3.8.0 prompt-toolkit-3.0.5 ptyprocess-0.7.0 pygments-2.15.1 pyzmq-19.0.1 tenacity-8.2.2 testpath-0.6.0 textwrap3-0.9.2 tomli-2.0.1 tqdm-4.65.0 traitlets-4.3.3 typed-ast-1.5.4 typing-extensions-4.6.3 wcwidth-0.2.6 webencodings-0.5.1
[2023-06-28 01:58:51,825] {pod_launcher.py:149} INFO - WARNING: You are using pip version 20.3.1; however, version 23.1.2 is available.
[2023-06-28 01:58:51,826] {pod_launcher.py:149} INFO - You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
....
[2023-06-28 01:58:52,373] {pod_launcher.py:149} INFO - Werkzeug==1.0.1
[2023-06-28 01:58:52,373] {pod_launcher.py:149} INFO - zipp==3.4.0
[2023-06-28 01:58:52,395] {pod_launcher.py:149} INFO - [I 01:58:52.394] 'model_deploy-0628015808':'build_push_image' - Packages installed (16.623 secs)
[2023-06-28 01:58:52,496] {pod_launcher.py:149} INFO - [I 01:58:52.494] 'model_deploy-0628015808':'build_push_image' - processing dependencies
[2023-06-28 01:58:52,780] {pod_launcher.py:149} INFO - [I 01:58:52.779] 'model_deploy-0628015808':'build_push_image' - downloaded build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz from bucket: airflow, object: model_deploy-0628015808/build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz (0.285 secs)
[2023-06-28 01:58:52,786] {pod_launcher.py:149} INFO - /
[2023-06-28 01:58:52,786] {pod_launcher.py:149} INFO - Dockerfile
[2023-06-28 01:58:52,786] {pod_launcher.py:149} INFO - tar: Removing leading `/' from member names
[2023-06-28 01:58:52,786] {pod_launcher.py:149} INFO - Predictor.py
[2023-06-28 01:58:52,787] {pod_launcher.py:149} INFO - base_requirements.txt
[2023-06-28 01:58:52,787] {pod_launcher.py:149} INFO - build_push_image.py
[2023-06-28 01:58:52,787] {pod_launcher.py:149} INFO - [I 01:58:52.786] 'model_deploy-0628015808':'build_push_image' - dependencies processed (0.292 secs)
[2023-06-28 01:58:52,787] {pod_launcher.py:149} INFO - [I 01:58:52.786] 'model_deploy-0628015808':'build_push_image' - executing python script using 'python3 build_push_image.py' to 'build_push_image.log'
[2023-06-28 01:58:57,483] {pod_launcher.py:149} INFO - [E 01:58:57.482] Unexpected error: <class 'subprocess.CalledProcessError'>
[2023-06-28 01:58:57,484] {pod_launcher.py:149} INFO - [E 01:58:57.483] Error details: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1.
[2023-06-28 01:58:57,539] {pod_launcher.py:149} INFO - [I 01:58:57.538] 'model_deploy-0628015808':'build_push_image' - uploaded build_push_image.log to bucket: airflow object: model_deploy-0628015808/build_push_image.log (0.056 secs)
[2023-06-28 01:58:57,547] {pod_launcher.py:149} INFO - Traceback (most recent call last):
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - File "bootstrapper.py", line 430, in <module>
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - main()
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - File "bootstrapper.py", line 423, in main
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - file_op.execute()
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - File "bootstrapper.py", line 274, in execute
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - raise ex
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - File "bootstrapper.py", line 261, in execute
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - subprocess.run(['python3', python_script], stdout=log_file, stderr=subprocess.STDOUT, check=True)
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - File "/usr/local/lib/python3.7/subprocess.py", line 512, in run
[2023-06-28 01:58:57,548] {pod_launcher.py:149} INFO - output=stdout, stderr=stderr)
[2023-06-28 01:58:57,549] {pod_launcher.py:149} INFO - subprocess.CalledProcessError: Command '['python3', 'build_push_image.py']' returned non-zero exit status 1.
[2023-06-28 01:59:01,568] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Running
[2023-06-28 01:59:01,568] {pod_launcher.py:171} INFO - Pod build-push-image.2bc94f1f645145b2a24d8577fa5c32de has state running
[2023-06-28 01:59:03,579] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Failed
[2023-06-28 01:59:03,579] {pod_launcher.py:308} ERROR - Event with job id build-push-image.2bc94f1f645145b2a24d8577fa5c32de Failed
[2023-06-28 01:59:03,587] {pod_launcher.py:198} INFO - Event: build-push-image.2bc94f1f645145b2a24d8577fa5c32de had an event of type Failed
[2023-06-28 01:59:03,587] {pod_launcher.py:308} ERROR - Event with job id build-push-image.2bc94f1f645145b2a24d8577fa5c32de Failed
[2023-06-28 01:59:03,965] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 371, in execute
    raise AirflowException(f'Pod {self.pod.metadata.name} returned a failure: {remote_pod}')
airflow.exceptions.AirflowException: Pod build-push-image.2bc94f1f645145b2a24d8577fa5c32de returned a failure: {'api_version': 'v1', 'kind': 'Pod', 'metadata': {'annotations': {'cni.projectcalico.org/containerID': '6eed1d1530a24956ba1d2ae739fd0d751ffca0ceb08053b23bf5a7f4234487fc', 'cni.projectcalico.org/podIP': '', 'cni.projectcalico.org/podIPs': '', 'kubernetes.io/limit-ranger': 'LimitRanger ' 'plugin set: cpu, ' 'memory request ' 'for container ' 'base; cpu, memory ' 'limit for ' 'container base'}, 'cluster_name': None, 'creation_timestamp': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal()), 'deletion_grace_period_seconds': None, 'deletion_timestamp': None, 'finalizers': None, 'generate_name': None, 'generation': None, 'initializers': None, 'labels': {'airflow_version': '2.1.3', 'dag_id': 
'model_deploy-0628015808', 'execution_date': '2023-06-27T0000000000-63dd9c5d6', 'kubernetes_pod_operator': 'True', 'task_id': 'build_push_image', 'try_number': '1'}, 'managed_fields': [{'api_version': 'v1', 'fields': None, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal())}, {'api_version': 'v1', 'fields': None, 'manager': 'calico', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 59, 1, tzinfo=tzlocal())}, {'api_version': 'v1', 'fields': None, 'manager': 'kubelet', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 59, 2, tzinfo=tzlocal())}], 'name': 'build-push-image.2bc94f1f645145b2a24d8577fa5c32de', 'namespace': 'ml-workshop', 'owner_references': None, 'resource_version': '476630', 'self_link': None, 'uid': '1fa4dfdf-fa59-470c-a580-65d316edd9cf'}, 'spec': {'active_deadline_seconds': None, 'affinity': {'node_affinity': None, 'pod_affinity': None, 'pod_anti_affinity': None}, 'automount_service_account_token': None, 'containers': [{'args': ['mkdir -p ./jupyter-work-dir/ && cd ' './jupyter-work-dir/ && curl -H ' "'Cache-Control: no-cache' -L " 'https://raw.githubusercontent.com/elyra-ai/airflow-notebook/v0.0.7/etc/docker-scripts/bootstrapper.py ' '--output bootstrapper.py && curl -H ' "'Cache-Control: no-cache' -L " 'https://raw.githubusercontent.com/elyra-ai/airflow-notebook/v0.0.7/etc/requirements-elyra.txt ' '--output requirements-elyra.txt && python3 ' '-m pip install packaging && python3 -m pip ' 'freeze > requirements-current.txt && ' 'python3 bootstrapper.py --cos-endpoint ' 'http://minio-ml-workshop:9000// ' '--cos-bucket airflow --cos-directory ' "'model_deploy-0628015808' " '--cos-dependencies-archive ' "'build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz' " '--file ' "'Machine-Learning-on-Kubernetes/Chapter07/model_deploy_pipeline/model_build_push/build_push_image.py' "], 'command': ['sh', '-c'], 'env': [{'name': 'AWS_ACCESS_KEY_ID', 'value': 
'minio', 'value_from': None}, {'name': 'AWS_SECRET_ACCESS_KEY', 'value': 'minio123', 'value_from': None}, {'name': 'ELYRA_ENABLE_PIPELINE_INFO', 'value': 'True', 'value_from': None}, {'name': 'MODEL_NAME', 'value': 'mlflowdemo', 'value_from': None}, {'name': 'MODEL_VERSION', 'value': '1', 'value_from': None}, {'name': 'CONTAINER_REGISTRY', 'value': 'https://quay.io/', 'value_from': None}, {'name': 'CONTAINER_REGISTRY_USER', 'value': '', 'value_from': None}, {'name': 'CONTAINER_REGISTRY_PASSWORD', 'value': '', 'value_from': None}, {'name': 'CONTAINER_DETAILS', 'value': '', 'value_from': None}], 'env_from': None, 'image': 'quay.io/ml-on-k8s/kaniko-container-builder:1.0.0', 'image_pull_policy': 'Never', 'lifecycle': None, 'liveness_probe': None, 'name': 'base', 'ports': None, 'readiness_probe': None, 'resources': {'limits': {'cpu': '1', 'memory': '1Gi'}, 'requests': {'cpu': '100m', 'memory': '500Mi'}}, 'security_context': None, 'stdin': None, 'stdin_once': None, 'termination_message_path': '/dev/termination-log', 'termination_message_policy': 'File', 'tty': None, 'volume_devices': None, 'volume_mounts': [{'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount', 'mount_propagation': None, 'name': 'kube-api-access-68j6b', 'read_only': True, 'sub_path': None, 'sub_path_expr': None}], 'working_dir': None}], 'dns_config': None, 'dns_policy': 'ClusterFirst', 'enable_service_links': True, 'host_aliases': None, 'host_ipc': None, 'host_network': None, 'host_pid': None, 'hostname': None, 'image_pull_secrets': None, 'init_containers': None, 'node_name': 'k8s-master', 'node_selector': None, 'preemption_policy': 'PreemptLowerPriority', 'priority': 0, 'priority_class_name': None, 'readiness_gates': None, 'restart_policy': 'Never', 'runtime_class_name': None, 'scheduler_name': 'default-scheduler', 'security_context': {'fs_group': None, 'run_as_group': None, 'run_as_non_root': None, 'run_as_user': None, 'se_linux_options': None, 'supplemental_groups': None, 'sysctls': None, 
'windows_options': None}, 'service_account': 'default', 'service_account_name': 'default', 'share_process_namespace': None, 'subdomain': None, 'termination_grace_period_seconds': 30, 'tolerations': [{'effect': 'NoExecute', 'key': 'node.kubernetes.io/not-ready', 'operator': 'Exists', 'toleration_seconds': 300, 'value': None}, {'effect': 'NoExecute', 'key': 'node.kubernetes.io/unreachable', 'operator': 'Exists', 'toleration_seconds': 300, 'value': None}], 'volumes': [{'aws_elastic_block_store': None, 'azure_disk': None, 'azure_file': None, 'cephfs': None, 'cinder': None, 'config_map': None, 'csi': None, 'downward_api': None, 'empty_dir': None, 'fc': None, 'flex_volume': None, 'flocker': None, 'gce_persistent_disk': None, 'git_repo': None, 'glusterfs': None, 'host_path': None, 'iscsi': None, 'name': 'kube-api-access-68j6b', 'nfs': None, 'persistent_volume_claim': None, 'photon_persistent_disk': None, 'portworx_volume': None, 'projected': {'default_mode': 420, 'sources': [{'config_map': None, 'downward_api': None, 'secret': None, 'service_account_token': {'audience': None, 'expiration_seconds': 3607, 'path': 'token'}}, {'config_map': {'items': [{'key': 'ca.crt', 'mode': None, 'path': 'ca.crt'}], 'name': 'kube-root-ca.crt', 'optional': None}, 'downward_api': None, 'secret': None, 'service_account_token': None}, {'config_map': None, 'downward_api': {'items': [{'field_ref': {'api_version': 'v1', 'field_path': 'metadata.namespace'}, 'mode': None, 'path': 'namespace', 'resource_field_ref': None}]}, 'secret': None, 'service_account_token': None}]}, 'quobyte': None, 'rbd': None, 'scale_io': None, 'secret': None, 'storageos': None, 'vsphere_volume': None}]}, 'status': {'conditions': [{'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal()), 'message': None, 'reason': None, 'status': 'True', 'type': 'Initialized'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 59, tzinfo=tzlocal()), 
'message': None, 'reason': 'PodFailed', 'status': 'False', 'type': 'Ready'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 59, tzinfo=tzlocal()), 'message': None, 'reason': 'PodFailed', 'status': 'False', 'type': 'ContainersReady'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal()), 'message': None, 'reason': None, 'status': 'True', 'type': 'PodScheduled'}], 'container_statuses': [{'container_id': 'containerd://f98598d2b0c14cc945fd8631b81fd17d1310bcd2fd931f143007a286e3db232f', 'image': 'quay.io/ml-on-k8s/kaniko-container-builder:1.0.0', 'image_id': 'quay.io/ml-on-k8s/kaniko-container-builder@sha256:7204881d5ba9c83f8b5b5580ef716c91d374f728c003faa3af9f1bd047e8535e', 'last_state': {'running': None, 'terminated': None, 'waiting': None}, 'name': 'base', 'ready': False, 'restart_count': 0, 'state': {'running': None, 'terminated': {'container_id': 'containerd://f98598d2b0c14cc945fd8631b81fd17d1310bcd2fd931f143007a286e3db232f', 'exit_code': 1, 'finished_at': datetime.datetime(2023, 6, 28, 1, 58, 57, tzinfo=tzlocal()), 'message': None, 'reason': 'Error', 'signal': None, 'started_at': datetime.datetime(2023, 6, 28, 1, 58, 33, tzinfo=tzlocal())}, 'waiting': None}}], 'host_ip': '10.1.0.4', 'init_container_statuses': None, 'message': None, 'nominated_node_name': None, 'phase': 'Failed', 'pod_ip': '10.233.123.99', 'qos_class': 'Burstable', 'reason': None, 'start_time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal())}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
    result = task_copy.execute(context=context)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 376, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: Pod build-push-image.2bc94f1f645145b2a24d8577fa5c32de returned a failure: {'api_version': 'v1', 'kind': 'Pod', 'metadata': {'annotations': {'cni.projectcalico.org/containerID': '6eed1d1530a24956ba1d2ae739fd0d751ffca0ceb08053b23bf5a7f4234487fc', 'cni.projectcalico.org/podIP': '', 'cni.projectcalico.org/podIPs': '', 'kubernetes.io/limit-ranger': 'LimitRanger ' 'plugin set: cpu, ' 'memory request ' 'for container ' 'base; cpu, memory ' 'limit for ' 'container base'}, 'cluster_name': None, 'creation_timestamp': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal()), 'deletion_grace_period_seconds': None, 'deletion_timestamp': None, 'finalizers': None, 'generate_name': None, 'generation': None, 'initializers': None, 'labels': {'airflow_version': '2.1.3', 'dag_id': 'model_deploy-0628015808', 'execution_date': '2023-06-27T0000000000-63dd9c5d6', 'kubernetes_pod_operator': 'True', 'task_id': 'build_push_image', 'try_number': '1'}, 'managed_fields': [{'api_version': 'v1', 'fields': None, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal())}, {'api_version': 'v1', 'fields': None, 
'manager': 'calico', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 59, 1, tzinfo=tzlocal())}, {'api_version': 'v1', 'fields': None, 'manager': 'kubelet', 'operation': 'Update', 'time': datetime.datetime(2023, 6, 28, 1, 59, 2, tzinfo=tzlocal())}], 'name': 'build-push-image.2bc94f1f645145b2a24d8577fa5c32de', 'namespace': 'ml-workshop', 'owner_references': None, 'resource_version': '476630', 'self_link': None, 'uid': '1fa4dfdf-fa59-470c-a580-65d316edd9cf'}, 'spec': {'active_deadline_seconds': None, 'affinity': {'node_affinity': None, 'pod_affinity': None, 'pod_anti_affinity': None}, 'automount_service_account_token': None, 'containers': [{'args': ['mkdir -p ./jupyter-work-dir/ && cd ' './jupyter-work-dir/ && curl -H ' "'Cache-Control: no-cache' -L " 'https://raw.githubusercontent.com/elyra-ai/airflow-notebook/v0.0.7/etc/docker-scripts/bootstrapper.py ' '--output bootstrapper.py && curl -H ' "'Cache-Control: no-cache' -L " 'https://raw.githubusercontent.com/elyra-ai/airflow-notebook/v0.0.7/etc/requirements-elyra.txt ' '--output requirements-elyra.txt && python3 ' '-m pip install packaging && python3 -m pip ' 'freeze > requirements-current.txt && ' 'python3 bootstrapper.py --cos-endpoint ' 'http://minio-ml-workshop:9000// ' '--cos-bucket airflow --cos-directory ' "'model_deploy-0628015808' " '--cos-dependencies-archive ' "'build_push_image-30e0375e-66aa-4f9a-994c-77e7814be449.tar.gz' " '--file ' "'Machine-Learning-on-Kubernetes/Chapter07/model_deploy_pipeline/model_build_push/build_push_image.py' "], 'command': ['sh', '-c'], 'env': [{'name': 'AWS_ACCESS_KEY_ID', 'value': 'minio', 'value_from': None}, {'name': 'AWS_SECRET_ACCESS_KEY', 'value': 'minio123', 'value_from': None}, {'name': 'ELYRA_ENABLE_PIPELINE_INFO', 'value': 'True', 'value_from': None}, {'name': 'MODEL_NAME', 'value': 'mlflowdemo', 'value_from': None}, {'name': 'MODEL_VERSION', 'value': '1', 'value_from': None}, {'name': 'CONTAINER_REGISTRY', 'value': 'https://quay.io/', 'value_from': 
None}, {'name': 'CONTAINER_REGISTRY_USER', 'value': '', 'value_from': None}, {'name': 'CONTAINER_REGISTRY_PASSWORD', 'value': '', 'value_from': None}, {'name': 'CONTAINER_DETAILS', 'value': '', 'value_from': None}], 'env_from': None, 'image': 'quay.io/ml-on-k8s/kaniko-container-builder:1.0.0', 'image_pull_policy': 'Never', 'lifecycle': None, 'liveness_probe': None, 'name': 'base', 'ports': None, 'readiness_probe': None, 'resources': {'limits': {'cpu': '1', 'memory': '1Gi'}, 'requests': {'cpu': '100m', 'memory': '500Mi'}}, 'security_context': None, 'stdin': None, 'stdin_once': None, 'termination_message_path': '/dev/termination-log', 'termination_message_policy': 'File', 'tty': None, 'volume_devices': None, 'volume_mounts': [{'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount', 'mount_propagation': None, 'name': 'kube-api-access-68j6b', 'read_only': True, 'sub_path': None, 'sub_path_expr': None}], 'working_dir': None}], 'dns_config': None, 'dns_policy': 'ClusterFirst', 'enable_service_links': True, 'host_aliases': None, 'host_ipc': None, 'host_network': None, 'host_pid': None, 'hostname': None, 'image_pull_secrets': None, 'init_containers': None, 'node_name': 'k8s-master', 'node_selector': None, 'preemption_policy': 'PreemptLowerPriority', 'priority': 0, 'priority_class_name': None, 'readiness_gates': None, 'restart_policy': 'Never', 'runtime_class_name': None, 'scheduler_name': 'default-scheduler', 'security_context': {'fs_group': None, 'run_as_group': None, 'run_as_non_root': None, 'run_as_user': None, 'se_linux_options': None, 'supplemental_groups': None, 'sysctls': None, 'windows_options': None}, 'service_account': 'default', 'service_account_name': 'default', 'share_process_namespace': None, 'subdomain': None, 'termination_grace_period_seconds': 30, 'tolerations': [{'effect': 'NoExecute', 'key': 'node.kubernetes.io/not-ready', 'operator': 'Exists', 'toleration_seconds': 300, 'value': None}, {'effect': 'NoExecute', 'key': 
'node.kubernetes.io/unreachable', 'operator': 'Exists', 'toleration_seconds': 300, 'value': None}], 'volumes': [{'aws_elastic_block_store': None, 'azure_disk': None, 'azure_file': None, 'cephfs': None, 'cinder': None, 'config_map': None, 'csi': None, 'downward_api': None, 'empty_dir': None, 'fc': None, 'flex_volume': None, 'flocker': None, 'gce_persistent_disk': None, 'git_repo': None, 'glusterfs': None, 'host_path': None, 'iscsi': None, 'name': 'kube-api-access-68j6b', 'nfs': None, 'persistent_volume_claim': None, 'photon_persistent_disk': None, 'portworx_volume': None, 'projected': {'default_mode': 420, 'sources': [{'config_map': None, 'downward_api': None, 'secret': None, 'service_account_token': {'audience': None, 'expiration_seconds': 3607, 'path': 'token'}}, {'config_map': {'items': [{'key': 'ca.crt', 'mode': None, 'path': 'ca.crt'}], 'name': 'kube-root-ca.crt', 'optional': None}, 'downward_api': None, 'secret': None, 'service_account_token': None}, {'config_map': None, 'downward_api': {'items': [{'field_ref': {'api_version': 'v1', 'field_path': 'metadata.namespace'}, 'mode': None, 'path': 'namespace', 'resource_field_ref': None}]}, 'secret': None, 'service_account_token': None}]}, 'quobyte': None, 'rbd': None, 'scale_io': None, 'secret': None, 'storageos': None, 'vsphere_volume': None}]}, 'status': {'conditions': [{'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal()), 'message': None, 'reason': None, 'status': 'True', 'type': 'Initialized'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 59, tzinfo=tzlocal()), 'message': None, 'reason': 'PodFailed', 'status': 'False', 'type': 'Ready'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 59, tzinfo=tzlocal()), 'message': None, 'reason': 'PodFailed', 'status': 'False', 'type': 'ContainersReady'}, {'last_probe_time': None, 'last_transition_time': datetime.datetime(2023, 6, 28, 1, 58, 32, 
tzinfo=tzlocal()), 'message': None, 'reason': None, 'status': 'True', 'type': 'PodScheduled'}], 'container_statuses': [{'container_id': 'containerd://f98598d2b0c14cc945fd8631b81fd17d1310bcd2fd931f143007a286e3db232f', 'image': 'quay.io/ml-on-k8s/kaniko-container-builder:1.0.0', 'image_id': 'quay.io/ml-on-k8s/kaniko-container-builder@sha256:7204881d5ba9c83f8b5b5580ef716c91d374f728c003faa3af9f1bd047e8535e', 'last_state': {'running': None, 'terminated': None, 'waiting': None}, 'name': 'base', 'ready': False, 'restart_count': 0, 'state': {'running': None, 'terminated': {'container_id': 'containerd://f98598d2b0c14cc945fd8631b81fd17d1310bcd2fd931f143007a286e3db232f', 'exit_code': 1, 'finished_at': datetime.datetime(2023, 6, 28, 1, 58, 57, tzinfo=tzlocal()), 'message': None, 'reason': 'Error', 'signal': None, 'started_at': datetime.datetime(2023, 6, 28, 1, 58, 33, tzinfo=tzlocal())}, 'waiting': None}}], 'host_ip': '10.1.0.4', 'init_container_statuses': None, 'message': None, 'nominated_node_name': None, 'phase': 'Failed', 'pod_ip': '10.233.123.99', 'qos_class': 'Burstable', 'reason': None, 'start_time': datetime.datetime(2023, 6, 28, 1, 58, 32, tzinfo=tzlocal())}} [2023-06-28 01:59:03,982] {taskinstance.py:1505} INFO - Marking task as FAILED. dag_id=model_deploy-0628015808, task_id=build_push_image, execution_date=20230627T000000, start_date=20230628T015828, end_date=20230628T015903 [2023-06-28 01:59:05,176] {local_task_job.py:151} INFO - Task exited with return code 1 [2023-06-28 01:59:05,994] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check

The contents of build_push_image.py:

import string
import subprocess
import os
import base64
import mlflow
from minio import Minio
from mlflow.tracking import MlflowClient

"""
    This script assumes that the /kaniko/.docker/config.json has the correct repo and associated credentials mounted
    It also expects the these env variables has been set
    CONTAINER_REGISTRY is the resitry server like quay.io
    CONTAINER_DETAILS is the container coordinates like ml-on-k8s/containermodel:1.0.0
    AWS_SECRET_ACCESS_KEY is the password for the S3 store
    MODEL_NAME is hte name of the model in mlflow
    MODEL_VERSION is the version of the model in mlflow
"""

os.environ['MLFLOW_S3_ENDPOINT_URL']='http://minio-ml-workshop:9000'
os.environ['AWS_ACCESS_KEY_ID']='minio'
os.environ['AWS_REGION']='us-east-1'
os.environ['AWS_BUCKET_NAME']='mlflow'

HOST = "http://mlflow:5500"

model_name = os.environ["MODEL_NAME"]
model_version = os.environ["MODEL_VERSION"]
build_name = f"seldon-model-{model_name}-v{model_version}"

auth_encoded = string.Template("$CONTAINER_REGISTRY_USER:$CONTAINER_REGISTRY_PASSWORD").substitute(os.environ)
os.environ["CONTAINER_REGISTRY_CREDS"] = base64.b64encode(auth_encoded.encode("ascii")).decode("ascii")

print(auth_encoded)  # NOTE: this prints the raw registry credentials to the task log

docker_auth = string.Template('{"auths":{"$CONTAINER_REGISTRY":{"auth":"$CONTAINER_REGISTRY_CREDS"}}}').substitute(os.environ)
print(docker_auth)
f = open("/kaniko/.docker/config.json", "w")
f.write(docker_auth)
f.close()

def get_s3_server():
    minioClient = Minio('minio-ml-workshop:9000',
                        access_key=os.environ['AWS_ACCESS_KEY_ID'],
                        secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                        secure=False)

    return minioClient

def init():
    mlflow.set_tracking_uri(HOST)

def download_artifacts():
    print("retrieving model metadata from mlflow...")
    # model = mlflow.pyfunc.load_model(
    #     model_uri=f"models:/{model_name}/{model_version}"
    # )
    client = MlflowClient()

    model = client.get_registered_model(model_name)

    print(model)

    run_id = model._latest_version[0].run_id
    source = model._latest_version[0].source
    experiment_id = "1" # to be calculated from the source which is source='s3://mlflow/1/bf721e5641394ed6866baf20131fca20/artifacts/model'

    print("initializing connection to s3 server...")
    minioClient = get_s3_server()

    #     artifact_location = mlflow.get_experiment_by_name('rossdemo').artifact_location
    #     print("downloading artifacts from s3 bucket " + artifact_location)

    data_file_model = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/model.pkl", "model.pkl")
    data_file_requirements = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/requirements.txt", "requirements.txt")

    #Using boto3 Download the files from mlflow, the file path is in the model meta
    #write the files to the file system
    print("download successful")

    return run_id

def build_push_image():
    container_location = string.Template("$CONTAINER_REGISTRY/$CONTAINER_DETAILS").substitute(os.environ)

    #For docker repo, do not include the registry domain name in container location
    if os.environ["CONTAINER_REGISTRY"].find("docker.io") != -1:
        container_location= os.environ["CONTAINER_DETAILS"]

    full_command = "/kaniko/executor --context=" + os.getcwd() + " --dockerfile=Dockerfile --verbosity=debug --cache=true --single-snapshot=true --destination=" + container_location
    print(full_command)
    process = subprocess.run(full_command, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(process.stdout)
    print(process.stderr)

    # Redundant second invocation (the subprocess.run above already executes kaniko);
    # the stray extra space of indentation here also caused an IndentationError.
    # print(subprocess.check_output(['/kaniko/executor', '--context', '/workspace', '--dockerfile', 'Dockerfile', '--destination', container_location]))

init()
download_artifacts()
build_push_image()
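The `experiment_id = "1"` line above carries a TODO comment saying the ID should be calculated from the model's `source` URI. A minimal sketch of that calculation, assuming the source always has the shape `s3://<bucket>/<experiment_id>/<run_id>/artifacts/model` shown in the comment (the helper name `parse_model_source` is mine, not from the book):

```python
from urllib.parse import urlparse


def parse_model_source(source):
    """Split an mlflow artifact URI into (experiment_id, run_id)."""
    # urlparse gives scheme='s3', netloc='mlflow', path='/<exp_id>/<run_id>/artifacts/model'
    parsed = urlparse(source)
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"unexpected source format: {source}")
    return parts[0], parts[1]


experiment_id, run_id = parse_model_source(
    "s3://mlflow/1/bf721e5641394ed6866baf20131fca20/artifacts/model"
)
print(experiment_id, run_id)  # -> 1 bf721e5641394ed6866baf20131fca20
```

With this, `download_artifacts()` would not need the hard-coded experiment ID, since the script already reads `source` from `model._latest_version[0].source`.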
aquynh1682 commented 1 year ago

I just checked again, and I realized that the issue is not with the Python file itself. It seems to be related to the subprocess module. It is unable to execute the Python file.

aquynh1682 commented 1 year ago

From my investigation, it does seem that the issue lies in this Python file itself. But how can I determine what exactly is causing the error?

webmakaka commented 1 year ago

Hi, @aquynh1682 Can you execute?

$ kubectl get pods -n ml-workshop

I'm just asking; I don't think this alone will help.

aquynh1682 commented 1 year ago

Hi, @webmakaka Thank you for asking. Here is the result when running the command: $ kubectl get pods -n ml-workshop

image

webmakaka commented 1 year ago

Did you change the original manifests? Or did everything work out of the box? (I had issues last time.)

=============================

Did you create an empty airflow-dags repo with a main branch? And did you generate a token to push DAGs to GitHub?

========= Maybe my notes could be helpful. https://github.com/webmakaka/Machine-Learning-on-Kubernetes/blob/main/docs/07-model-deployment-and-automation.md

aquynh1682 commented 1 year ago

Hi @webmakaka,

Yes, I have modified the manifests so that everything works properly now. And where did you run into issues?

===========

Yes, I have completed all those steps and I can see that it's still working fine. Currently, I suspect that the error might be due to the inability to push images to my registry using Kaniko.

Image of the GitHub airflow-dags repo with branch main: image

Image of Airflow syncing to GitHub: image

==============

I have checked, and currently, it is not helpful to me either. :)))

webmakaka commented 1 year ago

OK! Can you share your manifests and information about your versions of Kubernetes and operator-lifecycle-manager?

I had issue with running airflow.

https://github.com/PacktPublishing/Machine-Learning-on-Kubernetes/issues/10#issuecomment-1397792510

aquynh1682 commented 1 year ago

I am using kubernetes version 1.26.6 and olm version v0.20.0.

Oh, I also encountered a similar error before. When I checked, it showed an error connecting to the database. Try running this command to see what it reports: kubectl logs -f -n ml-workshop <pod name> -c <container name>.

aquynh1682 commented 1 year ago

I tried running it locally before running it on Airflow, and it reported that the /kaniko/.docker/config.json file was missing. And when I inspected the image quay.io/ml-on-k8s/kaniko-container-builder:1.0.0, ironically, it didn't contain the files needed to run (there wasn't even a Dockerfile in /workspace), even though build_push_image.py requires those files, lol :)))). I think the Airflow part of Chapter 7 should be temporarily skipped until a more reliable image version is available (or maybe never).

Screenshots showing config.json and Dockerfile not found:

image

Screenshots of the Python file opening config.json and the Dockerfile:

image

image

webmakaka commented 1 year ago

Same problem as here?

aquynh1682 commented 1 year ago

That's right.

webmakaka commented 1 year ago

Maybe you should check whether this file exists in quay.io/ml-on-k8s/kaniko-container-builder:1.0.0?

aquynh1682 commented 1 year ago

Since my earlier screenshots were incomplete: I inspected the image quay.io/ml-on-k8s/kaniko-container-builder:1.0.0 and the file is completely missing.

image

webmakaka commented 1 year ago

I think the /kaniko/.docker/config.json file will be created when you authenticate to your registry. My local config.json, for example:

Screenshot from 2023-06-29 11-00-46

You specify some creds in Environment Variables

MODEL_NAME=mlflowdemo
MODEL_VERSION=1
CONTAINER_REGISTRY=https://index.docker.io/v1/
CONTAINER_REGISTRY_USER=username
CONTAINER_REGISTRY_PASSWORD=mypassword
CONTAINER_DETAILS=webmakaka/mlflowdemo:latest

Everything worked for this chapter a year ago. Maybe there were some mistakes in the book (I don't remember) and everything will work after reading the next chapter.
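For reference, here is a minimal sketch of what the script's config.json generation produces for environment values like the ones above. The username and password are placeholders, not real credentials; the `https://index.docker.io/v1/` key is the classic Docker Hub auth endpoint used in config.json, while for quay.io the key would simply be `quay.io`:

```python
import base64
import json


def docker_config(registry, user, password):
    """Build the JSON kaniko expects at /kaniko/.docker/config.json."""
    # The auth field is base64("user:password"), same as `docker login` writes.
    creds = base64.b64encode(f"{user}:{password}".encode("ascii")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": creds}}})


cfg = docker_config("https://index.docker.io/v1/", "username", "mypassword")
print(cfg)
# Decoding the auth field recovers "username:mypassword"
```

The key of the `auths` entry must match the registry kaniko resolves from the image destination, which is why a mismatched or scheme-prefixed registry value breaks the push permission check.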

aquynh1682 commented 1 year ago

Hmm, I will try to run it again and investigate why I encountered the error. Thank you very much.

aquynh1682 commented 1 year ago

I just discovered that when the Python file fails, its log is pushed to Minio. With this log file, I was able to complete Chapter 9. Chapter 7 is similar, but it has more errors that I'm too lazy to fix. So, if anyone has completed Chapter 7 fully, please guide me through it. :))))

image

image

webmakaka commented 1 year ago

Hi, @aquynh1682

Can you show information about opendatahub-operator from your stand?

I have an issue with it. It keeps updating to versions that error out.

$ kubectl get pods -n operators
NAME                                                       READY   STATUS             RESTARTS      AGE
opendatahub-operator-controller-manager-79f79b7b5f-9vv9d   1/2     CrashLoopBackOff   6 (46s ago)   9m33s
$ kubectl logs opendatahub-operator-controller-manager-79f79b7b5f-9vv9d -n operators
2023-06-30T10:17:59.483Z    INFO    controller-runtime.metrics  metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2023-06-30T10:17:59.483Z    INFO    controllers.KfDef   Adding controller for kfdef.
2023-06-30T10:17:59.484Z    INFO    secret-generator    Adding controller for Secret Generation.
2023-06-30T10:17:59.484Z    INFO    setup   starting manager
I0630 10:17:59.484476       1 leaderelection.go:243] attempting to acquire leader lease operators/kfdef-controller...
2023-06-30T10:17:59.484Z    INFO    starting metrics server {"path": "/metrics"}
I0630 10:18:15.245800       1 leaderelection.go:253] successfully acquired lease operators/kfdef-controller
2023-06-30T10:18:15.246Z    INFO    controller.secret-generator-controller  Starting EventSource    {"reconciler group": "", "reconciler kind": "Secret", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.secret-generator-controller  Starting EventSource    {"reconciler group": "", "reconciler kind": "Secret", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.secret-generator-controller  Starting Controller {"reconciler group": "", "reconciler kind": "Secret"}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.245Z    DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"operators","name":"kfdef-controller","uid":"f0fac8fa-4795-4030-89da-b1a5b598b3ac","apiVersion":"v1","resourceVersion":"10172"}, "reason": "LeaderElection", "message": "opendatahub-operator-controller-manager-79f79b7b5f-9vv9d_015da768-d792-4f12-89a2-0a7a8ddf06ec became leader"}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"operators","name":"kfdef-controller","uid":"573eae8d-c509-4a62-ad40-f2492672abd9","apiVersion":"coordination.k8s.io/v1","resourceVersion":"10173"}, "reason": "LeaderElection", "message": "opendatahub-operator-controller-manager-79f79b7b5f-9vv9d_015da768-d792-4f12-89a2-0a7a8ddf06ec became leader"}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting EventSource    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "source": "kind source: /, Kind="}
2023-06-30T10:18:15.246Z    INFO    controller.kfdef-controller Starting Controller {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef"}
2023-06-30T10:18:16.106Z    ERROR   controller-runtime.source   if kind is a CRD, it should be installed before calling Start   {"kind": "BuildConfig.build.openshift.io", "error": "no matches for kind \"BuildConfig\" in version \"build.openshift.io/v1\""}
2023-06-30T10:18:16.106Z    INFO    controller.secret-generator-controller  Starting workers    {"reconciler group": "", "reconciler kind": "Secret", "worker count": 1}
2023-06-30T10:18:16.106Z    INFO    controllers.KfDef   Watch a change for KfDef CR {"instance": "opendatahub-ml-workshop", "namespace": "ml-workshop"}
I0630 10:18:17.252216       1 request.go:668] Waited for 1.047053266s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
2023-06-30T10:18:18.455Z    ERROR   controller-runtime.source   if kind is a CRD, it should be installed before calling Start   {"kind": "DeploymentConfig.apps.openshift.io", "error": "no matches for kind \"DeploymentConfig\" in version \"apps.openshift.io/v1\""}
2023-06-30T10:18:18.455Z    ERROR   controller.kfdef-controller Could not wait for Cache to sync    {"reconciler group": "kfdef.apps.kubeflow.org", "reconciler kind": "KfDef", "error": "failed to wait for kfdef-controller caches to sync: no matches for kind \"DeploymentConfig\" in version \"apps.openshift.io/v1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/manager/internal.go:696
2023-06-30T10:18:18.455Z    INFO    controller.secret-generator-controller  Shutdown signal received, waiting for all workers to finish {"reconciler group": "", "reconciler kind": "Secret"}
2023-06-30T10:18:18.455Z    INFO    controller.secret-generator-controller  All workers finished    {"reconciler group": "", "reconciler kind": "Secret"}
2023-06-30T10:18:18.455Z    ERROR   setup   problem running manager {"error": "failed to wait for kfdef-controller caches to sync: no matches for kind \"DeploymentConfig\" in version \"apps.openshift.io/v1\""}
runtime.goexit
    /usr/lib/golang/src/runtime/asm_amd64.s:1571
2023-06-30T10:18:18.455Z    ERROR   error received after stop sequence was engaged  {"error": "leader election lost"}
runtime.goexit
    /usr/lib/golang/src/runtime/asm_amd64.s:1571
aquynh1682 commented 1 year ago

Hi, @webmakaka

I used opendatahub-operator.v1.1.1

My trick is to first set it up with Automatic approval mode, and then quickly switch it back to Manual mode so that it functions normally =)))).

image

image

webmakaka commented 1 year ago

Hi, @aquynh1682 !

I completed this step.

pic-07-1

pic-07-2

pic-07-3

pic-07-4

To pass this step, I edited build_push_image.py and manually specified:

    access_key="minio",
    secret_key="minio123",

and I took these paths from the mlflow bucket in Minio:

 data_file_model = minioClient.fget_object("mlflow", f"/2/726460da00bc4bedb7f70f20e08bc3b3/artifacts/model/model.pkl", "model.pkl")

 data_file_requirements = minioClient.fget_object("mlflow", f"/2/726460da00bc4bedb7f70f20e08bc3b3/artifacts/model/requirements.txt", "requirements.txt")

I think you can manually specify your parameters, at least MODEL_NAME in line 27

aquynh1682 commented 1 year ago

Hi, @webmakaka

I apologize for the delayed response. I am currently facing another issue related to kaniko. I am quite certain that I am doing everything correctly, but could you please let me know which registry you are using, and which API endpoint you enter in JupyterHub?

Log error push image to registry:

{"auths":{"https://hub.docker.com/v2":{"auth":"cXV5bmhuZ28xMTM6UXV5bmhscDEyMzQ1NmFA"}}}
retrieving model metadata from mlflow...
<RegisteredModel: creation_timestamp=1688097597263, description='', last_updated_timestamp=1688097632383, latest_versions=[<ModelVersion: creation_timestamp=1688097632383, current_stage='None', description='', last_updated_timestamp=1688097632383, name='mlflowdemo', run_id='42b2e5c665864ab48f7979ded26673f5', run_link='', source='s3://mlflow/1/42b2e5c665864ab48f7979ded26673f5/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='1'>], name='mlflowdemo', tags={}>
initializing connection to s3 server...
download successful
/workspace/jupyter-work-dir
/kaniko/executor --context=/workspace/jupyter-work-dir --dockerfile=Dockerfile --verbosity=debug --cache=true --single-snapshot=true --destination=https://hub.docker.com/v2/quynhngo113/mlflowdemo:latest
===============
b'error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "https://hub.docker.com/v2/quynhngo113/mlflowdemo:latest": creating push check transport for https: failed: Get "https://https/v2/": dial tcp: lookup https on 169.254.25.10:53: no such host\n'
===============
b'\x1b[37mDEBU\x1b[0m[0000] Copying file /workspace/jupyter-work-dir/Dockerfile to /kaniko/Dockerfile \n'
webmakaka commented 1 year ago

My Runtimes is

Name: MyAirflow
Description: MyAirflow

Apache Airflow UI Endpoint: https://airflow.192.168.49.2.nip.io
Apache Airflow User Namespace: ml-workshop
Github API Endpoint: https://api.github.com
GitHub DAG Repository: wildmakaka/airflow-dags
GitHub DAG Repository Branch: main
Github Personal Access Token: [YOUR_GITHUB_TOKEN]

Cloud Object Storage Endpoint: http://minio-ml-workshop:9000
Cloud Object Storage Credential Secret: [empty]
Cloud Object Storage Username: minio
Cloud Object Storage Password: minio123
Cloud Object Storage Bucket Name: airflow

I think, your final value https://hub.docker.com/v2/quynhngo113/mlflowdemo:latest is incorrect.

aquynh1682 commented 1 year ago

My Runtime config is completely normal.

I was just asking about the declaration of the variables passed into the build_push_image.py file.

image

webmakaka commented 1 year ago

Looks correct! Maybe there are problems with passing values into the script? Can you manually set the values in build_push_image.py?

webmakaka commented 1 year ago

I have problems with running Spark scripts. The old problem has returned.

aquynh1682 commented 1 year ago

I also encountered a similar error, and I just ignored it :)))

aquynh1682 commented 1 year ago

I manually set the values in build_push_image.py but it still gives an error.

This is the build_push_image.py:

import string
import subprocess
import os
import base64
import mlflow
from minio import Minio
from mlflow.tracking import MlflowClient

"""
    This script assumes that the /kaniko/.docker/config.json has the correct repo and associated credentials mounted
    It also expects the these env variables has been set
    CONTAINER_REGISTRY is the resitry server like quay.io
    CONTAINER_DETAILS is the container coordinates like ml-on-k8s/containermodel:1.0.0
    AWS_SECRET_ACCESS_KEY is the password for the S3 store
    MODEL_NAME is hte name of the model in mlflow
    MODEL_VERSION is the version of the model in mlflow
"""

os.environ['MLFLOW_S3_ENDPOINT_URL']='http://minio-ml-workshop:9000'
os.environ['AWS_ACCESS_KEY_ID']='minio'
os.environ['AWS_REGION']='us-east-1'
os.environ['AWS_BUCKET_NAME']='mlflow'

HOST = "http://mlflow:5500"

model_name = "mlflowdemo"
model_version = 1
build_name = f"seldon-model-{model_name}-v{model_version}"

auth_encoded = string.Template(":").substitute(os.environ)
os.environ["CONTAINER_REGISTRY_CREDS"] = base64.b64encode(auth_encoded.encode("ascii")).decode("ascii")

docker_auth = string.Template('{"auths":{"https://quay.io/api/v1":{"auth":"$CONTAINER_REGISTRY_CREDS"}}}').substitute(os.environ)
print(docker_auth)
f = open("/kaniko/.docker/config.json", "w")
f.write(docker_auth)
f.close()

def get_s3_server():
    minioClient = Minio('minio-ml-workshop:9000',
                        # access_key=os.environ['AWS_ACCESS_KEY_ID'],
                        # secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                        access_key="minio",
                        secret_key="minio123",
                        secure=False)

    return minioClient

def init():
    mlflow.set_tracking_uri(HOST)

def download_artifacts():
    print("retrieving model metadata from mlflow...")
    # model = mlflow.pyfunc.load_model(
    #     model_uri=f"models:/{model_name}/{model_version}"
    # )
    client = MlflowClient()

    model = client.get_registered_model(model_name)

    print(model)

    run_id = model._latest_version[0].run_id
    source = model._latest_version[0].source
    experiment_id = "1" # to be calculated from the source which is source='s3://mlflow/1/bf721e5641394ed6866baf20131fca20/artifacts/model'

    print("initializing connection to s3 server...")
    minioClient = get_s3_server()

    #     artifact_location = mlflow.get_experiment_by_name('rossdemo').artifact_location
    #     print("downloading artifacts from s3 bucket " + artifact_location)

    data_file_model = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/model.pkl", "model.pkl")
    data_file_requirements = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/requirements.txt", "requirements.txt")

    #Using boto3 Download the files from mlflow, the file path is in the model meta
    #write the files to the file system
    print("download successful")

    return run_id

def build_push_image():
    container_location = string.Template("https://quay.io/api/v1/quynhngo113/mlflowdemo:1.1.0").substitute(os.environ)

    #For docker repo, do not include the registry domain name in container location
    # if os.environ["https://hub.docker.com/v2"].find("docker.io") != -1:
    #     container_location= os.environ["quynhngo113/mlflowdemo:1.1.0"]
    # print(os.getcwd())
    full_command = "/kaniko/executor --context=" + os.getcwd() + " --dockerfile=Dockerfile --verbosity=debug --cache=true --single-snapshot=true --destination=" + container_location
    print(full_command)
    process = subprocess.run(full_command, shell=True, check=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print("===============")
    print(process.stdout)
    print("===============")
    print(process.stderr)

    # print(subprocess.check_output(['/kaniko/executor', '--context', '/workspace',  '--dockerfile', 'Dockerfile', '--destination', container_location]))

init()
download_artifacts()
build_push_image()

This error:

{"auths":{"https://quay.io/api/v1":{"auth":"cXV5bmhuZ28xMTM6UXV5bmhscDEyMzQ1NmFA"}}}
retrieving model metadata from mlflow...
<RegisteredModel: creation_timestamp=1688097597263, description='', last_updated_timestamp=1688097632383, latest_versions=[<ModelVersion: creation_timestamp=1688097632383, current_stage='None', description='', last_updated_timestamp=1688097632383, name='mlflowdemo', run_id='42b2e5c665864ab48f7979ded26673f5', run_link='', source='s3://mlflow/1/42b2e5c665864ab48f7979ded26673f5/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='1'>], name='mlflowdemo', tags={}>
initializing connection to s3 server...
download successful
/kaniko/executor --context=/workspace/jupyter-work-dir --dockerfile=Dockerfile --verbosity=debug --cache=true --single-snapshot=true --destination=https://quay.io/api/v1/quynhngo113/mlflowdemo:1.1.0
===============
b'error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "https://quay.io/api/v1/quynhngo113/mlflowdemo:1.1.0": creating push check transport for https: failed: Get "https://https/v2/": dial tcp: lookup https on 169.254.25.10:53: no such host\n'
===============
b'\x1b[37mDEBU\x1b[0m[0000] Copying file /workspace/jupyter-work-dir/Dockerfile to /kaniko/Dockerfile \n'
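The `Get "https://https/v2/"` failure in both logs above can be explained by how an image reference is parsed: the text before the first `/` in the `--destination` value is taken as the registry host, so a scheme prefix turns the literal string `https:` into the "hostname" that DNS then fails to resolve. A small sketch reproducing that parse (the repo path is taken from the log as an example; a quay.io destination would also not include the `/api/v1` path):

```python
def registry_host(destination):
    """Return the registry component of an image reference: the text before the first '/'."""
    return destination.split("/")[0]


bad = "https://quay.io/api/v1/quynhngo113/mlflowdemo:1.1.0"
good = "quay.io/quynhngo113/mlflowdemo:1.1.0"

print(registry_host(bad))   # -> "https:"  (not a resolvable host)
print(registry_host(good))  # -> "quay.io"
```

This matches webmakaka's earlier remark that the final destination value was incorrect: dropping the scheme and API path from both CONTAINER_REGISTRY and the destination should make the push permission check reach the real registry.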
webmakaka commented 1 year ago

Hi, @aquynh1682! Did you finish reading this book?

I have an issue with loading data into Grafana in Chapter 10. Maybe you can recommend something?

pic-10-2

aquynh1682 commented 1 year ago

Hi, @webmakaka!

I have finished reading the book, though there are a few things I had to skip :)).

As for Grafana not fetching metrics: I read the manifests/prometheus/base/prometheus.yaml file. Look at the spec.selector.matchLabels section of the ServiceMonitor; it selects on the label app.kubernetes.io/managed-by: seldon-core. Earlier I looked for that label and my Seldon deployment didn't have it. I think that's because I skipped the Seldon part before the Grafana part, so the images built there were never started. Try running that part before Grafana and check whether the Seldon app's pods carry the seldon-core label. (I haven't had a chance to test this; my Kubernetes cluster got deleted lol :)))).

I hope you can conquer the last part of the book 🫡


rajat-packt commented 1 year ago

Hey @aquynh1682 and @webmakaka, have you been able to figure out your issues? If yes, can I close this thread?

If you still need help with something, please share the details of the error you are facing, and I will try to reach out to the author for assistance. You can also create separate issue threads in case of multiple issues.

webmakaka commented 1 year ago

Hi, @rajat-packt! I have no issue with this chapter. Can you ask the author to share the sources for the images? I want to try to build the images myself.

aquynh1682 commented 1 year ago

Hi, @rajat-packt, I have an issue in chapters 7 and 10. From what I can see, they both hit the same error related to Kaniko. Here is the error log file from running the Python file build_push_image.py. Please contact me if you need any additional information to help fix this issue. Thank you very much.

build_push_image (1).log

And here is the content of that Python file.

import string
import subprocess
import os
import base64
import mlflow
from minio import Minio
from mlflow.tracking import MlflowClient

"""
    This script assumes that the /kaniko/.docker/config.json has the correct repo and associated credentials mounted
    It also expects the these env variables has been set
    CONTAINER_REGISTRY is the resitry server like quay.io
    CONTAINER_DETAILS is the container coordinates like ml-on-k8s/containermodel:1.0.0
    AWS_SECRET_ACCESS_KEY is the password for the S3 store
    MODEL_NAME is hte name of the model in mlflow
    MODEL_VERSION is the version of the model in mlflow
"""

os.environ['MLFLOW_S3_ENDPOINT_URL']='http://minio-ml-workshop:9000'
os.environ['AWS_ACCESS_KEY_ID']='minio'
os.environ['AWS_REGION']='us-east-1'
os.environ['AWS_BUCKET_NAME']='mlflow'

HOST = "http://mlflow:5500"

model_name = os.environ["MODEL_NAME"]
model_version = os.environ["MODEL_VERSION"]
build_name = f"seldon-model-{model_name}-v{model_version}"

auth_encoded = string.Template("$CONTAINER_REGISTRY_USER:$CONTAINER_REGISTRY_PASSWORD").substitute(os.environ)
os.environ["CONTAINER_REGISTRY_CREDS"] = base64.b64encode(auth_encoded.encode("ascii")).decode("ascii")

docker_auth = string.Template('{"auths":{"$CONTAINER_REGISTRY":{"auth":"$CONTAINER_REGISTRY_CREDS"}}}').substitute(os.environ)
print(docker_auth)
f = open("/kaniko/.docker/config.json", "w")
f.write(docker_auth)
f.close()

def get_s3_server():
    minioClient = Minio('minio-ml-workshop:9000',
                        access_key=os.environ['AWS_ACCESS_KEY_ID'],
                        secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                        secure=False)

    return minioClient

def init():
    mlflow.set_tracking_uri(HOST)

def download_artifacts():
    print("retrieving model metadata from mlflow...")
    # model = mlflow.pyfunc.load_model(
    #     model_uri=f"models:/{model_name}/{model_version}"
    # )
    client = MlflowClient()

    model = client.get_registered_model(model_name)

    print(model)

    # use the public latest_versions property rather than the private attribute
    run_id = model.latest_versions[0].run_id
    source = model.latest_versions[0].source
    experiment_id = "1" # to be calculated from the source, which looks like source='s3://mlflow/1/bf721e5641394ed6866baf20131fca20/artifacts/model'

    print("initializing connection to s3 server...")
    minioClient = get_s3_server()

    #     artifact_location = mlflow.get_experiment_by_name('rossdemo').artifact_location
    #     print("downloading artifacts from s3 bucket " + artifact_location)

    data_file_model = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/model.pkl", "model.pkl")
    data_file_requirements = minioClient.fget_object("mlflow", f"/{experiment_id}/{run_id}/artifacts/model/requirements.txt", "requirements.txt")

    #Using boto3 Download the files from mlflow, the file path is in the model meta
    #write the files to the file system
    print("download successful")

    return run_id

def build_push_image():
    container_location = string.Template("$CONTAINER_REGISTRY/$CONTAINER_DETAILS").substitute(os.environ)

    #For docker repo, do not include the registry domain name in container location
    if os.environ["CONTAINER_REGISTRY"].find("docker.io") != -1:
        container_location= os.environ["CONTAINER_DETAILS"]

    full_command = "/kaniko/executor --context=" + os.getcwd() + " --dockerfile=Dockerfile --verbosity=debug --cache=true --single-snapshot=true --destination=" + container_location
    print(full_command)
    process = subprocess.run(full_command, shell=True, check=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(process.stdout)
    print(process.stderr)

    # print(subprocess.check_output(['/kaniko/executor', '--context', '/workspace',  '--dockerfile', 'Dockerfile', '--destination', container_location]))

init()
download_artifacts()
build_push_image()
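One thing the script above leaves hard-coded is `experiment_id = "1"`, even though its own comment notes it could be derived from the model version's `source` URI. A hedged sketch of that parsing (`parse_source` is a hypothetical helper, not from the book; it assumes the `s3://<bucket>/<experiment_id>/<run_id>/artifacts/...` layout visible in the logs above):

```python
from urllib.parse import urlparse

def parse_source(source: str) -> tuple:
    """Split an MLflow artifact URI like
    s3://mlflow/1/42b2e5c665864ab48f7979ded26673f5/artifacts/model
    into (experiment_id, run_id), assuming the bucket/experiment/run layout."""
    parts = urlparse(source).path.lstrip("/").split("/")
    return parts[0], parts[1]

exp_id, run_id = parse_source(
    "s3://mlflow/1/42b2e5c665864ab48f7979ded26673f5/artifacts/model")
print(exp_id, run_id)  # 1 42b2e5c665864ab48f7979ded26673f5
```

With this, `download_artifacts()` could build the MinIO object paths from `source` alone instead of assuming experiment 1.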
rajat-packt commented 1 year ago

Hey @aquynh1682, sorry we couldn't provide any assistance, as we weren't able to get any response from the author at the moment. I hope you were able to find a solution for this issue.

aquynh1682 commented 1 year ago

Hi @rajat-packt, thanks for trying to help me. I have resolved all the issues that I encountered.