canonical / bundle-kubeflow

Charmed Kubeflow

Pipeline logs are disappearing after 24h #1120

Open AxoyTO opened 2 weeks ago

AxoyTO commented 2 weeks ago

Bug Description

24 hours after a pipeline run is created, all of its logs disappear from the Charmed Kubeflow UI, even though they are still present in the MinIO mlpipeline bucket (accessed via the AWS S3 API). This makes it difficult to troubleshoot and track the progress or failures of pipeline runs once the 24-hour period has passed.

[screenshot of the error shown in the Kubeflow Pipelines UI]

!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline
                           [...]
                           PRE addition-pipeline-4g94d/
                           PRE addition-pipeline-4qwt4/
                           PRE download-preprocess-train-deploy-pipeline-8wjv9/
                           PRE mnist-pipeline-fcmgr/
                           [...]
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/download-preprocess-train-deploy-pipeline-8wjv9/download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/
2024-10-15 15:27:49      10796 main.log
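
As a workaround while the UI hides them, the logs can still be pulled straight from the MinIO bucket. A minimal sketch, assuming boto3 is available in the notebook and that the MinIO endpoint and credentials are exposed via environment variables; the object key is the one listed above and will differ for every run:

import os
import boto3

# Talk to MinIO through its S3-compatible API (endpoint and credentials are
# assumed to be set in the environment, as in the aws CLI calls above).
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["MINIO_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Object key of the main.log listed above (per-run, adjust accordingly).
key = ("download-preprocess-train-deploy-pipeline-8wjv9/"
       "download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/"
       "main.log")

s3.download_file("mlpipeline", key, "main.log")
print(open("main.log").read())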

To Reproduce

  1. Deploy Charmed Kubeflow 1.9 using Juju.
  2. Create a pipeline and run it.
  3. After the run completes, observe that logs are available in the Kubeflow UI.
  4. Wait for 24 hours after the pipeline run completes.
  5. Attempt to view the pipeline logs in the UI again. Expected: Logs should still be accessible. Actual: Logs are no longer visible in the UI, but are still present in the underlying MinIO/mlpipeline (AWS S3).

Environment

CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle at: https://paste.ubuntu.com/p/NXXFhDqmVn/

Relevant Log Output

<none>

Additional Context

Notebook used to create the pipeline, which was run on a notebook server with a GPU:

import kfp
from kfp import dsl, kubernetes

@dsl.component(
    base_image="tensorflow/tensorflow:latest-gpu",
    # packages_to_install=["tensorflow"]
)
def foo():
    '''Prints the number of GPUs visible to TensorFlow.'''
    print("GPU Test")
    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    print("GPU Test")

@dsl.pipeline(
    name='Addition pipeline',
    description='An example pipeline that performs addition calculations.')
def foo_pipeline():
    task = (foo()
            .set_cpu_request(str(2))
            .set_cpu_limit(str(4))
            .set_memory_request("2Gi")
            .set_memory_limit("4Gi")
            .set_gpu_limit("1")
            .set_accelerator_type("nvidia.com/gpu")
           )

    task = kubernetes.set_image_pull_policy(task=task, policy="Always")

    task = kubernetes.add_toleration(
        task=task,
        key="sku",
        operator="Equal",
        value="gpu",
        effect="NoSchedule",
    )

namespace = "admin"

client = kfp.Client()

run = client.create_run_from_pipeline_func(
    run_name="gpu_test",
    pipeline_func=foo_pipeline,
    namespace=namespace,
    experiment_name="gpu-foo")

Could be related to upstream: https://github.com/kubeflow/pipelines/issues/7617

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.

This message was autogenerated

kimwnasptd commented 2 days ago

First of all, I managed to reproduce this by:

  1. Creating a run
  2. Waiting for about a day

The underlying Argo Workflow was fully deleted, along with all the pipeline Pods in the user namespace. Then, in the UI, I would see the error shown above.
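
For reference, the Workflow CRs can be checked directly to confirm they were garbage-collected. A rough sketch using the kubernetes Python client; the admin namespace is just the one used in the notebook above:

from kubernetes import client, config

# List Argo Workflow custom resources in the user namespace; after ~24h the
# Workflow backing the run is gone, which matches what the UI shows.
config.load_kube_config()
api = client.CustomObjectsApi()
workflows = api.list_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="admin", plural="workflows",
)
for wf in workflows.get("items", []):
    print(wf["metadata"]["name"], wf.get("status", {}).get("phase"))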

kimwnasptd commented 2 days ago

Looking at the kfp-persistence logs I saw the following:

2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:55.600Z [persistenceagent] time="2024-10-29T08:51:55Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj

But in the logs from the upstream ml-pipeline-persistenceagent pod, I see the following:

time="2024-10-30T14:33:37Z" level=info msg="Wait for shut down"
time="2024-10-30T14:33:38Z" level=info msg="Syncing Workflow (tutorial-data-passing-xgj6p): success, processing complete." Workflow=tutorial-data-passing-xgj6p
time="2024-10-30T14:33:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:33:40Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:33:40Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:38Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:35:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:35:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"

What is odd is that in the Charmed pod we see the "Syncing Workflow..." message repeated many times, while in the upstream component we see it only once before the workflow is marked as persisted. But I am not entirely sure whether this is related to the issue.

kimwnasptd commented 2 days ago

Also, after a quick search in the upstream project, I found a similar issue: https://github.com/kubeflow/pipelines/issues/8935

But that one appears to be resolved, and the PR with the fix was merged into KFP 2.2.0, which is the version used by Kubeflow 1.9.
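
To double-check which backend version is actually deployed, the container images can be listed; a rough sketch with the kubernetes Python client, assuming the charms run in the kubeflow model/namespace (pod names may differ in Charmed Kubeflow):

from kubernetes import client, config

# Print images of pipeline-related pods so the deployed backend can be compared
# against KFP 2.2.0, where the upstream fix landed.
config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod("kubeflow").items:
    if "pipeline" in pod.metadata.name or "kfp" in pod.metadata.name:
        for c in pod.spec.containers:
            print(pod.metadata.name, c.image)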