AxoyTO opened this issue 2 weeks ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.
This message was autogenerated
First of all, I managed to reproduce this: the underlying Argo Workflow was fully deleted, alongside all the Pipeline Pods in the user namespace, and after that the UI showed the error above.
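For reference, one way to get into the same state without waiting is to delete the run's Workflow and its pods manually. Below is a rough, purely illustrative sketch (not necessarily the exact steps used here); the run name and namespace are taken from the logs that follow, and the kubernetes Python client is just one way to do it:

```python
# Illustrative sketch: delete a run's Argo Workflow and its pods to simulate
# the state described above. Names/namespace are examples from the logs below.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()
core = client.CoreV1Api()

namespace = "admin"                       # user namespace of the run
workflow = "tutorial-data-passing-6nvfj"  # Workflow backing the pipeline run

# Delete the Workflow custom resource (argoproj.io/v1alpha1, plural "workflows")
custom.delete_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace=namespace,
    plural="workflows",
    name=workflow,
)

# Delete the pods the workflow created, matched by Argo's standard label
core.delete_collection_namespaced_pod(
    namespace=namespace,
    label_selector=f"workflows.argoproj.io/workflow={workflow}",
)
```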
Looking at the kfp-persistence logs, I saw the following:
2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:55.600Z [persistenceagent] time="2024-10-29T08:51:55Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
But in the logs of the upstream ml-pipeline-persistenceagent pod, I see the following:
time="2024-10-30T14:33:37Z" level=info msg="Wait for shut down"
time="2024-10-30T14:33:38Z" level=info msg="Syncing Workflow (tutorial-data-passing-xgj6p): success, processing complete." Workflow=tutorial-data-passing-xgj6p
time="2024-10-30T14:33:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:33:40Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:33:40Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:38Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:35:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:35:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
What is odd is that in the Charmed pod we see the Syncing Workflow... message multiple times, while in the upstream component we only see it once, after which subsequent syncs are skipped because the workflow is marked as persisted. I am not entirely sure whether this is related to the issue.
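As far as I know, the persistence agent records that mark as a label on the Workflow object (the exact label key varies by KFP version). If it helps with debugging, a small sketch like the one below can dump the labels for comparison; the run name and namespace are taken from the upstream logs above, and cluster access is assumed:

```python
# Hypothetical check: print all labels on the Workflow to see whether the
# persistence agent has marked it as persisted. We print everything rather
# than hard-coding a label key, since the key depends on the KFP version.
from kubernetes import client, config

config.load_kube_config()
wf = client.CustomObjectsApi().get_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="kubeflow-user-example-com",
    plural="workflows",
    name="tutorial-data-passing-xgj6p",
)
for key, value in wf["metadata"].get("labels", {}).items():
    print(f"{key}={value}")
```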
Also, after a quick search in the upstream project, I see a similar issue was raised: https://github.com/kubeflow/pipelines/issues/8935.
That issue appears to be resolved, however, and the PR with the fix was merged in KFP 2.2.0, which is used by Kubeflow 1.9.
Bug Description
Twenty-four hours after a pipeline run is created, all logs belonging to it disappear from the Charmed Kubeflow UI, despite the logs still being present in MinIO (mlpipeline bucket, S3 API). This makes it difficult to troubleshoot and track the progress or failures of pipeline runs after the 24-hour period.
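One way to confirm the logs are still in object storage after they disappear from the UI is to list the run's artifacts directly in MinIO. The sketch below is illustrative only; the endpoint, credentials, and key layout are assumptions (common defaults for a KFP MinIO install) and may differ in your deployment:

```python
# Hypothetical check that the run's logs still exist in MinIO after they
# stop showing up in the UI. Endpoint/credentials/prefix are assumptions.
from minio import Minio

mc = Minio(
    "localhost:9000",     # e.g. after: kubectl port-forward -n kubeflow svc/minio 9000:9000
    access_key="minio",
    secret_key="minio123",
    secure=False,
)

# Argo/KFP typically archive pod logs under artifacts/<workflow-name>/... in
# the mlpipeline bucket; list whatever is there for the affected run.
for obj in mc.list_objects("mlpipeline", prefix="artifacts/tutorial-data-passing-", recursive=True):
    print(obj.object_name, obj.size)
```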
To Reproduce
Environment
CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle on: https://paste.ubuntu.com/p/NXXFhDqmVn/
Relevant Log Output
Additional Context
Notebook used to create the pipeline, which was run on a notebook server with a GPU:
Could be related to upstream: https://github.com/kubeflow/pipelines/issues/7617
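For orientation, the tutorial-data-passing-* runs in the logs above come from a simple data-passing pipeline. A minimal, purely illustrative KFP v2 sketch of such a pipeline (not the actual notebook referenced above, whose contents may differ) looks roughly like this:

```python
# Illustrative KFP v2 sketch of a simple data-passing pipeline, similar in
# spirit to the tutorial-data-passing runs in the logs. Component names and
# logic are placeholders, not the attached notebook.
from kfp import dsl
from kfp.client import Client


@dsl.component
def produce() -> str:
    return "some data"


@dsl.component
def consume(msg: str):
    print(msg)


@dsl.pipeline(name="tutorial-data-passing")
def tutorial_data_passing():
    # Pass the output of one step to the next
    consume(msg=produce().output)


if __name__ == "__main__":
    # Inside a Kubeflow notebook, Client() usually picks up in-cluster auth;
    # otherwise pass host/credentials explicitly.
    Client().create_run_from_pipeline_func(tutorial_data_passing, arguments={})
```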