argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.08k stars 3.2k forks source link

Resource template logs for spark application don't get archived in artifact repo #9900

Open Freia3 opened 2 years ago

Freia3 commented 2 years ago

Pre-requisites

What happened/what you expected to happen?

I have an Argo Workflow running a Spark application (using the spark-operator). I want to archive the logs of this workflow in an artifact repository, but this does not work.

When running the hello-world workflow, the logs do get archived. yaml files to reproduce: https://github.com/Freia3/argo-spark-example

Version

v3.4.2

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-kubernetes-dag
  namespace: freia
spec:
  entrypoint: sparkling-operator
  serviceAccountName: argo-spark
  templates:
  - name: sparkpi
    resource: 
      action: create 
      successCondition: status.applicationState.state in (COMPLETED)
      failureCondition: 'status.applicationState.state in (FAILED, SUBMISSION_FAILED, UNKNOWN)'
      manifest: | 
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          generateName: spark-pi
          namespace: freia
        spec:
          type: Scala
          mode: cluster
          image: "gcr.io/spark-operator/spark:v3.0.0"
          imagePullPolicy: Always
          mainClass: org.apache.spark.examples.SparkPi
          mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
          sparkVersion: "3.0.0"
          restartPolicy:
            type: Never
          driver:
            memory: "512m"
            labels:
              version: 3.0.0
            serviceAccount: my-release-spark
          executor:
            instances: 1
            memory: "512m"
            labels:
              version: 3.0.0
  - name: sparkling-operator
    dag:
      tasks:
      - name: SparkPi1
        template: sparkpi

Logs from the workflow controller

time="2022-10-24T14:23:06.533Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Updated phase  -> Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="DAG node spark-kubernetes-dagbkjgn initialized Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="All of node spark-kubernetes-dagbkjgn.SparkPi1 dependencies [] completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Pod node spark-kubernetes-dagbkjgn-3694106157 initialized Pending" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.601Z" level=info msg="Created pod: spark-kubernetes-dagbkjgn.SparkPi1 (spark-kubernetes-dagbkjgn-sparkpi-3694106157)" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.616Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138244 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="node changed" namespace=freia new.message= new.phase=Running new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Pending old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.634Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138341 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="node unchanged" namespace=freia nodeID=spark-kubernetes-dagbkjgn-3694106157 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node changed" namespace=freia new.message= new.phase=Succeeded new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Running old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Outbound nodes of spark-kubernetes-dagbkjgn set to [spark-kubernetes-dagbkjgn-3694106157]" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn finished: 2022-10-24 14:27:02.049826737 +0000 UTC" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of spark-kubernetes-dagbkjgn" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Updated phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Marking workflow completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of " namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.055Z" level=info msg="cleaning up pod" action=deletePod key=freia/spark-kubernetes-dagbkjgn-1340600742-agent/deletePod
time="2022-10-24T14:27:02.067Z" level=info msg="Workflow update successful" namespace=freia phase=Succeeded resourceVersion=27139900 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.078Z" level=info msg="cleaning up pod" action=labelPodCompleted key=freia/spark-kubernetes-dagbkjgn-sparkpi-3694106157/labelPodCompleted

Logs from in your workflow's wait container

No resources found in argo namespace.

ajkaanbal commented 2 years ago

Your resource needs some metadata, take a look at the example:

https://github.com/argoproj/argo-workflows/blob/master/examples/k8s-resource-log-selector.yaml

Freia3 commented 2 years ago

@ajkaanbal This is for pulling the logs from the pods created by the spark CRD (spark-driver, spark-executor) In the Argo UI I see these logs: image I want to be able to archive those logs.

sarabala1979 commented 2 years ago

@Freia3 Current Resource template will not support archiving the log. Do you like to work on this enhancement?

Freia3 commented 2 years ago

@sarabala1979 Ok, thanks for the information, couldn't find this in the docs. No, I can't work on this enhancement.

arnoin commented 1 year ago

@sarabala1979 Hello I wish to contribute on this one, It is relevant for my team and I believe that the fix is straightforward.

From what Ive seen there are to ways to solve it,

  1. Make the argoexec resource store the logs as artifact using executor.WorkflowExecutor.SaveLogs - I still need to check if it will be able to store its own container logs while still running
  2. Initiate wait container for resource pods which by design stores logs of the main container here - I worry about redundant error reporting but I think it is safe

tbh I think 2 is a better option, wdyt?

Joibel commented 11 months ago

I'm probably in agreement about 2 being the right way to do it. @sarabala1979, can you pitch in?

arnoin commented 10 months ago

Hey @sarabala1979 do you want me to create pull request?

roofurmston commented 5 months ago

Hi

We would be interested in this functionality too. I wonder whether there are any updates on it?

Also, to clarify, we would in interested in getting the logs for containers other than main, right? The logs for main are not particularly useful for ResourceTemplate resources, so we have a sidecar to surface the logs from the external resource. We would be looking to get these sidecar logs archived too.