argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Not working with LibreOffice #9117

Closed amosaini closed 2 years ago

amosaini commented 2 years ago

Summary

What happened/what you expected to happen? I need to use LibreOffice in headless mode to convert a docx file to PDF. This works fine on vanilla k8s and on Databricks, but when I do the same in Kubeflow, which uses Argo Workflows as its backend, it does not produce any output.

What version are you running? argoproj.io/v1alpha1 Kubeflow 1.4

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: libreoffice-pv-claim
spec:
  storageClassName: gp2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: libreoffice
spec:
  containers:
    - name: libreoffice-container
      image: domnulnopcea/libreoffice-headless:latest
      command: ["libreoffice", "--headless", "--convert-to","pdf" ,"/tests/288.pptx","--outdir", "/tests"]
      volumeMounts:
        - mountPath: "/tests"
          name: libreoffice-storage
  volumes:
    - name: libreoffice-storage
      persistentVolumeClaim:
        claimName: libreoffice-pv-claim
  tolerations:
    - key: project
      operator: Equal
      value: cd-msr
      effect: NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: libreoffice-bash
spec:
  containers:
    - name: libreoffice-container
      image: ubuntu:18.04
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
        - mountPath: "/tests"
          name: libreoffice-storage
  volumes:
    - name: libreoffice-storage
      persistentVolumeClaim:
        claimName: libreoffice-pv-claim
  tolerations:
    - key: project
      operator: Equal
      value: cd-msr
      effect: NoSchedule

This is the YAML I am using. I then manually copy the input files:

kubectl cp ./288.pptx libreoffice-bash:/tests/
kubectl cp ./dummy.pptx libreoffice-bash:/tests/

This works, but when I try to do the same in Kubeflow it does not. The script executes without producing any output file.

import kfp
import kfp.components as components
import kfp.dsl as dsl
from kfp.components import InputPath, OutputPath

@components.create_component_from_func
def download_file(s3_folder_path,object_name):
    input_file_path=s3_folder_path+"/"+object_name
    import subprocess
    subprocess.run('pip install boto3'.split())
    # Download file
    import boto3
    s3=boto3.client('s3')
    s3.download_file('qa-cd-msr-20220524050318415700000001', input_file_path, '/tmp/input.pptx')
    print(input_file_path + " file is downloaded...Executing libreoffice conversion")
    subprocess.run("ls -ltr /tmp".split())


def convert_to_pdf():
    import subprocess

    def exec_cmd(cmd) -> str:
        # Run the command and return stdout and stderr combined.
        print("Executing "+cmd)
        result=subprocess.run(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout=result.stdout.decode('utf-8') + '\n'+ result.stderr.decode('utf-8')
        print("stdout: "+stdout)
        return stdout

    exec_cmd("libreoffice --headless --convert-to pdf /files/input.pptx --outdir /files")
    exec_cmd("ls -ltr /files")


convert_to_pdf_op = components.func_to_container_op(convert_to_pdf, base_image="domnulnopcea/libreoffice-headless:latest")


@dsl.pipeline(
    name="Libreoffice",
    description="Libreoffice",
)
def sample_pipeline(s3_folder_path:str="/mpsr/decks", object_name:str="Adcetris_master_40.pptx"):
    vop = dsl.VolumeOp(
        name="create-pvc",
        resource_name="my-pvc",
        modes=dsl.VOLUME_MODE_RWO,
        size="1Gi"
    )
    download = download_file(s3_folder_path,object_name).add_pvolumes({"/tmp": vop.volume})
    convert = convert_to_pdf_op().add_pvolumes({"/files": download.pvolume})
    convert.execution_options.caching_strategy.max_cache_staleness = "P0D"
    convert.after(download)


client = kfp.Client()
experiment = client.create_experiment(
    name="Libreoffice", 
    description="Libreoffice",
    namespace="cd-msr"
) 
client.create_run_from_pipeline_func(
    sample_pipeline, 
    arguments={"s3_folder_path":"/mpsr/decks","object_name":"dummy1.pptx"}, 
    run_name="libreoffice", 
    experiment_name="Libreoffice"
)

Output:

image

Ignore the error here. I was also getting it on vanilla k8s, but the output file is still produced there.
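One way to narrow down a silent failure like this is to check the exit code and the expected output file explicitly, instead of only printing stdout and stderr. The following is a minimal sketch, not part of the original pipeline: the helper name `run_and_check` and its `expected_output` parameter are hypothetical, and it assumes the converted file is written to the `--outdir` directory.

```python
import os
import shlex
import subprocess


def run_and_check(cmd, expected_output=None):
    """Run cmd, print its exit code and streams, and report whether
    the expected output file exists afterwards (hypothetical helper)."""
    result = subprocess.run(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    print("exit code:", result.returncode)
    print("stdout:", result.stdout.decode("utf-8", errors="replace"))
    print("stderr:", result.stderr.decode("utf-8", errors="replace"))
    # With no expected file given, only the exit code is meaningful.
    produced = expected_output is None or os.path.exists(expected_output)
    return result.returncode, produced
```

Inside `convert_to_pdf` this could be called as `run_and_check("libreoffice --headless --convert-to pdf /files/input.pptx --outdir /files", "/files/input.pdf")`, where the output path `/files/input.pdf` is an assumption about how the converted file is named. A zero exit code with `produced` being `False` would point at the conversion step itself rather than at the volume mount.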

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

amosaini commented 2 years ago

I switched to the Docker runtime, and it works there.

alexec commented 2 years ago

Docker executor is no longer supported. If this does not work with PNS executor, then this is a regression. Have you tried PNS?

alexec commented 2 years ago

Can you please upload the workflow that caused this problem?

amosaini commented 2 years ago

I just tried it with the PNS executor, and it succeeds. I will get back to you with the Argo workflow.

amosaini commented 2 years ago

Output of the workflow with the PNS executor:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
    pipelines.kubeflow.org/pipeline_compilation_time: 2022-07-19T10:48:45.175943
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Libreoffice", "inputs":
      [{"default": "/mpsr/decks", "name": "s3_folder_path", "optional": true, "type":
      "String"}, {"default": "Adcetris_master_288.pptx", "name": "object_name", "optional":
      true, "type": "String"}], "name": "Libreoffice"}'
    pipelines.kubeflow.org/run_name: libreoffice
  creationTimestamp: "2022-07-19T10:48:45Z"
  generateName: libreoffice-
  generation: 8
  labels:
    pipeline/persistedFinalState: "true"
    pipeline/runid: 3c88a72f-5b46-46c1-9dd9-9765971611a2
    pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: libreoffice-kck7k
  namespace: cd-msr
  resourceVersion: "198021692"
  uid: 0d740423-2c54-4b2a-a55c-b0b8e546fd3f
spec:
  arguments:
    parameters:
    - name: s3_folder_path
      value: /mpsr/decks
    - name: object_name
      value: Adcetris_master_288.pptx
  entrypoint: libreoffice
  podMetadata:
    labels:
      pipeline/runid: 3c88a72f-5b46-46c1-9dd9-9765971611a2
  serviceAccountName: default-editor
  templates:
  - container:
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def convert_to_pdf():
            import subprocess
            def exec_cmd(cmd):
                print("Executing "+cmd)
                result=subprocess.run(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
                stdout=result.stdout.decode('utf-8') + '\n'+ result.stderr.decode('utf-8')
                print("stdout: "+stdout)
                return stdout
            exec_cmd("libreoffice --headless --convert-to pdf /files/input.pptx --outdir /files")
            exec_cmd("ls -ltr /files")

        import argparse
        _parser = argparse.ArgumentParser(prog='Convert to pdf', description='')
        _parsed_args = vars(_parser.parse_args())

        _outputs = convert_to_pdf(**_parsed_args)
      image: domnulnopcea/libreoffice-headless:latest
      name: ""
      resources: {}
      volumeMounts:
      - mountPath: /files
        name: create-pvc
    inputs:
      parameters:
      - name: create-pvc-name
    metadata:
      annotations:
        pipelines.kubeflow.org/component_ref: '{}'
        pipelines.kubeflow.org/component_spec: '{"implementation": {"container": {"args":
          [], "command": ["sh", "-ec", "program_path=$(mktemp)\nprintf \"%s\" \"$0\"
          > \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n", "def convert_to_pdf():\n    import
          subprocess\n    def exec_cmd(cmd):\n        print(\"Executing \"+cmd)\n        result=subprocess.run(cmd.split(),
          stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n        stdout=result.stdout.decode(''utf-8'')
          + ''\\n''+ result.stderr.decode(''utf-8'')\n        print(\"stdout: \"+stdout)\n        return
          stdout\n    exec_cmd(\"libreoffice --headless --convert-to pdf /files/input.pptx
          --outdir /files\")\n    exec_cmd(\"ls -ltr /files\")\n\nimport argparse\n_parser
          = argparse.ArgumentParser(prog=''Convert to pdf'', description='''')\n_parsed_args
          = vars(_parser.parse_args())\n\n_outputs = convert_to_pdf(**_parsed_args)\n"],
          "image": "domnulnopcea/libreoffice-headless:latest"}}, "name": "Convert
          to pdf"}'
        pipelines.kubeflow.org/max_cache_staleness: P0D
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
        pipelines.kubeflow.org/enable_caching: "true"
        pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
        pipelines.kubeflow.org/pipeline-sdk-type: kfp
    name: convert-to-pdf
    outputs: {}
    volumes:
    - name: create-pvc
      persistentVolumeClaim:
        claimName: '{{inputs.parameters.create-pvc-name}}'
  - inputs: {}
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
        pipelines.kubeflow.org/enable_caching: "true"
        pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
        pipelines.kubeflow.org/pipeline-sdk-type: kfp
    name: create-pvc
    outputs:
      parameters:
      - name: create-pvc-manifest
        valueFrom:
          jsonPath: '{}'
      - name: create-pvc-name
        valueFrom:
          jsonPath: '{.metadata.name}'
      - name: create-pvc-size
        valueFrom:
          jsonPath: '{.status.capacity.storage}'
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: '{{workflow.name}}-my-pvc'
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
  - container:
      args:
      - --s3-folder-path
      - '{{inputs.parameters.s3_folder_path}}'
      - --object-name
      - '{{inputs.parameters.object_name}}'
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def download_file(s3_folder_path,object_name):
            input_file_path=s3_folder_path+"/"+object_name
            import subprocess
            subprocess.run('pip install boto3'.split())
            # Download file
            import boto3
            s3=boto3.client('s3')
            s3.download_file('qa-cd-msr-20220524050318415700000001', input_file_path, '/tmp/input.pptx')
            print(input_file_path + " file is downloaded...Executing libreoffice conversion")
            subprocess.run("ls -ltr /tmp".split())

        import argparse
        _parser = argparse.ArgumentParser(prog='Download file', description='')
        _parser.add_argument("--s3-folder-path", dest="s3_folder_path", type=str, required=True, default=argparse.SUPPRESS)
        _parser.add_argument("--object-name", dest="object_name", type=str, required=True, default=argparse.SUPPRESS)
        _parsed_args = vars(_parser.parse_args())

        _outputs = download_file(**_parsed_args)
      image: python:3.7
      name: ""
      resources: {}
      volumeMounts:
      - mountPath: /tmp
        name: create-pvc
    inputs:
      parameters:
      - name: create-pvc-name
      - name: object_name
      - name: s3_folder_path
    metadata:
      annotations:
        pipelines.kubeflow.org/arguments.parameters: '{"object_name": "{{inputs.parameters.object_name}}",
          "s3_folder_path": "{{inputs.parameters.s3_folder_path}}"}'
        pipelines.kubeflow.org/component_ref: '{}'
        pipelines.kubeflow.org/component_spec: '{"implementation": {"container": {"args":
          ["--s3-folder-path", {"inputValue": "s3_folder_path"}, "--object-name",
          {"inputValue": "object_name"}], "command": ["sh", "-ec", "program_path=$(mktemp)\nprintf
          \"%s\" \"$0\" > \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n",
          "def download_file(s3_folder_path,object_name):\n    input_file_path=s3_folder_path+\"/\"+object_name\n    import
          subprocess\n    subprocess.run(''pip install boto3''.split())\n    # Download
          file\n    import boto3\n    s3=boto3.client(''s3'')\n    s3.download_file(''qa-cd-msr-20220524050318415700000001'',
          input_file_path, ''/tmp/input.pptx'')\n    print(input_file_path + \" file
          is downloaded...Executing libreoffice conversion\")\n    subprocess.run(\"ls
          -ltr /tmp\".split())\n\nimport argparse\n_parser = argparse.ArgumentParser(prog=''Download
          file'', description='''')\n_parser.add_argument(\"--s3-folder-path\", dest=\"s3_folder_path\",
          type=str, required=True, default=argparse.SUPPRESS)\n_parser.add_argument(\"--object-name\",
          dest=\"object_name\", type=str, required=True, default=argparse.SUPPRESS)\n_parsed_args
          = vars(_parser.parse_args())\n\n_outputs = download_file(**_parsed_args)\n"],
          "image": "python:3.7"}}, "inputs": [{"name": "s3_folder_path"}, {"name":
          "object_name"}], "name": "Download file"}'
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
        pipelines.kubeflow.org/enable_caching: "true"
        pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
        pipelines.kubeflow.org/pipeline-sdk-type: kfp
    name: download-file
    outputs: {}
    volumes:
    - name: create-pvc
      persistentVolumeClaim:
        claimName: '{{inputs.parameters.create-pvc-name}}'
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: create-pvc-name
            value: '{{tasks.create-pvc.outputs.parameters.create-pvc-name}}'
        dependencies:
        - create-pvc
        - download-file
        name: convert-to-pdf
        template: convert-to-pdf
      - arguments: {}
        name: create-pvc
        template: create-pvc
      - arguments:
          parameters:
          - name: create-pvc-name
            value: '{{tasks.create-pvc.outputs.parameters.create-pvc-name}}'
          - name: object_name
            value: '{{inputs.parameters.object_name}}'
          - name: s3_folder_path
            value: '{{inputs.parameters.s3_folder_path}}'
        dependencies:
        - create-pvc
        name: download-file
        template: download-file
    inputs:
      parameters:
      - name: object_name
      - name: s3_folder_path
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
    name: libreoffice
    outputs: {}
status:
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: mlpipeline-minio-artifact
        bucket: mlpipeline
        endpoint: minio-service.kubeflow:9000
        insecure: true
        keyFormat: artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}
        secretKeySecret:
          key: secretkey
          name: mlpipeline-minio-artifact
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2022-07-19T10:58:45Z"
  nodes:
    libreoffice-kck7k:
      children:
      - libreoffice-kck7k-2249554463
      displayName: libreoffice-kck7k
      finishedAt: "2022-07-19T10:58:45Z"
      id: libreoffice-kck7k
      inputs:
        parameters:
        - name: object_name
          value: Adcetris_master_288.pptx
        - name: s3_folder_path
          value: /mpsr/decks
      name: libreoffice-kck7k
      outboundNodes:
      - libreoffice-kck7k-55020225
      phase: Succeeded
      progress: 3/3
      resourcesDuration:
        cpu: 1077
        memory: 708
      startedAt: "2022-07-19T10:48:45Z"
      templateName: libreoffice
      templateScope: local/libreoffice-kck7k
      type: DAG
    libreoffice-kck7k-55020225:
      boundaryID: libreoffice-kck7k
      displayName: convert-to-pdf
      finishedAt: "2022-07-19T10:58:35Z"
      hostNodeName: ip-10-120-112-29.ec2.internal
      id: libreoffice-kck7k-55020225
      inputs:
        parameters:
        - name: create-pvc-name
          value: libreoffice-2zg86-my-pvc
      name: libreoffice-kck7k.convert-to-pdf
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: artifacts/libreoffice-kck7k/2022/07/19/libreoffice-kck7k-55020225/main.log
        exitCode: "0"
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 1066
        memory: 702
      startedAt: "2022-07-19T10:49:24Z"
      templateName: convert-to-pdf
      templateScope: local/libreoffice-kck7k
      type: Pod
    libreoffice-kck7k-2249554463:
      boundaryID: libreoffice-kck7k
      children:
      - libreoffice-kck7k-3459936430
      - libreoffice-kck7k-55020225
      displayName: create-pvc
      finishedAt: "2022-07-19T10:48:46Z"
      hostNodeName: ip-10-120-112-29.ec2.internal
      id: libreoffice-kck7k-2249554463
      name: libreoffice-kck7k.create-pvc
      outputs:
        exitCode: "0"
        parameters:
        - name: create-pvc-manifest
          value: '{"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"creationTimestamp":"2022-07-19T09:51:46Z","finalizers":["kubernetes.io/pvc-protection"],"managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:accessModes":{},"f:resources":{"f:requests":{".":{},"f:storage":{}}},"f:volumeMode":{}}},"manager":"kubectl-create","operation":"Update","time":"2022-07-19T09:51:46Z"}],"name":"libreoffice-2zg86-my-pvc","namespace":"cd-msr","resourceVersion":"197916499","uid":"7b579172-d65c-4294-b6a8-77d2f2acaa0f"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}},"storageClassName":"gp2","volumeMode":"Filesystem"},"status":{"phase":"Pending"}}'
          valueFrom:
            jsonPath: '{}'
        - name: create-pvc-name
          value: libreoffice-2zg86-my-pvc
          valueFrom:
            jsonPath: '{.metadata.name}'
        - name: create-pvc-size
          value: ""
          valueFrom:
            jsonPath: '{.status.capacity.storage}'
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 0
        memory: 0
      startedAt: "2022-07-19T10:48:45Z"
      templateName: create-pvc
      templateScope: local/libreoffice-kck7k
      type: Pod
    libreoffice-kck7k-3459936430:
      boundaryID: libreoffice-kck7k
      children:
      - libreoffice-kck7k-55020225
      displayName: download-file
      finishedAt: "2022-07-19T10:49:19Z"
      hostNodeName: ip-10-120-112-29.ec2.internal
      id: libreoffice-kck7k-3459936430
      inputs:
        parameters:
        - name: create-pvc-name
          value: libreoffice-2zg86-my-pvc
        - name: object_name
          value: Adcetris_master_288.pptx
        - name: s3_folder_path
          value: /mpsr/decks
      name: libreoffice-kck7k.download-file
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: artifacts/libreoffice-kck7k/2022/07/19/libreoffice-kck7k-3459936430/main.log
        exitCode: "0"
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 11
        memory: 6
      startedAt: "2022-07-19T10:48:55Z"
      templateName: download-file
      templateScope: local/libreoffice-kck7k
      type: Pod
  phase: Succeeded
  progress: 3/3
  resourcesDuration:
    cpu: 1077
    memory: 708
  startedAt: "2022-07-19T10:48:45Z"

amosaini commented 2 years ago

The above workflow simply uses LibreOffice headless to convert a pptx file to PDF. I looked at the LibreOffice logs as well, and they show frequent SIGINT and SIGTERM signals. My guess is that the emissary executor sends different or additional signals to the process, which the application (in our case LibreOffice) cannot handle. Does the emissary executor also kill some kind of thread on its own?
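To test the signal hypothesis above, one option is to log the signals that reach the Python wrapper that launches LibreOffice. This is a rough sketch under stated assumptions: the names `log_signal`, `install_signal_logging`, and `received` are hypothetical, and it only records signals delivered to the Python process itself, not signals sent directly to the LibreOffice child process.

```python
import signal
import time

# Names of the signals seen so far, in arrival order.
received = []


def log_signal(signum, frame):
    """Record and print each signal instead of letting it terminate us."""
    name = signal.Signals(signum).name
    received.append(name)
    print(f"{time.strftime('%H:%M:%S')} received {name}", flush=True)


def install_signal_logging(signums=(signal.SIGINT, signal.SIGTERM)):
    """Install log_signal as the handler for the given signals."""
    for signum in signums:
        signal.signal(signum, log_signal)
```

Calling `install_signal_logging()` at the top of `convert_to_pdf`, before the conversion command runs, would make any SIGINT or SIGTERM that reaches the wrapper show up in the pod's main log with a timestamp.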

alexec commented 2 years ago

Thank you. Have you tried :latest?

amosaini commented 2 years ago

:latest of LibreOffice? Yes, I also tried https://hub.docker.com/r/linuxserver/libreoffice#!, which was the official image I ran it on. Same issue there as well.

alexec commented 2 years ago

No, argoproj/argoexec:latest.

amosaini commented 2 years ago

Let me just try that and I will get back to you after the execution. It would help to speed up the process if you could tell me where to add it exactly.

amosaini commented 2 years ago

I changed gcr.io/ml-pipeline/argoexec:v3.1.6-patch-license-compliance to argoproj/argoexec:latest and set the executor to emissary. Now the pipeline is throwing Error (exit code 2): unexpected end of JSON input. Screenshot attached.

image

amosaini commented 2 years ago

libreoffice_fail_stack (1).txt: log trace of LibreOffice.

alexec commented 2 years ago

You need to run the latest controller too.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

amosaini commented 2 years ago

We upgraded the platform to Kubeflow 1.5. It is working there. Thanks.