Duke-GCB / calrissian

CWL on Kubernetes
https://duke-gcb.github.io/calrissian/
MIT License

contributing support for running calrissian on AKS (Azure), EKS (AWS) and GKE (Google) #124

Open pymonger opened 2 years ago

pymonger commented 2 years ago

Greetings,

I'm interested in using calrissian to run CWL workflows on the K8s services of the 3 major cloud vendors. I'm starting with Azure and am running into caveats (e.g. https://github.com/Duke-GCB/calrissian/issues/123) that are related to the ReadWriteMany requirement of PersistentVolumes. I'm willing to work through these issues for each of the cloud vendors, but would like to know the best approach for implementing the fixes so they can be contributed back to main. Since calrissian relies on functionality in https://github.com/common-workflow-language/cwltool, some of the kludges I've implemented just to get it to work on Azure actually required me to update cwltool (e.g. https://github.com/common-workflow-language/cwltool/pull/1544). That's probably not the right approach, so I'm looking for guidance on whether to proceed with making updates to cwltool or to find a way to build the capability into calrissian.

Thanks in advance.

fabricebrito commented 2 years ago

@pymonger can you share the current and expected behaviour? We'd be happy to help get Calrissian working on several KaaS providers.

fabricebrito commented 2 years ago

@pymonger regarding https://github.com/pymonger/soamc-cwl-demo#google-kubernetes-engine and the associated cost, we use https://longhorn.io/ as it provides ReadWriteMany using the nodes' disks. I wonder if that works on GKE.

pymonger commented 2 years ago

@fabricebrito: without making these changes to cwltool:

https://github.com/common-workflow-language/cwltool/pull/1544/files

I would get the following error:

--------------------------------------------------------------------------------
apiVersion: v1
kind: Pod
metadata:
  labels: {}
  name: stage-in-cwl-pod-ydxduxah
spec:
  containers:
  - args:
    - curl -O https://s3-us-west-2.amazonaws.com/landsat-pds/L8/010/117/LC80101172015002LGN00/LC80101172015002LGN00_BQA.TIF
      > stdout_stage-in.txt 2> stderr_stage-in.txt
    command:
    - /bin/sh
    - -c
    env:
    - name: HOME
      value: /XiTfjy
    - name: TMPDIR
      value: /tmp
    image: curlimages/curl
    name: stage-in-cwl-container
    resources:
      requests:
        cpu: '1'
        memory: 1024Mi
    volumeMounts:
    - mountPath: /XiTfjy
      name: calrissian-tmpout
      readOnly: false
      subPath: sqh3fknm
    - mountPath: /tmp
      name: tmpdir
    workingDir: /XiTfjy
  initContainers: []
  restartPolicy: Never
  securityContext:
    runAsGroup: 0
    runAsUser: 1001
  volumes:
  - name: calrissian-input-data
    persistentVolumeClaim:
      claimName: calrissian-input-data
      readOnly: true
  - name: calrissian-tmpout
    persistentVolumeClaim:
      claimName: calrissian-tmpout
      readOnly: false
  - name: calrissian-output-data
    persistentVolumeClaim:
      claimName: calrissian-output-data
      readOnly: false
  - emptyDir: {}
    name: tmpdir
--------------------------------------------------------------------------------

Created k8s pod name stage-in-cwl-pod-ydxduxah with id f17fc3f2-b49b-4182-bfc1-379eaac5a691
PodMonitor adding stage-in-cwl-pod-ydxduxah
k8s pod 'stage-in-cwl-pod-ydxduxah' started
[stage-in-cwl-pod-ydxduxah] follow_logs start
[stage-in-cwl-pod-ydxduxah] follow_logs end
Handling terminated pod name stage-in-cwl-pod-ydxduxah with id f17fc3f2-b49b-4182-bfc1-379eaac5a691
handling completion with 0
PodMonitor removing stage-in-cwl-pod-ydxduxah
shutil.rmtree(/tmp/tjb__2wk, True)
shutil.rmtree(/tmp/4oavux2h, True)
DEBUG restore [ram: 1024, cores: 1] to available [ram: 14976.0, cores: 7.0]
DEBUG Finishing ThreadPoolExecutor.run_jobs: total_resources=[ram: 16000.0, cores: 8.0], available_resources=[ram: 16000.0, cores: 8.0]
DEBUG Moving /calrissian/tmpout/sqh3fknm/LC80101172015002LGN00_BQA.TIF to /calrissian/output-data/LC80101172015002LGN00_BQA.TIF
ERROR Unhandled error:
  [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/shutil.py", line 566, in move
    os.rename(src, real_dst)
OSError: [Errno 18] Invalid cross-device link: '/calrissian/tmpout/sqh3fknm/LC80101172015002LGN00_BQA.TIF' -> '/calrissian/output-data/LC80101172015002LGN00_BQA.TIF'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/cwltool/main.py", line 1248, in main
    tool, initialized_job_order_object, runtimeContext, logger=_logger
  File "/usr/local/lib/python3.7/site-packages/cwltool/executors.py", line 60, in __call__
    return self.execute(process, job_order_object, runtime_context, logger)
  File "/usr/local/lib/python3.7/site-packages/cwltool/executors.py", line 157, in execute
    path_mapper=runtime_context.path_mapper,
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 401, in relocateOutputs
    stage_files(pm, stage_func=_relocate, symlink=False, fix_conflicts=True)
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 297, in stage_files
    stage_func(entry.resolved, entry.target)
  File "/usr/local/lib/python3.7/site-packages/cwltool/process.py", line 374, in _relocate
    shutil.move(src, dst)
  File "/usr/local/lib/python3.7/shutil.py", line 580, in move
    copy_function(src, real_dst)
  File "/usr/local/lib/python3.7/shutil.py", line 267, in copy2
    copystat(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/lib/python3.7/shutil.py", line 206, in copystat
    follow_symlinks=follow)
PermissionError: [Errno 1] Operation not permitted
Starting Cleanup
Finishing Cleanup

I filed this GitHub issue about it but closed it because I thought it would be straightforward to add an Azure StorageClass that supports ReadWriteMany:

https://github.com/Duke-GCB/calrissian/issues/123

The issue is that the Azure StorageClass that supports it is based on AzureFile, which mounts volumes using CIFS and doesn't allow file attributes to be modified. That is why the copystat call in the traceback fails with the above PermissionError:

https://docs.microsoft.com/en-us/answers/questions/89827/how-can-i-change-folder-or-file-permissions-when-m.html

So for the time being, I'm using my fork of cwltool (https://github.com/pymonger/cwltool/tree/handle-unsupported-file-ops) to work with calrissian to address the issue above.
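
For anyone following along, here is a minimal sketch of the kind of workaround this involves. It is not necessarily what the linked PR or fork does, and the names `copy_tolerating_metadata_errors` and `relocate` are hypothetical. It assumes the only failure is the `PermissionError` raised by `shutil.copystat` when the destination filesystem refuses attribute changes, and it only covers single files (for directories, `shutil.move` goes through `shutil.copytree`, which runs its own `copystat`).

```python
import os
import shutil


def copy_tolerating_metadata_errors(src, dst):
    """Copy file contents, then copy metadata on a best-effort basis.

    Hypothetical helper: stands in for shutil.copy2 on filesystems
    (such as a CIFS mount) that reject attribute changes.
    """
    # Mirror shutil.copy2: if dst is a directory, copy into it.
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    shutil.copyfile(src, dst)
    try:
        shutil.copystat(src, dst)
    except PermissionError:
        # Assumption: losing timestamps and mode bits is acceptable here.
        pass
    return dst


def relocate(src, dst):
    """Move src to dst, falling back to the tolerant copy across devices."""
    # shutil.move tries os.rename first and only calls copy_function
    # when the rename fails (e.g. Errno 18, cross-device link).
    return shutil.move(src, dst, copy_function=copy_tolerating_metadata_errors)
```

The trade-off is that file metadata silently goes missing on the destination, which is fine for workflow outputs but may not be what every caller wants.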

With regard to GKE, thanks for the pointer to longhorn. I'll look into it. I was able to run my CWL workflows on GKE using the NFS solution described at the link below, but longhorn may be a better solution for operational use:

https://medium.com/@Sushil_Kumar/readwritemany-persistent-volumes-in-google-kubernetes-engine-a0b93e203180