elixir-cloud-aai / TESK

GA4GH Task Execution Service Root Project + Deployment scripts on Kubernetes
https://tesk.readthedocs.io
Apache License 2.0

Failing TESK tasks when integrated with cromwell server #117

Closed ghost closed 4 years ago

ghost commented 4 years ago

Problem

I am running TESK together with Cromwell/WDL in a k8s cluster. Cromwell has access to an SMB/CIFS file share that is also linked to the PV/PVC I defined as the transfer PVC for TESK. For simple workflows this setup works very well. However, when I tried a resource-heavy workflow that I would use on production data, it never ran to completion: random errors occur at different stages of the pipeline. They seem to be related to files/directories that are copied to the task PVC and then disappear or get corrupted somehow. The storage class I use for these task PVCs is object storage with rook/ceph blocks. The behaviour occurs both with the shared file system and when using FTP. I am not sure whether it is a TESK problem or a storage class problem.

Workflow and data to reproduce bug

The dataset I used to reproduce the bug is available on BaseSpace: run 20160127AN_NMP Baseline_12plex. If the bs client is installed, one can run something like:

bs download run --name "20160127AN_NMP Baseline_12plex" -o 20160127AN_NMPBaseline_12plex/

The workflow to run is the attached main.txt. As input, one needs to adjust this input.json:

{
  "MicrobioGenomeAssemblyAndAnalysis.run_folder": "/data/20160127AN_NMPBaseline_12plex",
  "MicrobioGenomeAssemblyAndAnalysis.sample_sheet": "/data/20160127AN_NMPBaseline_12plex/SampleSheet.csv",
  "MicrobioGenomeAssemblyAndAnalysis.cromwell_path_prefix": "/data"
}

Expected behaviour

The input filer copies the files from the transfer PVC to the task PVC, the task executor runs the task and writes its results to the task PVC, and the output filer copies the results back to the transfer PVC.

ghost commented 4 years ago

It is storage class related: it seems that the rook/ceph blocks somehow become corrupted, and files disappear or become truncated. I tried with an NFS provisioner and I do not see this error.
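
For anyone who wants to check whether their own storage class shows the same symptom, one way to narrow it down might be to compare file sizes and checksums between the source directory and its copy. The following is only a minimal sketch in Python and is not part of TESK: the function names `build_manifest` and `compare_manifests` are made up for illustration, and it assumes both directory trees are mounted and readable from the same machine or pod.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root) to (size in bytes, SHA-256 hex digest)."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large sequencing files do not fill memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        relative_name = path.relative_to(root).as_posix()
        manifest[relative_name] = (path.stat().st_size, digest.hexdigest())
    return manifest


def compare_manifests(source, copy):
    """Return (missing, truncated, corrupted) lists of relative paths in the copy.

    missing:   present in the source but absent from the copy
    truncated: present in both, but the copy is smaller than the source
    corrupted: present in both and not smaller, but the checksums differ
    """
    missing, truncated, corrupted = [], [], []
    for name, (size, digest) in source.items():
        if name not in copy:
            missing.append(name)
            continue
        copy_size, copy_digest = copy[name]
        if copy_size < size:
            truncated.append(name)
        elif copy_digest != digest:
            corrupted.append(name)
    return missing, truncated, corrupted
```

Calling `compare_manifests(build_manifest(src), build_manifest(dst))`, where `src` and `dst` are whatever mount points the transfer PVC and task PVC have in your setup, returns three lists that separate disappeared files from truncated ones. Running it once right after the copy and again later could show whether files degrade over time on the volume rather than during the transfer.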