Also, my bcbio_cwl.yaml looks like this:
# bcbio-nextgen v1.0.9a
fc_name: bcbio_cwl
upload:
  dir: ../final
globals:
  analysis_regions: regions/RefSeqExomeAndPanels_20171003.bed
  coverage_regions: regions/RefSeqExomeAndPanels_20171003.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description: GiaB
    algorithm:
      aligner: bwa
      save_diskspace: true
      coverage_interval: regional
      coverage: coverage_regions
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: false
      tools_on:
        - noalt_calling
        - coverage_perbase
      tools_off:
        - gatk4
        - gemini
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: analysis_regions
    files: [input/GiaB_R1.fastq.gz, input/GiaB_R2.fastq.gz]
Matthias; Thanks for trying out the CWL implementation and apologies about the issues. This error comes from being unable to find the reference transcript files, specifically the BED and GTF files:
$ ls -lh hg38/rnaseq/
total 3.3G
-rw-rw-r-- 1 ubuntu ubuntu 6.7M Nov 24 2015 ref-transcripts.bed
-rw-rw-r-- 1 ubuntu ubuntu 933M Nov 24 2015 ref-transcripts.gtf
How did you install the reference genomes you're pointing at in your bcbio_system.yaml? Is it possible the transcript files are missing? Thanks again for testing and helping debug.
Hi Brad,
I used the installer script that comes with bcbio, so all data should be there. Strangely enough, there's no rnaseq dir. Could this have something to do with the rnaseq data target? I've excluded that one, since we don't run any RNA-seq experiments.
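If that's the cause, I assume re-running the installer with the rnaseq data target would pull the missing files in; a sketch, using the documented upgrade flags:

# fetch the previously excluded rnaseq data target for hg38
bcbio_nextgen.py upgrade --data --datatarget rnaseq --genomes hg38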
The issue does seem to have gone away by adding the files.
Thanks! M
PS: do you mind if I use this thread for some further questions about the CWL implementation? I'm trying to figure out how everything works so we can run CWL on our Torque cluster using Toil. Once I've got everything down, I can open a PR to add my findings to the docs.
Matthias; Glad that fixed things. We'll probably need to rethink some of bcbio's installation process as we move to CWL, but it's generally good to have those files available for annotation, so no harm having them in place. Happy to answer other questions; the CWL process is still not optimal for running locally on clusters, but we'd appreciate your feedback and suggestions as we improve it. Thanks again.
Hi Brad, Glad to hear you're open to comments. While trying to run, I got a notification that no indexes were found for snpEff. I suppose this is because we've never used snpEff before, but I'm reinstalling all the data files anyway.
Once I've been able to generate the CWL, I'll follow up with the arguments I used, etc.
Cheers M
Hi Brad,
I'd like to follow up on this. I'm getting the following error:
$ bcbio_vm.py cwl config/bcbio_cwl.yaml [12:24:09]
[2018-02-26T11:24Z] INFO: Using input YAML configuration: config/bcbio_cwl.yaml
[2018-02-26T11:24Z] INFO: Checking sample YAML configuration: config/bcbio_cwl.yaml
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
args.func(args)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/main.py", line 12, in run
create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 36, in from_world
prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 425, in prep_cwl
for cur in workflow.generate(variables, steps, wfoutputs):
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 62, in generate
inputs, parallel_ids, nested_inputs = _get_step_inputs(step, file_vs, std_vs, parallel_ids)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 168, in _get_step_inputs
for orig_input in [_get_variable(x, file_vs) for x in _handle_special_inputs(step.inputs, file_vs)]:
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 314, in _handle_special_inputs
assert found_indexes, "Found no snpEff indexes in %s" % [v["id"] for v in variables]
AssertionError: Found no snpEff indexes in ['config__algorithm__align_split_size', 'files', 'reference__fasta__base', 'reference__rtg', 'genome_resources__variation__lcr', 'config__algorithm__save_diskspace', 'config__algorithm__variantcaller', 'genome_resources__rnaseq__transcripts_mask', 'genome_resources__variation__genesplicer', 'postprocess_alignment/config__algorithm__coverage_interval', 'genome_resources__rnaseq__gene_bed', 'genome_resources__variation__qsignature', 'genome_resources__variation__dbscsnv', 'rgnames__lb', 'genome_resources__variation__dbnsfp', 'genome_resources__rnaseq__rRNA_fa', 'rgnames__rg', 'config__algorithm__realign', 'metadata__batch', 'rgnames__lane', 'metadata__phenotype', 'config__algorithm__nomap_split_targets', 'genome_resources__variation__sv_repeat', 'reference__bwa__indexes', 'reference__fastagz', 'genome_resources__variation__train_hapmap', 'config__algorithm__jointcaller', 'config__algorithm__bam_clean', 'config__algorithm__nomap_split_size', 'config__algorithm__validate', 'genome_resources__rnaseq__dexseq', 'config__algorithm__validate_regions', 'genome_resources__rnaseq__transcriptome_index__tophat', 'config__algorithm__aligner', 'genome_resources__variation__train_omni', 'genome_build', 'rgnames__pu', 'genome_resources__variation__train_1000g', 'config__algorithm__recalibrate', 'postprocess_alignment/config__algorithm__coverage', 'genome_resources__rnaseq__transcripts', 'genome_resources__aliases__human', 'config__algorithm__quality_format', 'config__algorithm__tools_off', 'genome_resources__variation__dbsnp', 'config__algorithm__mark_duplicates', 'reference__twobit', 'vrn_file', 'reference__viral', 'genome_resources__version', 'genome_resources__variation__cosmic', 'genome_resources__rnaseq__refflat', 'reference__genome_context', 'config__algorithm__qc', 'analysis', 'rgnames__sample', 'config__algorithm__tools_on', 'config__algorithm__effects', 'postprocess_alignment/config__algorithm__variant_regions', 'genome_resources__aliases__ensembl', 'rgnames__pl', 'genome_resources__variation__train_indels', 'genome_resources__aliases__snpeff', 'config__algorithm__archive', 'alignment_to_rec/alignment_rec', 'postprocess_alignment/align_bam', 'alignment/hla__fastq', 'alignment/work_bam_plus__disc', 'alignment/work_bam_plus__sr', 'prep_samples_to_rec/prep_samples_rec', 'postprocess_alignment/config__algorithm__variant_regions_merged', 'postprocess_alignment/config__algorithm__variant_regions_orig', 'postprocess_alignment/config__algorithm__coverage_merged', 'postprocess_alignment/config__algorithm__coverage_orig', 'postprocess_alignment/config__algorithm__seq2c_bed_ready', 'postprocess_alignment_to_rec/postprocess_alignment_rec', 'postprocess_alignment/regions__callable', 'postprocess_alignment/regions__sample_callable', 'postprocess_alignment/regions__nblock', 'postprocess_alignment/depth__samtools__stats', 'postprocess_alignment/depth__samtools__idxstats', 'postprocess_alignment/depth__variant_regions__regions', 'postprocess_alignment/depth__variant_regions__dist', 'postprocess_alignment/depth__sv_regions__regions', 'postprocess_alignment/depth__sv_regions__dist', 'postprocess_alignment/depth__coverage__regions', 'postprocess_alignment/depth__coverage__dist', 'postprocess_alignment/depth__coverage__thresholds', 'combine_sample_regions/config__algorithm__callable_regions', 'combine_sample_regions/config__algorithm__non_callable_regions', 'combine_sample_regions/config__algorithm__callable_count']
I've tried reinstalling the data using the installer script, and doing a run using snpEff to create any missing indexes, but to no avail.
Also, I've noticed the CWL generator creates tar archives for the directories in the genomes dir. What's the reasoning behind this? Especially when using a local bcbio, this creates a lot of storage overhead.
Any more ideas about the error?
Thanks M
Matthias; Sorry about the issues. Can you confirm that your install has snpEff indices for the genome you're trying to use? For hg38 you should have:
/path/to/bcbio/genomes/Hsapiens/hg38/snpeff/GRCh38.86
You'd mentioned earlier you had manually removed these so perhaps they're still missing?
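A quick way to verify (a sketch; the GRCh38.86 name should match whatever snpEff database your install uses):

# check the snpEff data files are in place for the hg38 build
ls -lh /path/to/bcbio/genomes/Hsapiens/hg38/snpeff/GRCh38.86/
# the directory should contain the snpEff .bin data files
# (e.g. snpEffectPredictor.bin)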
Sorry about the space issues with the tarballs. CWL runners have variable ways of dealing with complex directories full of files, so the tarballs help make the workflows portable across multiple locations. Longer term we won't need to keep both CWL and non-CWL methods around, so we'll avoid the duplicated files.
Hope this helps
Hi Brad,
Can confirm the files are there, and contain all the .bin files. Any more ideas?
No problem with the tarballs, I just wanted to know what they're for. If at a later stage we want to move to CWL only, would it suffice to keep only the tarballs and remove the original directories? Or would this cause errors again?
Thanks M
Matthias;
Sorry about the continued issues. I'm able to replicate your error if the snpeff directory is missing or otherwise incorrect, but otherwise am not sure what is causing it. Would you be able to share a tree of your hg38 directory and the contents of your hg38-resources.yaml file? I wonder if there is some disconnect where we're not picking up the snpEff installation.
Right now you only need the tarballs for CWL runs, not the unpacked directories as well. The pre-built genomes we have available only have these, and I'm planning to work on making those easier for local HPC installs too.
Sorry again about these issues and hope we can identify the underlying problem.
Hi Brad,
I've included a tree of the snpEff dir. Curiously enough, I'm unable to find hg38-resources.yaml. Could this cause the issue? Shouldn't that file be created by updating with the bcbio_nextgen.py upgrade command?
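For reference, I searched for it like this (assuming the default genomes layout):

# look for any resources file under the hg38 genome directory
find /path/to/bcbio/genomes/Hsapiens/hg38 -name "*-resources.yaml"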
Cheers M
Matthias;
Thanks much for helping debug. The resources file should be in hg38/seq/hg38-resources.yaml, and I wanted to check whether the aliases: snpeff line matches the GRCh38.86 specification in your snpEff directory. I think there might be some mismatch here that's keeping bcbio from finding the correct directory to prepare. Hope this helps explain the problem.
Hi Brad,
The contents of the yaml file:
[login] galaxy:seq $ cat hg38-resources.yaml [17:19:44]
version: 22
aliases:
  human: true
  snpeff: GRCh38.86
  ensembl: homo_sapiens_merged_vep_91_GRCh38
variation:
  dbsnp: ../variation/dbsnp-150.vcf.gz
  dbnsfp: ../variation/dbNSFP.txt.gz
  dbscsnv: ../variation/dbscSNV.txt.gz
  train_hapmap: ../variation/hapmap_3.3.vcf.gz
  train_omni: ../variation/1000G_omni2.5.vcf.gz
  train_1000g: ../variation/1000G_phase1.snps.high_confidence.vcf.gz
  train_indels: ../variation/Mills_and_1000G_gold_standard.indels.vcf.gz
  cosmic: ../variation/cosmic.vcf.gz
  lcr: ../coverage/problem_regions/repeats/LCR.bed.gz
  sv_repeat: ../coverage/problem_regions/repeats/sv_repeat_telomere_centromere.bed
  qsignature: ../variation/qsignature.txt
  genesplicer: ../variation/genesplicer
rnaseq:
  transcripts: ../rnaseq/ref-transcripts.gtf
  transcripts_mask: ../rnaseq/ref-transcripts-mask.gtf
  transcriptome_index:
    tophat: ../rnaseq/tophat/hg38_transcriptome.ver
  dexseq: ../rnaseq/ref-transcripts.dexseq.gff3
  refflat: ../rnaseq/ref-transcripts.refFlat
  rRNA_fa: ../rnaseq/rRNA.fa
The snpEff versions do seem to match. Any more ideas?
Matthias; Thanks for all this patient debugging. It does look like everything is right with the install, and I think the issue might be different paths for finding the files. There is a new approach we've been working on for CWL that avoids needing all of the Galaxy *.loc files and configuration, and instead finds things from the directory structure. This also lets us run on non-local filesystems (DNAnexus, SevenBridges, Arvados) where your reference collection is not on your system. If you add this to your bcbio system YAML file:
local:
  ref: /path/to/bcbio/genomes
Does that avoid the problem?
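On a local install that could look something like the following; a sketch, assuming the system file sits at the default galaxy/bcbio_system.yaml location inside the install:

# append a local reference section to the system configuration
# (if a local: key already exists, merge by hand instead)
cat >> /path/to/bcbio/galaxy/bcbio_system.yaml <<'EOF'
local:
  ref: /path/to/bcbio/genomes
EOF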
Alternatively, if you remove effects: false from your configuration and re-try, that should add in the snpEff indices via the older approach.
Do either of those work to avoid the issue? Thanks again for all your patience working through these issues.
Hi Brad,
I've added the local variable to the system YAML. This does seem to have resolved the issue for now; the bcbio_vm script has successfully created a tarball of the snpeff dir.
From my point of view, it looks like the script is stuck, since there's no output after:
$ bcbio_vm.py cwl config/bcbio_cwl.yaml [17:58:51]
[2018-02-27T17:00Z] INFO: Using input YAML configuration: config/bcbio_cwl.yaml
[2018-02-27T17:00Z] INFO: Checking sample YAML configuration: config/bcbio_cwl.yaml
I now know the script is creating the tarballs, but it would be helpful to have a bit more verbosity, so users know it's actually still doing something. Do you suppose it would be possible to add that?
Thanks for all the help! Much appreciated. I'll let you know if something else comes up.
M
Hi Brad,
I was able to successfully create the CWL workflow for our config. I'm now trying to run it using the bcbio_vm.py script with the following command:
bcbio_vm.py cwlrun toil bcbio_cwl-workflow -- --batchSystem torque --no-container
However, it seems the --no-container flag is ignored, because the script fails with /bin/sh: 1: docker: not found.
Is this an issue with the bcbio_vm script, or should I open an issue in the Toil repo?
Thanks M
Matthias; Glad that the change got the CWL compiled, and thanks for the feedback on the lack of progress reporting during tarring. Hopefully that process will go away once we pre-distribute these genomes, but I added a message for now to avoid any confusion in the short term.
For the run, you want the --no-container flag prior to the -- separator (everything after -- is passed directly to Toil rather than to bcbio_vm.py):
bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --batchSystem torque
Sorry for the confusion and hope that fixes it for you.
Hi Brad,
Thanks for the hint! It seems to work properly without docker now. The jobs keep failing, however; I've included the log file. It seems Toil is missing some variables when trying to run.
Thanks again for all the help. M
Matthias; Thanks for the detailed report. This looks like a system issue writing to the directory you have set as your temporary directory:
C/X/jobaeRng1 File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/tempfile.py", line 339, in mkdtemp
C/X/jobaeRng1 _os.mkdir(file, 0700)
C/X/jobaeRng1 OSError: [Errno 13] Permission denied: '/tmp/bcbio/tmp7kf__S'
Is it possible TMPDIR and other variables are set to /tmp but you don't have access to write there? Toil and other runners use the temporary directories to stage files, so you might have to adjust this to match your system. Thanks again for the work debugging this and hope this helps.
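A quick check is to try writing to the configured location yourself; a sketch, using the /tmp/bcbio directory from your sample YAML (the scratch path is just a placeholder):

# verify the configured temporary directory is writable
mkdir -p /tmp/bcbio && touch /tmp/bcbio/.write_test && rm /tmp/bcbio/.write_test
# if that fails, point TMPDIR at scratch space you own before running
export TMPDIR=/scratch/$USER/tmp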
Hi Brad,
Sorry to have been MIA on this. The issue was indeed a dir on the /tmp drive with wrong permissions. This has since been fixed, but now I've stumbled upon a new issue.
It seems the jobs Toil submits to the queue don't carry over all the env variables, especially the custom bcbio $PATH. The PATH variable is reset in the jobs to the default value, which causes all jobs to fail due to missing requirements.
I can see that Toil runs with the flag --preserve-environment PATH HOME, which should preserve those values I suppose, but it seems to be ignored. I've tried setting export TOIL_TORQUE_ARGS="-V PATH=$PATH" as specified in the Toil docs, but this just causes the qsub command to throw an error.
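To see what a submitted job actually receives, I dumped the environment from a trivial job; a sketch (the resource request is just an example):

# submit a throwaway job that records its environment; -V asks
# Torque to export the submission environment to the job
echo 'env | sort > $HOME/job_env.txt' | qsub -V -l nodes=1:ppn=1
# afterwards, compare PATH in $HOME/job_env.txt with the login shell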
Any idea what to do?
Thanks M
Hi Brad, After adding the missing deps to the default path, I'm seeing a lot of complaints about missing pickle files:
Traceback (most recent call last):
File "/Shared/Software/bin/_toil_worker", line 11, in <module>
load_entry_point('toil==3.5.0a1', 'console_scripts', '_toil_worker')()
File "/Shared/Software/lib/python2.7/site-packages/toil/worker.py", line 100, in main
jobStore = Toil.resumeJobStore(jobStoreLocator)
File "/Shared/Software/lib/python2.7/site-packages/toil/common.py", line 767, in resumeJobStore
jobStore.resume()
File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 75, in resume
super(FileJobStore, self).resume()
File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/abstractJobStore.py", line 149, in resume
with self.readSharedFileStream('config.pickle') as fileHandle:
File "/Shared/Software/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 296, in readSharedFileStream
with open(os.path.join(self.jobStoreDir, sharedFileName), 'r') as f:
IOError: [Errno 20] Not a directory: '/home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/process_alignment.cwl/config.pickle'
I also get job errors stating
Traceback (most recent call last):
File "/Shared/Software/bin/_toil_worker", line 11, in <module>
load_entry_point('toil==3.5.0a1', 'console_scripts', '_toil_worker')()
File "/Shared/Software/lib/python2.7/site-packages/toil/worker.py", line 100, in main
jobStore = Toil.resumeJobStore(jobStoreLocator)
File "/Shared/Software/lib/python2.7/site-packages/toil/common.py", line 767, in resumeJobStore
jobStore.resume()
File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 73, in resume
raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/home/projects/bcbio_annotation/test_runs/cwl/toil_work/CWLWorkflow' does not exist, so there is nothing to restart
This does seem oddly intermittent, because most of the time there's one job that survives and runs (success!). However, it seems there's a bug in the Torque BatchSystem API: the submitted job only reserves 1 core while using 6 threads (fixed with a PR).
M
Matthias; Thanks for all this detailed testing. For the PATH issue, @brainstorm worked on passing environment variables through from the parent environment, so it should happen within Torque according to the code.
I've noticed that you have an old version of Toil referenced (toil==3.5.0a1), which might not include these changes. Is it possible your install doesn't have the latest release (3.14.0) and so doesn't include these fixes? Could you try doing:
bcbiovm_conda install -c conda-forge -c bioconda toil
to see if that helps? That also might help with the intermittent file issue. Thanks for the Torque system patch and for all this testing work.
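After updating, a quick way to confirm the workers will pick up the new version (a sketch using pip metadata):

# check which toil entry points resolve and the installed version
which cwltoil _toil_worker
pip show toil | grep Version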
Hi Brad,
Thanks for all the help. I've updated everything I could find, including the bcbio_vm.py script, Toil, everything. I ran into an intermediate issue because of a Toil dependency on an older version of dill, but that should be fixed now too.
Sadly, the missing dir issue hasn't been resolved. I've tried regenerating the CWL configs, removing and recreating directories, etc., to no avail.
I've included the bcbio_cwl-toil.log file again; perhaps you can make something out of it. I really hope so. Also, the issue with the wrong PATH hasn't been resolved by the update.
Thanks again,
Cheers M bcbio_cwl-toil.log toil_job_1.e46946.txt
Matthias -- thanks for following up. The latest run does look to be using the latest toil, thanks for updating.
Could you try removing the problem Toil directory (/home/projects/bcbio_annotation/test_runs/cwl/toil_work/) and restarting from scratch? It seems like the run got into a bad state it's not able to recover from. I'm hoping a fresh run with the latest Toil might work more cleanly and avoid these issues. Thanks again for the help debugging.
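Concretely, something like this, reusing the paths and flags from earlier in the thread:

# clear the stale job store and work directory, then rerun from scratch
rm -rf /home/projects/bcbio_annotation/test_runs/cwl/toil_work/
bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --batchSystem torque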
Hi Brad,
Tried that already, doesn't help :-(. Any more ideas? I'm really at a loss as to what could be wrong.
Thanks M
Matthias; Sorry, I'm not exactly sure what to suggest at this point. Some suggestions:
I know Toil has some rough edges right now, so it would be useful if we could identify a reproducible error case and try to fix the underlying issue. Thanks again for the help debugging.
Hi Brad,
I just ran a test on a single machine using toil, which failed again. I got the following report:
2018-03-13 09:05:10,074 - toil.leader - INFO - Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl u/a/job9lHKQm
2018-03-13 09:08:58,472 - toil.leader - INFO - Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl A/Q/jobTDaCja
2018-03-13 09:08:58,516 - toil.serviceManager - INFO - Waiting for service manager thread to finish ...
2018-03-13 09:08:59,276 - toil.serviceManager - INFO - ... finished shutting down the service manager. Took 0.75955915451 seconds
2018-03-13 09:08:59,277 - toil.statsAndLogging - INFO - Waiting for stats and logging collator thread to finish ...
2018-03-13 09:09:02,411 - toil.statsAndLogging - INFO - ... finished collating stats and logs. Took 3.13325619698 seconds
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/bin/cwltoil", line 11, in <module>
load_entry_point('toil==3.14.0', 'console_scripts', 'cwltoil')()
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 990, in main
outobj = toil.start(wf1)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 743, in start
return self._runMainLoop(rootJobGraph)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 1018, in _runMainLoop
jobCache=self._jobCache).run()
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 197, in run
self.innerLoop()
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 400, in innerLoop
self.issueJobs(successors)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 571, in issueJobs
self.issueJob(job)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 550, in issueJob
jobBatchSystemID = self.batchSystem.issueBatchJob(jobNode)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/batchSystems/singleMachine.py", line 230, in issueBatchJob
'set to {}.'.format(jobNode.jobName, cores, self.maxCores, self.scale))
AssertionError: The job file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/prep_align_inputs.cwl is requesting 6.0 cores, more than the maximum of 2 cores this batch system was configured with. Scale is set to 1.
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
args.func(args)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 177, in run
_TOOLS[args.tool](args)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 108, in _run_toil
_run_tool(cmd, not args.no_container, work_dir)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 38, in _run_tool
subprocess.check_call(cmd, shell=True)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'export PATH=$PATH:/home/galaxy/bcbio-dev/anaconda/bin && cwltoil --jobStore /home/projects/bcbio_annotation/test_runs/cwl/toil_work/cwltoil_jobstore --logFile /home/projects/bcbio_annotation/test_runs/cwl/toil_work/bcbio_cwl-toil.log --workDir /home/projects/bcbio_annotation/test_runs/cwl/toil_work/tmpdir --linkImports --maxCores 2 --no-container --preserve-environment PATH HOME -- /home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/main-bcbio_cwl.cwl /home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/main-bcbio_cwl-samples.json' returned non-zero exit status 1
This seems to me like an issue with the resource requests, which looks strange to me. I started Toil using bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --maxCores 2.
Tests omitting the --maxCores flag throw the same error.
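One workaround I can think of (a sketch): allow at least as many cores as the largest single-step request, here the 6 cores prep_align_inputs.cwl asks for:

# raise the single-machine core limit to cover the 6-core step
bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --maxCores 6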
Next step is a test with the pregenerated CWL from your repo. Cheers M
And sadly, again an error when running the somatic workflow from test_bcbio_cwl. somatic-toil.log
Traceback:
2018-03-13 09:49:24,407 - toil.leader - INFO - Finished toil run with 5 failed jobs
2018-03-13 09:49:24,408 - toil.leader - INFO - Failed jobs at end of the run: 'CWLGather' 0/W/jobN7a6pc 'CWLWorkflow' c/K/joblkKODA 'CWLScatter' q/f/jobBZPvWo 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl l/a/job2UuziU 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/bin/cwltoil", line 11, in <module>
load_entry_point('toil==3.14.0', 'console_scripts', 'cwltoil')()
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 990, in main
outobj = toil.start(wf1)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 743, in start
return self._runMainLoop(rootJobGraph)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 1018, in _runMainLoop
jobCache=self._jobCache).run()
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 226, in run
raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException: The job store 'file:/home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/cwltoil_jobstore' contains 5 failed jobs: 'CWLGather' 0/W/jobN7a6pc, 'CWLWorkflow' c/K/joblkKODA, 'CWLScatter' q/f/jobBZPvWo, 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl l/a/job2UuziU, 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
=========> Failed job 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.14.0-b91dbf9bf6116879952f0a70f9a2fbbcae7e51b6.
WARNING:toil.fileStore:Starting job c/K/joblkKODA/g/tmpCRN3Tr.tmp with less than 10% of disk space remaining.
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/worker.py", line 309, in workerScript
job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/job.py", line 1328, in _runner
returnValues = self._run(jobGraph, fileStore)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/job.py", line 1273, in _run
return self.run(fileStore)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 462, in run
(output, status) = cwltool.main.single_job_executor(**opts)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/main.py", line 298, in single_job_executor
r.run(**kwargs)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/job.py", line 286, in run
self._setup(kwargs)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/job.py", line 158, in _setup
"file." % (knownfile, self.pathmapper.mapper(knownfile)[0]))
WorkflowException: Input file file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX_1.fq.gz.gbi (at /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX_1.fq.gz.gbi) not found or is not a regular file.
ERROR:toil.worker:Exiting the worker because of a failed job on host login
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L with ID x/T/job6f0j0L to 0
<=========
Traceback (most recent call last):
File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
args.func(args)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 177, in run
_TOOLS[args.tool](args)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 108, in _run_toil
_run_tool(cmd, not args.no_container, work_dir)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 38, in _run_tool
subprocess.check_call(cmd, shell=True)
File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'export PATH=$PATH:/home/galaxy/bcbio-dev/anaconda/bin && cwltoil --jobStore /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/cwltoil_jobstore --logFile /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/somatic-toil.log --workDir /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/tmpdir --linkImports --no-container --preserve-environment PATH HOME -- /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/main-somatic.cwl /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/main-somatic-samples.json' returned non-zero exit status 1
On a better note, both workflows seem to work with Bunny, so I suppose the issue lies with toil.
Cheers M
@matthdsm This all reads fairly familiar. We've tried for a long time to get the bcbio/CWL/Toil combination to work for different HPC environments (SLURM and PBSPro) but have put the effort on hold for now. Between the layers of stacks involved it became a nightmare to track down issues; maybe Toil debugging support has improved by now, and I'm sure it's mostly user errors on our side, but for the time being we're looking at Arvados (and hopefully Cromwell down the road) to give this another try.
Hi Oliver,
Sorry to hear you haven't had any luck with Toil either. At the moment, Arvados isn't really on the roadmap for us. We are, however, very interested in Cromwell and Funnel. @chapmanb I saw in the bcbio_vm repo that you're working on Funnel integration? What's the status and what are the prerequisites for that?
M
Perhaps @ejacox or @DailyDreaming would like to add their two cents on this?
M
@matthdsm We've had low bandwidth recently for HPC and have been focusing on cloud provisioning. I'd love to see more done with CWL on HPC, as I see that as a great userbase that could become a great community, but the current directives we've been given are for robust provisioning on AWS, Google, and Azure.
That said, I've seen a similar error before using cwltoil and on first inspection (at least previously) it seemed to be a problem with how cwltool was talking with toil (currently toil breaks up cwl workflows into chunks and passes them to cwltool to process as jobs so they pass filepaths back and forth; toil deletes things constantly though and I'm not sure the cwltool pathmapper always returns a path that makes sense to toil, especially with glob). That's my 2 cents, though it may be way off the mark, and this might be a different problem entirely.
Once we're done with the cloud provisioning I'd definitely like to come back and spend some serious time updating cwl support (and debugging in general).
Thanks all for this great discussion. We're definitely still interested in having Toil running on HPC, but are waiting patiently until it's also a priority for the team there. Both HPC support and improved debugging would be a huge help to make Toil CWL a first-class citizen there, but I know they've been busy with cloud support, so I'm happy to explore this more when the team is ready. We can run Toil workflows on HPC, but the major issue is debugging and unblocking when things don't work, as Matthias is hitting.
The funnel support in bcbio was only exploratory and isn't expected to work yet. This would be a cool direction to explore more, but I don't know all the ins and outs of getting funnel running on an HPC system. Matthias, if you look into this I'd love to hear your feedback.
As Oliver mentioned, Arvados CWL works great and we've scaled that to 1000s of samples on Arvados clusters. I know this is a bigger install but an option if you end up exploring that direction.
We're hopeful Cromwell will be available soon for testing with bcbio CWL, and definitely plan to work on that as soon as it's available. The team has been doing a ton of hard work supporting CWL and use Cromwell on HPC systems so it fits their development efforts and our needs well.
Sorry we don't have a single solution here right now but hope that helps explain the current state and directions. Thanks again for this great discussion.
Hi Brad, I think we're going to put Toil on the back burner until the team is ready to throw some time at it. We're going to be exploring Funnel a bit, as it's part of the GA4GH reference implementation. If I come across something, I'll let you know.
I'll be keeping an eye on CWL support on Cromwell, but could you give me a ping when you think everything's good to go on the bcbio side?
Thanks everyone for all the ongoing help and development.
Cheers M
Hi Brad,
I'm trying to get the bcbio CWL implementation to work, but I keep running into the following error:
I'm running the latest dev version of bcbio, locally installed, with bcbio_vm.py in the same env. I invoke the command with bcbio_vm.py cwl config/bcbio_cwl.yaml, which, I assume, picks up the system config from the local bcbio install. Specifying the actual system_config.yaml doesn't help and results in the same error. Any ideas?
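For completeness, this is how I passed the system file explicitly (--systemconfig per the bcbio_vm help):

# generate CWL with an explicitly specified system configuration
bcbio_vm.py cwl --systemconfig /path/to/bcbio_system.yaml config/bcbio_cwl.yaml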
Cheers M