bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

CWL questions #2281

Closed matthdsm closed 6 years ago

matthdsm commented 6 years ago

Hi Brad,

I'm trying to get the bcbio CWL implementation to work, but I keep running into the following error:

$ bcbio_vm.py cwl config/bcbio_cwl.yaml                                                                                                                                               [9:20:21]
[2018-02-21T08:21Z] INFO: Using input YAML configuration: config/bcbio_cwl.yaml
[2018-02-21T08:21Z] INFO: Checking sample YAML configuration: config/bcbio_cwl.yaml
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
    args.func(args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/main.py", line 12, in run
    create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 36, in from_world
    prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 425, in prep_cwl
    for cur in workflow.generate(variables, steps, wfoutputs):
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 62, in generate
    inputs, parallel_ids, nested_inputs = _get_step_inputs(step, file_vs, std_vs, parallel_ids)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 168, in _get_step_inputs
    for orig_input in [_get_variable(x, file_vs) for x in _handle_special_inputs(step.inputs, file_vs)]:
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 285, in _get_variable
    raise ValueError("Did not find variable %s in \n%s" % (vid, pprint.pformat(variables)))
ValueError: Did not find variable genome_resources__rnaseq__gene_bed in [json_config_array]

I'm running the latest dev version of bcbio, locally installed, with bcbio_vm.py in the same env. I invoke the command with bcbio_vm.py cwl config/bcbio_cwl.yaml, which, I assume, detects the system_config from the local bcbio install.

Specifying the actual system_config.yaml doesn't help and results in the same error.

Any ideas?

Cheers M

matthdsm commented 6 years ago

Also, my bcbio_cwl.yaml looks like this:

#bcbio-nextgen v1.0.9a
fc_name: bcbio_cwl
upload:
  dir: ../final
globals:
  analysis_regions: regions/RefSeqExomeAndPanels_20171003.bed
  coverage_regions: regions/RefSeqExomeAndPanels_20171003.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description: GiaB
    algorithm:
      aligner: bwa
      save_diskspace: true
      coverage_interval: regional
      coverage: coverage_regions
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: false
      tools_on:
        - noalt_calling
        - coverage_perbase
      tools_off:
        - gatk4
        - gemini
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: analysis_regions
    files: [ input/GiaB_R1.fastq.gz, input/GiaB_R2.fastq.gz ]

chapmanb commented 6 years ago

Matthias; Thanks for trying out the CWL implementation and apologies about the issues. This error comes from being unable to find the reference transcript files, specifically the BED and GTF files:

$ ls -lh hg38/rnaseq/
total 3.3G
-rw-rw-r-- 1 ubuntu ubuntu 6.7M Nov 24  2015 ref-transcripts.bed
-rw-rw-r-- 1 ubuntu ubuntu 933M Nov 24  2015 ref-transcripts.gtf

How did you install the reference genomes you're pointing at in your bcbio_system.yaml? Is it possible that the transcript files are missing? Thanks again for testing and helping debug.

matthdsm commented 6 years ago

Hi Brad,

I used the installer script that comes with bcbio, so all the data should be there. Strangely enough, there's no rnaseq dir. Could this have something to do with the rnaseq data target? I excluded it, since we don't do any RNA-seq experiments.

The issue does seem to have gone away after adding the files.
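For anyone hitting the same thing, re-adding the rnaseq data target with a data-only upgrade should pull those files in (a guess based on the standard installer flags; adjust to your bcbio version):

bcbio_nextgen.py upgrade -u skip --data --datatarget rnaseq   # data-only upgrade adding the rnaseq target (flags assumed from the standard installer)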

Thanks! M

PS, do you mind if I use this thread for some further questions about the CWL implementation? I'm trying to figure out how everything works so we can run CWL on our Torque cluster using Toil. Once I've got everything down, I can open a PR to add my findings to the docs.

chapmanb commented 6 years ago

Matthias; Glad that fixed things. We'll probably need to rethink some of bcbio's installation process as we move to CWL, but it's generally good to have those files available for annotation, so no harm in having them in place. Happy to answer other questions; the CWL process is still not optimal for running locally on clusters, but we'd appreciate your feedback and suggestions as we improve it. Thanks again.

matthdsm commented 6 years ago

Hi Brad, Glad to hear you're open to comments. While trying to run, I got a notification that no indexes were found for snpEff. I suppose this is because we've never used snpEff before, but I'm reinstalling all the data files anyway.

Once I've been able to generate the CWL, I'll follow up with the arguments I used, etc.

Cheers M

matthdsm commented 6 years ago

Hi Brad,

I'd like to follow up on this. I'm getting the following error:

$ bcbio_vm.py cwl config/bcbio_cwl.yaml                                                                                                                                              [12:24:09]
[2018-02-26T11:24Z] INFO: Using input YAML configuration: config/bcbio_cwl.yaml
[2018-02-26T11:24Z] INFO: Checking sample YAML configuration: config/bcbio_cwl.yaml
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
    args.func(args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/main.py", line 12, in run
    create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 36, in from_world
    prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/create.py", line 425, in prep_cwl
    for cur in workflow.generate(variables, steps, wfoutputs):
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 62, in generate
    inputs, parallel_ids, nested_inputs = _get_step_inputs(step, file_vs, std_vs, parallel_ids)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 168, in _get_step_inputs
    for orig_input in [_get_variable(x, file_vs) for x in _handle_special_inputs(step.inputs, file_vs)]:
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/workflow.py", line 314, in _handle_special_inputs
    assert found_indexes, "Found no snpEff indexes in %s" % [v["id"] for v in variables]
AssertionError: Found no snpEff indexes in ['config__algorithm__align_split_size', 'files', 'reference__fasta__base', 'reference__rtg', 'genome_resources__variation__lcr', 'config__algorithm__save_diskspace', 'config__algorithm__variantcaller', 'genome_resources__rnaseq__transcripts_mask', 'genome_resources__variation__genesplicer', 'postprocess_alignment/config__algorithm__coverage_interval', 'genome_resources__rnaseq__gene_bed', 'genome_resources__variation__qsignature', 'genome_resources__variation__dbscsnv', 'rgnames__lb', 'genome_resources__variation__dbnsfp', 'genome_resources__rnaseq__rRNA_fa', 'rgnames__rg', 'config__algorithm__realign', 'metadata__batch', 'rgnames__lane', 'metadata__phenotype', 'config__algorithm__nomap_split_targets', 'genome_resources__variation__sv_repeat', 'reference__bwa__indexes', 'reference__fastagz', 'genome_resources__variation__train_hapmap', 'config__algorithm__jointcaller', 'config__algorithm__bam_clean', 'config__algorithm__nomap_split_size', 'config__algorithm__validate', 'genome_resources__rnaseq__dexseq', 'config__algorithm__validate_regions', 'genome_resources__rnaseq__transcriptome_index__tophat', 'config__algorithm__aligner', 'genome_resources__variation__train_omni', 'genome_build', 'rgnames__pu', 'genome_resources__variation__train_1000g', 'config__algorithm__recalibrate', 'postprocess_alignment/config__algorithm__coverage', 'genome_resources__rnaseq__transcripts', 'genome_resources__aliases__human', 'config__algorithm__quality_format', 'config__algorithm__tools_off', 'genome_resources__variation__dbsnp', 'config__algorithm__mark_duplicates', 'reference__twobit', 'vrn_file', 'reference__viral', 'genome_resources__version', 'genome_resources__variation__cosmic', 'genome_resources__rnaseq__refflat', 'reference__genome_context', 'config__algorithm__qc', 'analysis', 'rgnames__sample', 'config__algorithm__tools_on', 'config__algorithm__effects', 'postprocess_alignment/config__algorithm__variant_regions', 'genome_resources__aliases__ensembl', 'rgnames__pl', 'genome_resources__variation__train_indels', 'genome_resources__aliases__snpeff', 'config__algorithm__archive', 'alignment_to_rec/alignment_rec', 'postprocess_alignment/align_bam', 'alignment/hla__fastq', 'alignment/work_bam_plus__disc', 'alignment/work_bam_plus__sr', 'prep_samples_to_rec/prep_samples_rec', 'postprocess_alignment/config__algorithm__variant_regions_merged', 'postprocess_alignment/config__algorithm__variant_regions_orig', 'postprocess_alignment/config__algorithm__coverage_merged', 'postprocess_alignment/config__algorithm__coverage_orig', 'postprocess_alignment/config__algorithm__seq2c_bed_ready', 'postprocess_alignment_to_rec/postprocess_alignment_rec', 'postprocess_alignment/regions__callable', 'postprocess_alignment/regions__sample_callable', 'postprocess_alignment/regions__nblock', 'postprocess_alignment/depth__samtools__stats', 'postprocess_alignment/depth__samtools__idxstats', 'postprocess_alignment/depth__variant_regions__regions', 'postprocess_alignment/depth__variant_regions__dist', 'postprocess_alignment/depth__sv_regions__regions', 'postprocess_alignment/depth__sv_regions__dist', 'postprocess_alignment/depth__coverage__regions', 'postprocess_alignment/depth__coverage__dist', 'postprocess_alignment/depth__coverage__thresholds', 'combine_sample_regions/config__algorithm__callable_regions', 'combine_sample_regions/config__algorithm__non_callable_regions', 'combine_sample_regions/config__algorithm__callable_count']

I've tried reinstalling the data using the installer script, and doing a run using snpEff to create any missing indexes, but to no avail.

Also, I've noticed the cwl generator creates tar archives for the directories in the genomes dir. What's the reasoning behind this? Especially when using a local bcbio, this creates a lot of storage overhead.

Any more ideas about the error?

Thanks M

chapmanb commented 6 years ago

Matthias; Sorry about the issues. Can you confirm that your install has snpEff indices for the genome you're trying to use? For hg38 you should have:

/path/to/bcbio/genomes/Hsapiens/hg38/snpeff/GRCh38.86

You'd mentioned earlier that you had manually removed these, so perhaps they're still missing?

Sorry about the space issues with the tarballs. CWL runners have variable ways of dealing with complex directories full of files, so the tarballs help make the workflows portable across multiple locations. Longer term we won't need to keep both CWL and non-CWL methods around, so we'll avoid the duplicated files.

Hope this helps

matthdsm commented 6 years ago

Hi Brad,

Can confirm the files are there, and contain all the .bin files. Any more ideas?

No problem with the tarballs, I just wanted to know what they're for. If at a later stage we wanted to move to CWL only, would it suffice to keep only the tarballs and remove the original directories? Or would this cause errors again?

Thanks M

chapmanb commented 6 years ago

Matthias; Sorry about the continued issues. I'm able to replicate your error when the snpeff directory is missing or otherwise incorrect, but beyond that I'm not sure what is causing it. Would you be able to share a tree of your hg38 directory and the contents of your hg38-resources.yaml file? I wonder if there is some disconnect where we're not picking up the snpEff installation.

Right now you only need the tarballs for CWL runs, not the unpacked directories as well. The pre-built genomes we have available only include these, and I'm planning to work on making those easier for local HPC installs as well.

Sorry again about these issues and hope we can identify the underlying problem.

matthdsm commented 6 years ago

Hi Brad,

I've included a tree of the snpEff dir. Curiously enough, I'm unable to find hg38-resources.yaml. Could this cause the issue? Shouldn't that file be created when updating with the bcbio_nextgen.py upgrade command?

Cheers M

chapmanb commented 6 years ago

Matthias; Thanks much for helping debug. The resources file should be in hg38/seq/hg38-resources.yaml, and I wanted to check whether the aliases: snpeff line matches the GRCh38.86 specification in your snpEff directory. I think there might be a mismatch here that's preventing it from finding the correct directory to prepare. Hope this helps explain the problem.
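To double-check, something along these lines (paths assuming a standard local install under /path/to/bcbio/genomes) should show whether the alias and the directory line up:

grep snpeff /path/to/bcbio/genomes/Hsapiens/hg38/seq/hg38-resources.yaml   # should report the snpeff alias, e.g. GRCh38.86
ls /path/to/bcbio/genomes/Hsapiens/hg38/snpeff/                            # directory name should match that alias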

matthdsm commented 6 years ago

Hi Brad,

The contents of the yaml file:

[login] galaxy:seq $ cat hg38-resources.yaml                                                                                                                                                     [17:19:44]
version: 22

aliases:
  human: true
  snpeff: GRCh38.86
  ensembl: homo_sapiens_merged_vep_91_GRCh38

variation:
  dbsnp: ../variation/dbsnp-150.vcf.gz
  dbnsfp: ../variation/dbNSFP.txt.gz
  dbscsnv: ../variation/dbscSNV.txt.gz
  train_hapmap: ../variation/hapmap_3.3.vcf.gz
  train_omni: ../variation/1000G_omni2.5.vcf.gz
  train_1000g: ../variation/1000G_phase1.snps.high_confidence.vcf.gz
  train_indels: ../variation/Mills_and_1000G_gold_standard.indels.vcf.gz
  cosmic: ../variation/cosmic.vcf.gz
  lcr: ../coverage/problem_regions/repeats/LCR.bed.gz
  sv_repeat: ../coverage/problem_regions/repeats/sv_repeat_telomere_centromere.bed
  qsignature: ../variation/qsignature.txt
  genesplicer: ../variation/genesplicer

rnaseq:
  transcripts: ../rnaseq/ref-transcripts.gtf
  transcripts_mask: ../rnaseq/ref-transcripts-mask.gtf
  transcriptome_index:
    tophat: ../rnaseq/tophat/hg38_transcriptome.ver
  dexseq: ../rnaseq/ref-transcripts.dexseq.gff3
  refflat: ../rnaseq/ref-transcripts.refFlat
  rRNA_fa: ../rnaseq/rRNA.fa

The snpEff versions do seem to match. Any more ideas?

chapmanb commented 6 years ago

Matthias; Thanks for all your patience debugging this. It does look like everything is right with the install, and I think the issue might be different paths for finding the files. There is a new approach we've been working on for CWL that avoids needing all of the Galaxy *.loc files and configuration and instead finds things from the directory structure. This also lets us run on non-local filesystems (DNAnexus, SevenBridges, Arvados) where your reference collection is not on your system. If you add this to your bcbio system YAML file:

local:
  ref: /path/to/bcbio/genomes

does that avoid the problem?
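For context, in a typical locally installed bcbio_system.yaml this is just an extra top-level section alongside the existing entries; a minimal sketch (your file will already have a resources section, and the values here are placeholders):

local:
  ref: /path/to/bcbio/genomes
resources:
  default:
    cores: 16
    memory: 3G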

Alternatively if you remove effects: false from your configuration and re-try, that should add in snpEff indices via the older way.

Do either of those work to avoid the issue? Thanks again for all your patience working through these issues.

matthdsm commented 6 years ago

Hi Brad,

I've added the local variable to the system YAML. This does seem to have resolved the issue for now. The bcbio_vm script has successfully created a tarball of the snpeff dir.

From my point of view it looks like the script is stuck, since there's no output after:

$ bcbio_vm.py cwl config/bcbio_cwl.yaml                                                                                                                                       [17:58:51]
[2018-02-27T17:00Z] INFO: Using input YAML configuration: config/bcbio_cwl.yaml
[2018-02-27T17:00Z] INFO: Checking sample YAML configuration: config/bcbio_cwl.yaml

I know now that the script is creating the tarballs, but it would be helpful to have a bit more verbosity so users know it's still actually doing something. Do you suppose it would be possible to add this?

Thanks for all the help! Much appreciated. I'll let you know if something else comes up.

M

matthdsm commented 6 years ago

Hi Brad,

I was able to successfully create the CWL workflow for our config. I'm now trying to run it using the bcbio_vm.py script with the following command: bcbio_vm.py cwlrun toil bcbio_cwl-workflow -- --batchSystem torque --no-container

However, it seems the --no-container flag is ignored, because the script fails with /bin/sh: 1: docker: not found

Is this an issue with the bcbio_vm script, or should I open an issue in the Toil repo?

Thanks M

chapmanb commented 6 years ago

Matthias; Glad that the change got the CWL compiled, and thanks for the feedback on the lack of reporting during tarring. Hopefully that process will go away when we pre-distribute these genomes, but I added a message for now to avoid any confusion in the short term.

For the run, you want the --no-container flag before the -- separator (flags after -- are passed directly to Toil rather than to bcbio_vm.py):

 bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --batchSystem torque

Sorry for the confusion and hope that fixes it for you.

matthdsm commented 6 years ago

Hi Brad,

Thanks for the hint! It seems to work properly without Docker now. The jobs keep failing, however. I've included the log file; it seems Toil is missing some variables when trying to run.

Thanks again for all the help. M

chapmanb commented 6 years ago

Matthias; Thanks for the detailed report. This looks like a system issue writing to the directory you have set as your temporary directory:

C/X/jobaeRng1      File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/tempfile.py", line 339, in mkdtemp
C/X/jobaeRng1        _os.mkdir(file, 0700)
C/X/jobaeRng1    OSError: [Errno 13] Permission denied: '/tmp/bcbio/tmp7kf__S'

Is it possible TMPDIR and other variables are set to /tmp but you don't have access to write there? Toil and other runners use the temporary directories to stage files so you might have to adjust this to match your system. Thanks again for the work debugging this and hope this helps.
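For reference, a quick check along these lines (illustrative paths on my part, not a bcbio-specific setting) would confirm whether that's the problem and work around it:

ls -ld /tmp/bcbio                     # check ownership and permissions on the existing temporary directory
export TMPDIR=/path/to/writable/tmp   # or fix the permissions on /tmp/bcbio instead
mkdir -p $TMPDIR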

matthdsm commented 6 years ago

Hi Brad,

Sorry to have been MIA on this. The issue was indeed a directory on the /tmp drive with the wrong permissions. This has since been fixed, but now I've stumbled upon a new issue.

It seems that the jobs toil submits to the queue don't carry all the env variables, especially the custom bcbio $PATH. The PATH variable is reset in the jobs to the default value. This causes all jobs to fail due to missing requirements.

I can see that Toil runs with the flag --preserve-environment PATH HOME, which I suppose should preserve those values, but it seems to be ignored. I've tried setting export TOIL_TORQUE_ARGS="-V PATH=$PATH" as specified in the Toil docs, but this just causes the qsub command to throw an error.

Any idea what to do?

Thanks M

matthdsm commented 6 years ago

Hi Brad, After adding the missing deps to the default path, I'm seeing a lot of complaints about missing pickle files:

Traceback (most recent call last):
  File "/Shared/Software/bin/_toil_worker", line 11, in <module>
    load_entry_point('toil==3.5.0a1', 'console_scripts', '_toil_worker')()
  File "/Shared/Software/lib/python2.7/site-packages/toil/worker.py", line 100, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/Shared/Software/lib/python2.7/site-packages/toil/common.py", line 767, in resumeJobStore
    jobStore.resume()
  File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 75, in resume
    super(FileJobStore, self).resume()
  File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/abstractJobStore.py", line 149, in resume
    with self.readSharedFileStream('config.pickle') as fileHandle:
  File "/Shared/Software/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 296, in readSharedFileStream
    with open(os.path.join(self.jobStoreDir, sharedFileName), 'r') as f:
IOError: [Errno 20] Not a directory: '/home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/process_alignment.cwl/config.pickle'

I also get job errors stating

Traceback (most recent call last):
  File "/Shared/Software/bin/_toil_worker", line 11, in <module>
    load_entry_point('toil==3.5.0a1', 'console_scripts', '_toil_worker')()
  File "/Shared/Software/lib/python2.7/site-packages/toil/worker.py", line 100, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/Shared/Software/lib/python2.7/site-packages/toil/common.py", line 767, in resumeJobStore
    jobStore.resume()
  File "/Shared/Software/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 73, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/home/projects/bcbio_annotation/test_runs/cwl/toil_work/CWLWorkflow' does not exist, so there is nothing to restart

This does seem oddly intermittent, because most of the time one job survives and runs (success!). However, it seems there's a bug in the BatchSystem API for Torque: the submitted job only reserves 1 core while using 6 threads (fixed with a PR).

M

matthdsm commented 6 years ago

ref https://github.com/BD2KGenomics/toil/issues/2111

chapmanb commented 6 years ago

Matthias; Thanks for all this detailed testing. For the PATH issue, @brainstorm worked on passing environment variables through from the parent environment, so it should happen within Torque according to the code:

https://github.com/matthdsm/toil/blob/34c126dc1e9e4d3fce1a795cae0656cc266fb7bd/src/toil/batchSystems/torque.py#L173

I've noticed that you have an old version of Toil referenced (toil==3.5.0a1), which might not include these changes. Is it possible your install doesn't have the latest release (3.14.0) and so doesn't include these fixes? Could you try doing:

bcbiovm_conda install -c conda-forge -c bioconda toil

to see if that helps? That also might help with the intermittent file issue. Thanks for the Torque system patch and for all this testing work.

matthdsm commented 6 years ago

Hi Brad,

Thanks for all the help. I've updated everything I could find, including the bcbio_vm.py script and Toil. I ran into an intermediate issue because of a Toil dependency on an older version of dill, but that should be fixed now too.

Sadly, the missing dir issue hasn't been resolved. I've tried regenerating the CWL configs, removing and recreating directories, etc., to no avail.

I've included the bcbio_cwl-toil.log file again. Perhaps you can make something out of it; I really hope so. Also, the issue with the wrong PATH hasn't been resolved by the update.

Thanks again,

Cheers M bcbio_cwl-toil.log toil_job_1.e46946.txt

chapmanb commented 6 years ago

Matthias -- thanks for following up. The latest run does look to be using the latest toil, thanks for updating.

Could you try removing the problem Toil directory (/home/projects/bcbio_annotation/test_runs/cwl/toil_work/) and restarting from scratch? It seems like the run got into a bad state it's not able to recover from. I'm hoping a fresh run will work more cleanly with the latest Toil and avoid these issues. Thanks again for the help debugging.

matthdsm commented 6 years ago

Hi Brad,

Tried that already, it doesn't help :-(. Any more ideas? I'm really at a loss as to what could be wrong.

Thanks M

chapmanb commented 6 years ago

Matthias; Sorry, I'm not exactly sure what to suggest at this point. Some suggestions:

I know Toil has some rough edges right now, so it would be useful if we could identify a reproducible error case and try to fix the underlying issue. Thanks again for the help debugging.

matthdsm commented 6 years ago

Hi Brad,

I just ran a test on a single machine using toil, which failed again. I got the following report:

login 2018-03-13 09:05:10,074 MainThread INFO toil.leader: Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl u/a/job9lHKQm
2018-03-13 09:05:10,074 - toil.leader - INFO - Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl u/a/job9lHKQm
login 2018-03-13 09:08:58,472 MainThread INFO toil.leader: Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl A/Q/jobTDaCja
2018-03-13 09:08:58,472 - toil.leader - INFO - Job ended successfully: 'file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl A/Q/jobTDaCja
login 2018-03-13 09:08:58,516 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
2018-03-13 09:08:58,516 - toil.serviceManager - INFO - Waiting for service manager thread to finish ...
login 2018-03-13 09:08:59,276 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 0.75955915451 seconds
2018-03-13 09:08:59,276 - toil.serviceManager - INFO - ... finished shutting down the service manager. Took 0.75955915451 seconds
login 2018-03-13 09:08:59,277 MainThread INFO toil.statsAndLogging: Waiting for stats and logging collator thread to finish ...
2018-03-13 09:08:59,277 - toil.statsAndLogging - INFO - Waiting for stats and logging collator thread to finish ...
login 2018-03-13 09:09:02,411 MainThread INFO toil.statsAndLogging: ... finished collating stats and logs. Took 3.13325619698 seconds
2018-03-13 09:09:02,411 - toil.statsAndLogging - INFO - ... finished collating stats and logs. Took 3.13325619698 seconds
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/cwltoil", line 11, in <module>
    load_entry_point('toil==3.14.0', 'console_scripts', 'cwltoil')()
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 990, in main
    outobj = toil.start(wf1)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 743, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 1018, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 197, in run
    self.innerLoop()
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 400, in innerLoop
    self.issueJobs(successors)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 571, in issueJobs
    self.issueJob(job)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 550, in issueJob
    jobBatchSystemID = self.batchSystem.issueBatchJob(jobNode)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/batchSystems/singleMachine.py", line 230, in issueBatchJob
    'set to {}.'.format(jobNode.jobName, cores, self.maxCores, self.scale))
AssertionError: The job file:///home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/steps/prep_align_inputs.cwl is requesting 6.0 cores, more than the maximum of 2 cores this batch system was configured with. Scale is set to 1.
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
    args.func(args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 177, in run
    _TOOLS[args.tool](args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 108, in _run_toil
    _run_tool(cmd, not args.no_container, work_dir)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 38, in _run_tool
    subprocess.check_call(cmd, shell=True)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'export PATH=$PATH:/home/galaxy/bcbio-dev/anaconda/bin && cwltoil --jobStore /home/projects/bcbio_annotation/test_runs/cwl/toil_work/cwltoil_jobstore --logFile /home/projects/bcbio_annotation/test_runs/cwl/toil_work/bcbio_cwl-toil.log --workDir /home/projects/bcbio_annotation/test_runs/cwl/toil_work/tmpdir --linkImports --maxCores 2 --no-container --preserve-environment PATH HOME -- /home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/main-bcbio_cwl.cwl /home/projects/bcbio_annotation/test_runs/cwl/bcbio_cwl-workflow/main-bcbio_cwl-samples.json' returned non-zero exit status 1

This seems to me like an issue with the resource requests, which looks strange to me. I started Toil using bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --maxCores 2. Tests omitting the --maxCores flag throw the same error.
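If it's purely the core cap, I suppose raising it to match the largest per-step request might get past this (assuming the machine actually has that many cores free; this is a guess on my part, not a confirmed fix):

bcbio_vm.py cwlrun toil --no-container bcbio_cwl-workflow -- --maxCores 6   # 6 matches the cores requested by prep_align_inputs in the error above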

Next step is a test with the pregenerated CWL from your repo. Cheers M

matthdsm commented 6 years ago

And sadly, another error when running the somatic workflow from test_bcbio_cwl. somatic-toil.log

Traceback:

2018-03-13 09:49:24,407 - toil.leader - INFO - Finished toil run with 5 failed jobs
login 2018-03-13 09:49:24,408 MainThread INFO toil.leader: Failed jobs at end of the run: 'CWLGather' 0/W/jobN7a6pc 'CWLWorkflow' c/K/joblkKODA 'CWLScatter' q/f/jobBZPvWo 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl l/a/job2UuziU 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
2018-03-13 09:49:24,408 - toil.leader - INFO - Failed jobs at end of the run: 'CWLGather' 0/W/jobN7a6pc 'CWLWorkflow' c/K/joblkKODA 'CWLScatter' q/f/jobBZPvWo 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl l/a/job2UuziU 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/cwltoil", line 11, in <module>
    load_entry_point('toil==3.14.0', 'console_scripts', 'cwltoil')()
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 990, in main
    outobj = toil.start(wf1)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 743, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/common.py", line 1018, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/leader.py", line 226, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException: The job store 'file:/home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/cwltoil_jobstore' contains 5 failed jobs: 'CWLGather' 0/W/jobN7a6pc, 'CWLWorkflow' c/K/joblkKODA, 'CWLScatter' q/f/jobBZPvWo, 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/prep_samples_to_rec.cwl' bcbio_nextgen.py runfn prep_samples_to_rec cwl l/a/job2UuziU, 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
=========> Failed job 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.14.0-b91dbf9bf6116879952f0a70f9a2fbbcae7e51b6.
WARNING:toil.fileStore:Starting job c/K/joblkKODA/g/tmpCRN3Tr.tmp with less than 10% of disk space remaining.
Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/worker.py", line 309, in workerScript
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/job.py", line 1328, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/job.py", line 1273, in _run
    return self.run(fileStore)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 462, in run
    (output, status) = cwltool.main.single_job_executor(**opts)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/main.py", line 298, in single_job_executor
    r.run(**kwargs)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/job.py", line 286, in run
    self._setup(kwargs)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/cwltool/job.py", line 158, in _setup
    "file." % (knownfile, self.pathmapper.mapper(knownfile)[0]))
WorkflowException: Input file file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX_1.fq.gz.gbi (at /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/testdata/100326_FC6107FAAXX/7_100326_FC6107FAAXX_1.fq.gz.gbi) not found or is not a regular file.
ERROR:toil.worker:Exiting the worker because of a failed job on host login
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/steps/alignment_to_rec.cwl' bcbio_nextgen.py runfn alignment_to_rec cwl x/T/job6f0j0L with ID x/T/job6f0j0L to 0
<=========

Traceback (most recent call last):
  File "/home/galaxy/bcbio-dev/anaconda/bin/bcbio_vm.py", line 334, in <module>
    args.func(args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 177, in run
    _TOOLS[args.tool](args)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 108, in _run_toil
    _run_tool(cmd, not args.no_container, work_dir)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 38, in _run_tool
    subprocess.check_call(cmd, shell=True)
  File "/home/galaxy/bcbio-dev/anaconda/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'export PATH=$PATH:/home/galaxy/bcbio-dev/anaconda/bin && cwltoil --jobStore /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/cwltoil_jobstore --logFile /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/somatic-toil.log --workDir /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/toil_work/tmpdir --linkImports --no-container --preserve-environment PATH HOME -- /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/main-somatic.cwl /home/projects/bcbio_annotation/test_runs/test_bcbio_cwl/somatic/somatic-workflow/main-somatic-samples.json' returned non-zero exit status 1

matthdsm commented 6 years ago

On a better note, both workflows seem to work with Bunny, so I suppose the issue lies with Toil.
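For the record, the Bunny run was along these lines (from memory, so treat the exact invocation as approximate):

bcbio_vm.py cwlrun bunny bcbio_cwl-workflow   # same generated workflow directory as the Toil runs above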

Cheers M

ohofmann commented 6 years ago

@matthdsm This all sounds fairly familiar. We've tried for a long time to get the bcbio/CWL/Toil combination to work in different HPC environments (SLURM and PBSPro), but have put the effort on hold for now. Between the layers of stacks involved it became a nightmare to track down issues; maybe Toil debugging support has improved by now, and I'm sure it's mostly user error on our side, but for the time being we're looking at Arvados (and hopefully Cromwell down the road) to give this another try.

matthdsm commented 6 years ago

Hi Oliver,

Sorry to hear you haven't had any luck with Toil either. At the moment, Arvados isn't really on the roadmap for us. We are, however, very interested in Cromwell and Funnel. @chapmanb I saw in the bcbio_vm repo that you're working on Funnel integration. What's the status and what are the prerequisites for that?

M

matthdsm commented 6 years ago

Perhaps @ejacox or @DailyDreaming would like to add their two cents on this?

M

DailyDreaming commented 6 years ago

@matthdsm We've had low bandwidth for HPC recently and have been focusing on cloud provisioning. I'd love to see more done with CWL on HPC, as I see that as a great user base that could become a great community, but the current directives we've been given are for robust provisioning on AWS, Google, and Azure.

That said, I've seen a similar error before using cwltoil, and on first inspection (at least previously) it seemed to be a problem with how cwltool was talking to Toil (currently Toil breaks CWL workflows up into chunks and passes them to cwltool to process as jobs, so they pass file paths back and forth; Toil deletes things constantly, though, and I'm not sure the cwltool pathmapper always returns a path that makes sense to Toil, especially with glob). That's my 2 cents, though it may be way off the mark, and this might be a different problem entirely.

Once we're done with the cloud provisioning, I'd definitely like to come back and spend some serious time updating CWL support (and debugging in general).

chapmanb commented 6 years ago

Thanks all for this great discussion. We're definitely still interested in having Toil running on HPC, but are waiting patiently until it's also a priority for the team there. Both that and improved debugging would be a huge help in making Toil CWL a first-class citizen there, but I know they've been busy with cloud support, so I'm happy to explore this more when the team is ready. We can run Toil workflows on HPC, but the major issue is debugging and getting unblocked when things don't work, as Matthias is hitting.

The Funnel support in bcbio was only exploratory and isn't expected to work yet. This would be a cool direction to explore further, but I don't know all the ins and outs of getting Funnel running on an HPC system. Matthias, if you look into this I'd love to hear your feedback.

As Oliver mentioned, Arvados CWL works great and we've scaled it to thousands of samples on Arvados clusters. I know this is a bigger install, but it's an option if you end up exploring that direction.

We're hopeful Cromwell will be available soon for testing with bcbio CWL, and definitely plan to work on that as soon as it's available. The team has been doing a ton of hard work supporting CWL and use Cromwell on HPC systems so it fits their development efforts and our needs well.

Sorry we don't have a single solution here right now but hope that helps explain the current state and directions. Thanks again for this great discussion.

matthdsm commented 6 years ago

Hi Brad, I think we're going to put Toil on the backburner until the team is ready to throw some time at it. We're going to be exploring Funnel a bit, as it's part of the GA4GH ref implementation. If I come across something, I'll let you know.

I'll be keeping an eye on CWL support on Cromwell, but could you give me a ping when you think everything's good to go on the bcbio side?

Thanks everyone for all the ongoing help and development.

Cheers M