Errors with CWL and joint VC

amizeranschi commented 5 years ago

Hello,

I'm noticing some errors when running joint calling with multiple variant callers and ensemble mode. The input files are the same as here: https://github.com/bcbio/bcbio-nextgen/issues/2688#issue-412394726

The difference is in the variant caller setup and in using CWL with a local bcbio_nextgen installation instead of running bcbio_nextgen directly. I'm really interested in the ensemble multi-VC scenario with joint genotyping.

CWL ran fine with one VC (so far tested with HaplotypeCaller and Strelka2). When using both callers and enabling ensemble mode (numpass: 2), I get the following error:

Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 57, in process
    out = fn(*fnargs)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 348, in batch_for_ensemble
    return ensemble.batch(*args)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/ensemble.py", line 32, in batch
    samples = [utils.to_single_data(x) for x in samples]
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 157, in to_single_data
    assert isinstance(input, dict), input
AssertionError: [{'dirs': {'work': '/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/execution'}, 'rgnames': {'sample': u'HG00096'}, u'vrn_file': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-838467116/HG00096.vcf.gz', u'description': u'HG00096', u'reference': {u'snpeff': {u'GRCh38_86': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/852345369/snpeff--GRCh38.86-wf.tar.gz'}, u'genome_context': [u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1262387051/LCR.bed.gz', u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1262387051/polyx.bed.gz', u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1262387051/rmsk.gtf.gz'], u'fasta': {u'base': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/34938063/hg38.fa'}, u'rtg': 
u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/852345369/rtg--hg38.sdf-wf.tar.gz'}, u'batch_samples': [u'HG00096'], 'output_cwl_keys': {'ensemble_prep_rec': ['batch_id', 'variants__calls', 'variants__variantcallers', 'resources', 'description', 'batch_samples', 'validate__summary', 'validate__tp', 'validate__fp', 'validate__fn', 'vrn_file', 'metadata__phenotype', 'genome_resources__variation__1000g', 'genome_resources__variation__train_hapmap', 'config__algorithm__min_allele_fraction', 'config__algorithm__validate_regions', 'config__algorithm__tools_on', 'config__algorithm__variant_regions', 'config__algorithm__variantcaller', 'reference__snpeff__GRCh38_86', 'genome_resources__variation__clinvar', 'reference__genome_context', 'metadata__batch', 'genome_build', 'genome_resources__aliases__human', 'genome_resources__variation__encode_blacklist', 'genome_resources__variation__dbsnp', 'config__algorithm__ensemble', 'config__algorithm__effects', 'genome_resources__aliases__ensembl', 'config__algorithm__exclude_regions', 'reference__fasta__base', 'genome_resources__variation__exac', 'genome_resources__variation__gnomad_exome', 'config__algorithm__coverage_interval', 'genome_resources__variation__polyx', 'genome_resources__variation__cosmic', 'config__algorithm__vcfanno', 'genome_resources__aliases__snpeff', 'reference__rtg', 'genome_resources__variation__lcr', 'genome_resources__variation__esp', 'config__algorithm__validate', 'config__algorithm__tools_off', 'analysis', 'genome_resources__variation__train_indels', 'config__algorithm__variant_regions_merged', 'regions__sample_callable', 'config__algorithm__callable_regions', 'vrn_file_joint']}, u'vrn_file_joint': 
u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1051353122/testingVC-annotated-nomissingalt-filterSNP-filterINDEL.vcf.gz', u'analysis': u'variant2', u'regions': {u'sample_callable': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1174113760/HG00096-sort-callable_sample.bed'}, u'genome_build': u'hg38', 'cwl_keys': [u'metadata__phenotype', u'config__algorithm__validate', u'reference__fasta__base', u'reference__rtg', u'genome_resources__variation__lcr', u'vrn_file_joint', u'validate__summary', u'genome_resources__variation__exac', u'reference__snpeff__GRCh38_86', u'genome_resources__variation__1000g', u'regions__sample_callable', u'config__algorithm__coverage_interval', u'genome_resources__variation__train_hapmap', u'validate__fp', u'genome_resources__variation__clinvar', u'genome_resources__variation__esp', u'reference__genome_context', u'metadata__batch', u'batch_samples', u'config__algorithm__min_allele_fraction', u'validate__tp', u'validate__fn', u'resources', u'config__algorithm__callable_regions', u'description', u'config__algorithm__validate_regions', u'config__algorithm__variantcaller', u'genome_build', u'config__algorithm__exclude_regions', u'genome_resources__aliases__human', u'genome_resources__variation__encode_blacklist', u'config__algorithm__tools_off', u'genome_resources__variation__dbsnp', u'vrn_file', u'genome_resources__variation__polyx', u'genome_resources__variation__cosmic', u'config__algorithm__ensemble', u'config__algorithm__vcfanno', u'analysis', u'config__algorithm__variant_regions_merged', u'config__algorithm__tools_on', u'config__algorithm__effects', u'config__algorithm__variant_regions', 
u'genome_resources__aliases__ensembl', u'genome_resources__variation__gnomad_exome', u'genome_resources__variation__train_indels', u'genome_resources__aliases__snpeff'], u'genome_resources': {u'variation': {u'clinvar': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/747486603/clinvar.vcf.gz', u'polyx': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1262387051/polyx.bed.gz', u'lcr': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1262387051/LCR.bed.gz', u'encode_blacklist': None, u'esp': None, u'exac': None, u'cosmic': None, u'gnomad_exome': None, u'1000g': None, u'train_hapmap': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/747486603/hapmap_3.3.vcf.gz', u'dbsnp': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/747486603/dbsnp-151.vcf.gz', u'train_indels': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/747486603/Mills_and_1000G_gold_standard.indels.vcf.gz'}, u'aliases': {u'snpeff': u'GRCh38.86', u'human': True, u'ensembl': u'homo_sapiens_merged_vep_95_GRCh38'}}, u'validate': {u'fp': None, u'fn': None, 
u'tp': None, u'summary': None}, u'config': {'resources': {u'default': {'cores': 1, 'jvm_opts': ['-Xms1000m', '-Xmx3072m'], 'memory': '3072M'}}, u'algorithm': {u'exclude_regions': [], u'min_allele_fraction': 10, 'num_cores': 1, u'vcfanno': [], u'variantcaller': u'gatk-haplotype', u'validate_regions': None, u'tools_off': [u'gemini', u'contamination', u'peddy'], u'variant_regions_merged': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-2011195376/cleaned-variant_regions-merged.bed', u'ensemble': u'{"numpass": 2}', u'variant_regions': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-2011195376/cleaned-variant_regions.bed', u'coverage_interval': u'amplicon', u'tools_on': [u'gvcf'], u'validate': None, u'callable_regions': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1986227072/testingVC-analysis_blocks.bed', u'effects': False}}, u'resources': {u'default': {'cores': 1, 'jvm_opts': ['-Xms1000m', '-Xmx3072m'], 'memory': '3072M'}}, u'metadata': {u'phenotype': u'', u'batch': u'testingVC'}}, {'dirs': {'work': '/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/execution'}, 'rgnames': {'sample': u'HG00097'}, u'vrn_file': 
u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/1534265734/HG00097.vcf.gz', u'description': u'HG00097', u'reference': {u'snpeff': {u'GRCh38_86': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/1745172351/snpeff--GRCh38.86-wf.tar.gz'}, u'genome_context': [u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-475581783/LCR.bed.gz', u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-475581783/polyx.bed.gz', u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-475581783/rmsk.gtf.gz'], u'fasta': {u'base': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1455071080/hg38.fa'}, u'rtg': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/1745172351/rtg--hg38.sdf-wf.tar.gz'}, u'batch_samples': [u'HG00097'], 'output_cwl_keys': {'ensemble_prep_rec': ['batch_id', 'variants__calls', 'variants__variantcallers', 'resources', 'description', 'batch_samples', 'validate__summary', 
'validate__tp', 'validate__fp', 'validate__fn', 'vrn_file', 'metadata__phenotype', 'genome_resources__variation__1000g', 'genome_resources__variation__train_hapmap', 'config__algorithm__min_allele_fraction', 'config__algorithm__validate_regions', 'config__algorithm__tools_on', 'config__algorithm__variant_regions', 'config__algorithm__variantcaller', 'reference__snpeff__GRCh38_86', 'genome_resources__variation__clinvar', 'reference__genome_context', 'metadata__batch', 'genome_build', 'genome_resources__aliases__human', 'genome_resources__variation__encode_blacklist', 'genome_resources__variation__dbsnp', 'config__algorithm__ensemble', 'config__algorithm__effects', 'genome_resources__aliases__ensembl', 'config__algorithm__exclude_regions', 'reference__fasta__base', 'genome_resources__variation__exac', 'genome_resources__variation__gnomad_exome', 'config__algorithm__coverage_interval', 'genome_resources__variation__polyx', 'genome_resources__variation__cosmic', 'config__algorithm__vcfanno', 'genome_resources__aliases__snpeff', 'reference__rtg', 'genome_resources__variation__lcr', 'genome_resources__variation__esp', 'config__algorithm__validate', 'config__algorithm__tools_off', 'analysis', 'genome_resources__variation__train_indels', 'config__algorithm__variant_regions_merged', 'regions__sample_callable', 'config__algorithm__callable_regions', 'vrn_file_joint']}, u'vrn_file_joint': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1051353122/testingVC-annotated-nomissingalt-filterSNP-filterINDEL.vcf.gz', u'analysis': u'variant2', u'regions': {u'sample_callable': 
u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-851074267/HG00097-sort-callable_sample.bed'}, u'genome_build': u'hg38', 'cwl_keys': [u'metadata__phenotype', u'config__algorithm__validate', u'reference__fasta__base', u'reference__rtg', u'genome_resources__variation__lcr', u'vrn_file_joint', u'validate__summary', u'genome_resources__variation__exac', u'reference__snpeff__GRCh38_86', u'genome_resources__variation__1000g', u'regions__sample_callable', u'config__algorithm__coverage_interval', u'genome_resources__variation__train_hapmap', u'validate__fp', u'genome_resources__variation__clinvar', u'genome_resources__variation__esp', u'reference__genome_context', u'metadata__batch', u'batch_samples', u'config__algorithm__min_allele_fraction', u'validate__tp', u'validate__fn', u'resources', u'config__algorithm__callable_regions', u'description', u'config__algorithm__validate_regions', u'config__algorithm__variantcaller', u'genome_build', u'config__algorithm__exclude_regions', u'genome_resources__aliases__human', u'genome_resources__variation__encode_blacklist', u'config__algorithm__tools_off', u'genome_resources__variation__dbsnp', u'vrn_file', u'genome_resources__variation__polyx', u'genome_resources__variation__cosmic', u'config__algorithm__ensemble', u'config__algorithm__vcfanno', u'analysis', u'config__algorithm__variant_regions_merged', u'config__algorithm__tools_on', u'config__algorithm__effects', u'config__algorithm__variant_regions', u'genome_resources__aliases__ensembl', u'genome_resources__variation__gnomad_exome', u'genome_resources__variation__train_indels', u'genome_resources__aliases__snpeff'], u'genome_resources': {u'variation': {u'clinvar': 
u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1273576180/clinvar.vcf.gz', u'polyx': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-475581783/polyx.bed.gz', u'lcr': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-475581783/LCR.bed.gz', u'encode_blacklist': None, u'esp': None, u'exac': None, u'cosmic': None, u'gnomad_exome': None, u'1000g': None, u'train_hapmap': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1273576180/hapmap_3.3.vcf.gz', u'dbsnp': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1273576180/dbsnp-151.vcf.gz', u'train_indels': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-1273576180/Mills_and_1000G_gold_standard.indels.vcf.gz'}, u'aliases': {u'snpeff': u'GRCh38.86', u'human': True, u'ensembl': u'homo_sapiens_merged_vep_95_GRCh38'}}, u'validate': {u'fp': None, u'fn': None, u'tp': None, u'summary': None}, u'config': {'resources': {u'default': {'cores': 1, 'jvm_opts': ['-Xms1000m', '-Xmx3072m'], 'memory': '3072M'}}, u'algorithm': {u'exclude_regions': [], u'min_allele_fraction': 10, 
'num_cores': 1, u'vcfanno': [], u'variantcaller': u'gatk-haplotype', u'validate_regions': None, u'tools_off': [u'gemini', u'contamination', u'peddy'], u'variant_regions_merged': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-265338074/cleaned-variant_regions-merged.bed', u'ensemble': u'{"numpass": 2}', u'variant_regions': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/-265338074/cleaned-variant_regions.bed', u'coverage_interval': u'amplicon', u'tools_on': [u'gvcf'], u'validate': None, u'callable_regions': u'/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/1663c452-f25c-407a-850b-949763fdafd3/call-batch_for_ensemble/inputs/944175185/testingVC-analysis_blocks.bed', u'effects': False}}, u'resources': {u'default': {'cores': 1, 'jvm_opts': ['-Xms1000m', '-Xmx3072m'], 'memory': '3072M'}}, u'metadata': {u'phenotype': u'', u'batch': u'testingVC'}}]
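For readers skimming the dump above: the value that fails the assertion is a list holding two sample dicts (HG00096 and HG00097), while `to_single_data` asserts that it receives a dict. The following is a minimal sketch of that failure mode. `to_single_data_sketch` and `check` are hypothetical stand-ins, not the bcbio code. Only the `assert` line is taken from the traceback, and the unwrapping of one-element lists is an assumption.

```python
def to_single_data_sketch(value):
    """Hypothetical stand-in for bcbio.utils.to_single_data."""
    # Assumption: a one-element list is unwrapped to its only item.
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return value[0]
    # This check mirrors the assert line shown in the traceback.
    assert isinstance(value, dict), value
    return value


# Shape seen in the traceback: one entry that holds two sample dicts.
batch_entry = [
    {"description": "HG00096", "batch_samples": ["HG00096"]},
    {"description": "HG00097", "batch_samples": ["HG00097"]},
]


def check(samples):
    """Return the sample descriptions, or the name of the error raised."""
    try:
        return [to_single_data_sketch(x)["description"] for x in samples]
    except AssertionError:
        return "AssertionError"


print(check([batch_entry]))  # nested list: the assertion fails
print(check(batch_entry))    # flat list of dicts: passes
```

If this reading is right, the CWL step hands `ensemble.batch` one nesting level more than it expects when both samples share a batch.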

When trying to run the same analysis with 4 callers ([gatk-haplotype, strelka2, freebayes, samtools]) and the corresponding jointcallers and ensemble mode (numpass: 3), the outcome is a different error:

Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 57, in process
    out = fn(*fnargs)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 396, in run_jointvc
    return joint.run_jointvc(*args)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/joint.py", line 64, in run_jointvc
    joint_out = square_batch_region(data, ready_region, [], [d["vrn_file"] for d in items], out_file)[0]
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/joint.py", line 179, in square_batch_region
    _square_batch_bcbio_variation(data, region, bam_files, vrn_files, out_file, "square")
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/joint.py", line 235, in _square_batch_bcbio_variation
    do.run(cmd, "%s in region: %s" % (cmd, bamprep.region_to_gatk(region)), env=bcbio_env)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; bcbio-variation-recall square -Xms1000m -Xmx3072m -XX:+UseSerialGC -c 1 -r chr7:1-141973873 --caller samtools /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5d82031f-4018-45c9-9e35-99a395d2a24a/call-jointcall/shard-3/wf-jointcall.cwl/fadbfa37-c1d1-4689-a3d1-c5aa3fd4b415/call-run_jointvc/shard-0/execution/joint/samtools/chr7_1-141973873/testingVC-samtools-chr7_1-141973873.vcf.gz /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5d82031f-4018-45c9-9e35-99a395d2a24a/call-jointcall/shard-3/wf-jointcall.cwl/fadbfa37-c1d1-4689-a3d1-c5aa3fd4b415/call-run_jointvc/shard-0/inputs/-1010179002/hg38.fa /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5d82031f-4018-45c9-9e35-99a395d2a24a/call-jointcall/shard-3/wf-jointcall.cwl/fadbfa37-c1d1-4689-a3d1-c5aa3fd4b415/call-run_jointvc/shard-0/execution/joint/samtools/chr7_1-141973873/testingVC-samtools-chr7_1-141973873.vcf-inputs.txt
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5d82031f-4018-45c9-9e35-99a395d2a24a/call-jointcall/shard-3/wf-jointcall.cwl/fadbfa37-c1d1-4689-a3d1-c5aa3fd4b415/call-run_jointvc/shard-0/tmp.974eea3b
               bcbio.variation.recall.main.main
                                            ...
              bcbio.variation.recall.main/-main    main.clj:  30
              bcbio.variation.recall.main/-main    main.clj:  33
           bcbio.variation.recall.main/-main/fn    main.clj:  34
                             clojure.core/apply    core.clj: 665
                                            ...
            bcbio.variation.recall.square/-main  square.clj: 315
            bcbio.variation.recall.square/-main  square.clj: 343
     bcbio.variation.recall.square/combine-vcfs  square.clj: 290
bcbio.variation.recall.square/sample-to-bam-map  square.clj: 259
                    bcbio.run.fsp/add-file-part     fsp.clj:  35
                       bcbio.run.fsp/split-ext+     fsp.clj:  14
java.lang.NullPointerException:
' returned non-zero exit status 1

naumenko-sa commented 5 years ago

Hi @amizeranschi!

I think joint genotyping of many samples and ensemble calling are two different stories. You either do joint genotyping for more than 100 samples with gatk or freebayes, or combine callers for a few samples at most. With many variant callers you can tune precision/sensitivity when you are working with individual samples or trios. Joint calling is for very large cohorts: there you use just one tool to call, and then combine the calls with joint genotyping.

Sergey

amizeranschi commented 5 years ago

Hi Sergey,

Thanks for your answer, but I'm not sure I fully understand. Why can't we do joint genotyping and also create an ensemble VCF file at the end?

Would it not make sense to run joint genotyping with two callers (e.g. gatk and freebayes), produce a joint VCF file from each caller, and then create a consensus (ensemble) VCF file from those single-caller VCF files? Wouldn't this be similar to what was done in the past with ensemble calling on pooled samples, before joint genotyping was introduced for large cohorts (>100 samples)?
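The consensus step proposed here can be illustrated with a toy sketch: given one joint call set per caller, keep the variants reported by at least `numpass` callers. This is only an illustration of the idea, not how bcbio implements ensemble calling. The function name, the variant tuples, and the positions are made up, and a real implementation also has to normalize and harmonize the records.

```python
from collections import Counter


def ensemble_consensus(calls_by_caller, numpass):
    """Keep variants reported by at least `numpass` callers."""
    counts = Counter()
    for variants in calls_by_caller.values():
        # set() makes sure each caller counts once per variant.
        counts.update(set(variants))
    return sorted(v for v, n in counts.items() if n >= numpass)


# Hypothetical joint call sets, keyed by (chrom, pos, ref, alt).
joint_calls = {
    "gatk-haplotype": {("chr7", 1000, "A", "G"), ("chr7", 2000, "C", "T")},
    "freebayes": {("chr7", 1000, "A", "G"), ("chr7", 3000, "G", "A")},
}

consensus = ensemble_consensus(joint_calls, numpass=2)
print(consensus)
```

With `numpass=2` only the variant seen by both callers survives. Lowering it to 1 gives the union of the two call sets.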

amizeranschi commented 5 years ago

I forgot to add, this setup (joint VC with ensemble VCF creation at the end) is already working in bcbio when not using CWL. The errors reported here only seem to happen when using CWL.

The only apparent problem with joint genotyping + ensemble calling (when not using CWL) is that the VCF flag CALLERS also lists the joint callers along with the variant callers (e.g. strelka2-joint along with strelka2), as I mentioned in https://github.com/bcbio/bcbio-nextgen/issues/2688#issue-412394726. It would be great if that could be fixed as well, so that the joint callers don't get listed.
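To make the requested behaviour concrete, here is a minimal post-processing sketch. It assumes that CALLERS is stored as a comma-separated value in the VCF INFO column and that joint callers always carry the `-joint` suffix (as in strelka2-joint). The function name and the example INFO string are hypothetical, and nothing here is part of bcbio.

```python
def strip_joint_callers(info):
    """Drop '-joint' entries from the CALLERS key of a VCF INFO string."""
    fields = []
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        if key == "CALLERS" and sep:
            # Keep only the callers whose name does not end in '-joint'.
            kept = [c for c in value.split(",") if not c.endswith("-joint")]
            field = "CALLERS=" + ",".join(kept)
        fields.append(field)
    return ";".join(fields)


cleaned = strip_joint_callers(
    "DP=42;CALLERS=strelka2,strelka2-joint,gatk-haplotype"
)
print(cleaned)
```

Other INFO keys pass through unchanged, so the function could be applied to each record of an ensemble VCF while leaving the header untouched.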

chapmanb commented 5 years ago

Thanks for this really great discussion. I agree with Sergey's assessment that joint calling + ensemble is not a current focus of bcbio. Practically, I'd recommend sticking with a single joint calling germline approach with GATK4 for a few reasons:

- The GATK4 approach is well validated, performant and freely available. We haven't been working on the other joint methods, which were more useful in the past when GATK was restricted access.
- If you have enough samples to joint call with, the development focus should be on making that as fast and scalable as possible.
- It's very difficult to find good large datasets for validation of joint ensemble calling. We don't currently have a good sense whether a joint approach helps and the ensemble method hasn't been validated and tuned for large populations.

Is ensemble joint calling providing variants that a single GATK4 run is missing? If so, it would be nice to formalize this and figure out how best to support and improve it. Thanks again for the thoughts and suggestions.

amizeranschi commented 5 years ago

Thanks for the reply. In past experiments I have found variants that seemed to be true positives and were missed by individual tools (incl. HaplotypeCaller) but found by others. An ensemble approach with e.g. 3 out of 5 tools produced a better set of high-confidence variants than any individual variant caller did. I'm afraid I don't have a reference or any more details about this at the moment, but this is the reason I was interested in the multi-VC, ensemble approach.

Would the ensemble approach need additional validation when using joint calling instead of the old population calling? What exactly would this require?

I see your point about the runtime and scalability issues when calling variants on a large number of samples with multiple tools. However, I was hoping to leave this worry up to the researchers themselves. For those with enough time and resources on their hands, an ensemble approach might just provide a set of higher-confidence variants when identified with multiple tools, compared to putting all your "trust" into a single tool.

amizeranschi commented 5 years ago

As a side question, what would be the limit, in terms of number of samples, for using population calling instead of joint calling? The bcbio documentation states: "Joint calling is only needed for larger input sample sizes (>100 samples), otherwise use standard pooled Population calling."

Would bcbio using CWL be able to parallelize population calling for a large number of samples, using the two settings nomap_split_size and nomap_split_targets? What could be a good strategy for doing this? And would ensemble, multi-tool calling still work with standard population calling instead of joint calling?

chapmanb commented 5 years ago

Thanks for all this helpful discussion. I definitely agree that there is potential benefit to ensemble methods. So far we haven't had a good dataset with truth calls to demonstrate that it helps enough to overcome the investment in both computational time running multiple callers and development time in tuning and tweaking ensemble calling on large runs. Ensemble output also has the practical downside of having non-harmonized VCFs since the calls come from multiple callers, which requires additional work. We just haven't had the time yet to validate and confirm larger population ensemble calling with CWL; hope this helps explain the current state.

Pooled calling might work better practically for ensemble if you don't have very large sample sizes. It does suffer from the same scientific issue of not being tuned and optimized, since we've mostly focused ensemble testing on smaller pools where you don't have the benefit of informing from a larger sample population as part of the calling algorithm.

Sorry to not have this finished and fully validated, and thanks again for the discussion.

roryk commented 5 years ago

Thanks everyone! Looks like the suggestion to stick with joint calling with GATK is the way to go here. Please reopen if this isn't going to work for you all.

bcbio / bcbio-nextgen

Errors with CWL and joint VC #2689