bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

VarScan(?) failing at NA12878 #186

Closed mjafin closed 10 years ago

mjafin commented 10 years ago

I noticed a few problems when including VarScan in calling variants in the NA12878 example. The errors look like this:

[2013-11-28 03:31] ukapdlnx118: Genotyping with varscan: ('6', 97599365, 99286786) NA12878-1-sort-6_97599365_99286786-prep.bam
[2013-11-28 03:32] ukapdlnx118: Genotyping with varscan: ('6', 99321863, 100895524) NA12878-1-sort-6_99321863_100895524-prep.bam
[2013-11-28 03:32] ukapdlnx118: Unexpected error
Traceback (most recent call last):
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 33, in _setup_logging
    yield None
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 139, in variantcall_sample
    return apply(genotype.variantcall_sample, *args)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 477, in variantcall_sample
    region, call_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 67, in unified_genotyper
    region, out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 40, in _shared_gatk_call_prep
    region = subset_variant_regions(variant_regions, region, out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 145, in subset_variant_regions
    _subset_bed_by_region(variant_regions, tx_subset_file, safe_region)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 124, in _subset_bed_by_region
    orig_bed.intersect(region_bed).saveas(out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 623, in decorated
    result = method(self, *args, **kwargs)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 216, in wrapped
    check_stderr=check_stderr)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/helpers.py", line 429, in call_bedtools
    raise OSError('See above for commands that gave the error')
OSError: See above for commands that gave the error
[2013-11-28 03:32] ukapdlnx118: Genotyping with varscan: ('6', 100895783, 102483672) NA12878-1-sort-6_100895783_102483672-prep.bam
[2013-11-28 03:32] ukapdlnx118: Genotyping with varscan: ('6', 102502921, 105177845) NA12878-1-sort-6_102502921_105177845-prep.bam
[2013-11-28 03:32] ukapdlnx118: Genotyping with varscan: ('6', 105177942, 106741221) NA12878-1-sort-6_105177942_106741221-prep.bam
[2013-11-28 03:33] ukapdlnx118: Genotyping with varscan: ('6', 106755983, 108366277) NA12878-1-sort-6_106755983_108366277-prep.bam
[2013-11-28 03:34] ukapdlnx118: Genotyping with varscan: ('6', 108370222, 109926087) NA12878-1-sort-6_108370222_109926087-prep.bam
[2013-11-28 03:34] ukapdlnx118: Genotyping with varscan: ('6', 109926688, 111494284) NA12878-1-sort-6_109926688_111494284-prep.bam
[2013-11-28 03:35] ukapdlnx118: Genotyping with varscan: ('6', 111498123, 113544042) NA12878-1-sort-6_111498123_113544042-prep.bam
[2013-11-28 03:36] ukapdlnx118: Genotyping with varscan: ('6', 113835894, 116264566) NA12878-1-sort-6_113835894_116264566-prep.bam
[2013-11-28 03:36] ukapdlnx118: Unexpected error
Traceback (most recent call last):
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 33, in _setup_logging
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 139, in variantcall_sample
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 477, in variantcall_sample
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 67, in unified_genotyper
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 40, in _shared_gatk_call_prep
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 145, in subset_variant_regions
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 124, in _subset_bed_by_region
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 623, in decorated
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 216, in wrapped
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/helpers.py", line 429, in call_bedtools
OSError: See above for commands that gave the error

In the debug log here's what's around the error:

[2013-11-28 03:32] ukapdlnx118: Varscan
[2013-11-28 03:32] ukapdlnx118: Got the following sample list:
[2013-11-28 03:32] ukapdlnx118: NA12878-1
[2013-11-28 03:32] ukapdlnx118: Only variants will be reported
[2013-11-28 03:32] ukapdlnx118: Min coverage:   5
[2013-11-28 03:32] ukapdlnx118: Min reads2:     2
[2013-11-28 03:32] ukapdlnx118: Min var freq:   0.2
[2013-11-28 03:32] ukapdlnx118: Min avg qual:   15
[2013-11-28 03:32] ukapdlnx118: P-value thresh: 0.98
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,395 VariantAnnotator - Processed 57 loci.
[2013-11-28 03:32] ukapdlnx118:
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,440 ProgressMeter -            done        1.57e+02    1.0 s        2.0 h     98.5%         1.0 s     0.0 s
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,441 ProgressMeter - Total runtime 1.16 secs, 0.02 min, 0.00 hours
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,441 MicroScheduler - 92 reads were filtered out during the traversal out of approximately 1786 total reads (5.15%)
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,441 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,442 MicroScheduler -   -> 92 reads (5.15% of total) failing DuplicateReadFilter
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,442 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,442 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,442 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
[2013-11-28 03:32] ukapdlnx118: INFO  03:32:05,442 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
[2013-11-28 03:32] ukapdlnx118: Unexpected error
Traceback (most recent call last):
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 33, in _setup_logging
    yield None
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 139, in variantcall_sample
    return apply(genotype.variantcall_sample, *args)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 477, in variantcall_sample
    region, call_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 67, in unified_genotyper
    region, out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 40, in _shared_gatk_call_prep
    region = subset_variant_regions(variant_regions, region, out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 145, in subset_variant_regions
    _subset_bed_by_region(variant_regions, tx_subset_file, safe_region)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/pipeline/shared.py", line 124, in _subset_bed_by_region
    orig_bed.intersect(region_bed).saveas(out_file)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 623, in decorated
    result = method(self, *args, **kwargs)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 216, in wrapped
    check_stderr=check_stderr)
:  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/pybedtools/helpers.py", line 429, in call_bedtools
    raise OSError('See above for commands that gave the error')
OSError: See above for commands that gave the error
[2013-11-28 03:32] ukapdlnx118: Reading input from STDIN
mjafin commented 10 years ago

I can actually see core dumps following these errors. Maybe it's something with our system?

chapmanb commented 10 years ago

Miika; Thanks for the report. Are you seeing this with the NA12878 whole genome, or exome example? The underlying cause appears to be a bedtools failure. I wasn't able to reproduce with the whole genome example on my system after a quick test. What version of bedtools do you have? I tested with v2.17.0.

I do see some other errors from VarScan processing due to VarScan producing non-GATC alleles that make GATK unhappy. I'll work on fixing those but this appears unrelated to your problem, so I'd like to narrow that down.

mjafin commented 10 years ago

Thanks Brad, running it now in local mode with 2 CPUs only and it's chugging along. Here are the settings I'm using

upload:
  dir: ../final
details:
  - files: [/ngs/public_data/NA12878/NA12878-NGv3-LAB1360-A_1.fastq.gz, /ngs/public_data/NA12878/NA12878-NGv3-LAB1360-A_2.fastq.gz]
    description: NA12878-1
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: 5000000
      variantcaller: [gatk, freebayes, varscan, gatk-haplotype]
      quality_format: Standard
      coverage_interval: regional
      merge_bamprep: true
      validate: /ngs/public_data/NA12878/NISTIntegratedCalls_13datasets_130719_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.17_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf
      validate_regions: /ngs/public_data/NA12878/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed
      variant_regions: /ngs/public_data/NA12878/NGv3.bed
      ensemble:
        format-filters: [DP < 4]
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
        classifier-params:
          type: svm
        trusted-pct: 0.5
  - files: [/ngs/public_data/NA12878/NA12878-NGv3-LAB1360-A_1.fastq.gz, /ngs/public_data/NA12878/NA12878-NGv3-LAB1360-A_2.fastq.gz]
    description: NA12878-2
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: 5000000
      mark_duplicates: samtools
      recalibrate: false
      realign: false
      variantcaller: [gatk, freebayes, varscan, gatk-haplotype]
      quality_format: Standard
      coverage_interval: regional
      merge_bamprep: false
      validate: /ngs/public_data/NA12878/NISTIntegratedCalls_13datasets_130719_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.17_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf
      validate_regions: /ngs/public_data/NA12878/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed
      variant_regions: /ngs/public_data/NA12878/NGv3.bed
      ensemble:
        format-filters: [DP < 4]
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
        classifier-params:
          type: svm
        trusted-pct: 0.5

I was previously running using 8 cpus and SGE. I suspect it may have something to do with maximum user processes but not sure (both file handles and max user processes have been bumped up quite a bit some time ago so not sure). If the local run finishes fine I'll restart in the queues to see if it's reproducible.

Bedtools is v2.17.0.

chapmanb commented 10 years ago

Miika; Ah, gotcha. That makes sense. I just fixed some problems with leaking file handles this morning so it definitely could be that. I found a place where we were not closing a ZeroMQ context which led to extra connections being held. If it works fine without IPython this is probably the issue. I'm hoping to roll the final 0.7.5 release with these fixes tomorrow after closing remaining issues. Sorry to make you chase down the problem.

mjafin commented 10 years ago

Thanks Brad! I'll ask Jakub to upgrade to 0.7.5 final when it's out and will continue stress testing our setup. I'll let you know if I encounter something like this again

Happy Thanksgiving!

mjafin commented 10 years ago

One more (still pre 0.7.5):

[2013-11-30 00:13]  Genotyping with varscan: ('14', 66734393, 68290593) NA12878-2-sort-14_66734393_68290593-prep.bam
[2013-11-30 00:13]  mpileup for Varscan
[2013-11-30 00:13]  [mpileup] 1 samples in 1 input files
[2013-11-30 00:13]  <mpileup> Set max per-file depth to 8000
[2013-11-30 00:13]  Varscan
[2013-11-30 00:13]  cat: write error: Broken pipe
[2013-11-30 00:13]  Uncaught exception occurred
Traceback (most recent call last):
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 22, in run
    _do_run(cmd, checks)
  File "/apps/bcbio-nextgen/0.7.5a/rhel6-x64/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 114, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; cat /ngs/RDI/Analysis/NA12878_validation/work2/varscan/14/tx/tmpUXn00t/NA12878-2-sort-14_66734393_68290593-prep-variants-raw.mpileup | java -Xms750m -Xmx8g -Duser.language=en -Duser.country=US -jar /apps/bcbio-nextgen/0.7.5a/rhel6-x64/share/java/varscan/VarScan.v2.3.6.jar mpileup2cns --min-coverage 5 --p-value 0.98   --vcf-sample-list /ngs/RDI/Analysis/NA12878_validation/work2/varscan/14/tx/tmpUXn00t/NA12878-2-sort-14_66734393_68290593-prep-variants-raw-sample_list.txt --output-vcf --variants > /ngs/RDI/Analysis/NA12878_validation/work2/varscan/14/tx/tmpUXn00t/NA12878-2-sort-14_66734393_68290593-prep-variants-raw.vcf
cat: write error: Broken pipe
chapmanb commented 10 years ago

Miika; I hope that the file handle fixes in 0.7.5 will remove this problem as well. A lot of these error types were due to being unable to get an open file handle, so they manifest in confusing ways. If you still have problems with the latest versions please do re-open and we can look into it more.