bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

tumor-normal varscan pipeline runs with error #491

Closed kspham closed 10 years ago

kspham commented 10 years ago

Hello Brad, bcbio-nextgen gave errors when used with VarScan on tumor-normal paired data.

  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/samtools.py", line 40, in shared_variantcall
    tx_out_file)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/varscan.py", line 142, in _varscan_paired
    region=target_regions)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/vcfutils.py", line 283, in combine_variant_files
    do.run(cmd, "Combine variant files")
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 122, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command '/usr/local/bin/gatk-framework -Xms750m -Xmx2500m -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T CombineVariants -R /usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --out /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/tx/tmpr8a0zm/2your-arbitrary-batch-name-3_59493180_62833181-raw.vcf --variant:v0 /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/2your-arbitrary-batch-name-3_59493180_62833181-raw.snp.vcf.gz --variant:v1 /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/2your-arbitrary-batch-name-3_59493180_62833181-raw.indel.vcf.gz --rod_priority_list v0,v1 --suppressCommandLineHeader --setKey null -L /media/proj2/schi/2s2ns/work/varscan/3/2your-arbitrary-batch-name-3_59493180_62833181-raw-regions.bed --interval_set_rule INTERSECTION
INFO 08:45:38,814 HelpFormatter - ---------------------------------------------------------------------------------
INFO 08:45:38,816 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-11-g7e610ad, Compiled 2014/05/15 11:37:54
INFO 08:45:38,816 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 08:45:38,817 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 08:45:38,820 HelpFormatter - Program Args: -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T CombineVariants -R /usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --out /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/tx/tmpr8a0zm/2your-arbitrary-batch-name-3_59493180_62833181-raw.vcf --variant:v0 /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/2your-arbitrary-batch-name-3_59493180_62833181-raw.snp.vcf.gz --variant:v1 /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/2your-arbitrary-batch-name-3_59493180_62833181-raw.indel.vcf.gz --rod_priority_list v0,v1 --suppressCommandLineHeader --setKey null -L /media/proj2/schi/2s2ns/work/varscan/3/2your-arbitrary-batch-name-3_59493180_62833181-raw-regions.bed --interval_set_rule INTERSECTION
INFO 08:45:38,824 HelpFormatter - Executing as snow@son on Linux 3.13.0-29-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_55-b14.
INFO 08:45:38,824 HelpFormatter - Date/Time: 2014/07/08 08:45:38
INFO 08:45:38,824 HelpFormatter - ---------------------------------------------------------------------------------
INFO 08:45:38,824 HelpFormatter - ---------------------------------------------------------------------------------
INFO 08:45:38,889 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:45:38,977 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 08:45:39,042 IntervalUtils - Processing 3339896 bp from intervals
WARN 08:45:39,042 IndexDictionaryUtils - Track v0 doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 08:45:39,042 IndexDictionaryUtils - Track v1 doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 08:45:39,101 GenomeAnalysisEngine - Preparing for traversal
INFO 08:45:39,103 GenomeAnalysisEngine - Done preparing for traversal
INFO 08:45:39,104 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 08:45:39,104 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-11-g7e610ad):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: The provided VCF file is malformed at approximately line number 2529: unparsable vcf record with allele M, for input source: /media/proj2/schi/2s2ns/work/varscan/3/tx/tmpFz2Ea6/2your-arbitrary-batch-name-3_59493180_62833181-raw.snp.vcf.gz
ERROR ------------------------------------------------------------------------------------------

' returned non-zero exit status 1

I checked the vcf.gz but it doesn't exist.

Thanks much!

chapmanb commented 10 years ago

Son; It looks like VarScan is outputting variants with non-GATC alleles. There was a fix for this in the non-paired VarScan cleanup, and I added it to paired VarScan analyses as well. If you update to the latest development version, this should hopefully fix the issue. Thanks much for the report.
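For readers following along, here is a minimal sketch of the kind of cleanup described above: dropping VCF records whose REF or ALT alleles contain anything other than G, A, T or C (such as the ambiguity code M that GATK rejected). The function name `drop_ambiguous_allele_line` and the convention of returning `None` for a dropped record are assumptions for illustration, not the actual bcbio-nextgen code.

```python
VALID_BASES = set("GATCgatc")


def drop_ambiguous_allele_line(line):
    """Return a VCF line unchanged, or None if REF/ALT hold non-GATC characters.

    Hypothetical helper for illustration; header lines pass through untouched.
    """
    if line.startswith("#"):
        return line
    parts = line.split("\t")
    # VCF columns: CHROM, POS, ID, REF (index 3), ALT (index 4, comma-separated)
    alleles = [x.strip() for x in parts[4].split(",")] + [parts[3].strip()]
    for allele in alleles:
        # Any character outside GATC (for example the IUPAC code M) drops the record.
        if set(allele) - VALID_BASES:
            return None
    return line
```

Note that this sketch treats every character outside GATC as invalid, so it would also drop records with symbolic or missing ALT alleles; whether that is acceptable depends on the caller.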

kspham commented 10 years ago
    129         # We do this before combining them otherwise merging may fail
    130         # if there are invalid records
    131
    132         if do.file_exists(snp_file):
    133             to_combine.append(snp_file)
--> 134             _fix_varscan_vcf(snp_file, paired.normal_name, paired.tumor_name)
        snp_file = '/media/proj2/schi/2s2ns/work/varscan/3/tx/tmpl7Q...itrary-batch-name-3_59493180_62833181-raw.snp.vcf'
        paired.normal_name = '2ns'
        paired.tumor_name = '2s'
    135
    136         if do.file_exists(indel_file):
    137             to_combine.append(indel_file)
    138             _fix_varscan_vcf(indel_file, paired.normal_name, paired.tumor_name)

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/varscan.py in _fix_varscan_vcf(orig_file='/media/proj2/schi/2s2ns/work/varscan/3/tx/tmpl7Q...itrary-batch-name-3_59493180_62833181-raw.snp.vcf', normal_name='2ns', tumor_name='2s')
    175         with open(tmp_file) as in_handle:
    176             with open(tx_out_file, "w") as out_handle:
    177
    178                 for line in in_handle:
    179                     line = _clean_varscan_line(_fix_varscan_output(line, normal_name,
--> 180                                                                    tumor_name))
        tumor_name = '2s'
    181                     if not line:
    182                         continue
    183                     out_handle.write(line)
    184

...........................................................................
/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/varscan.py in _clean_varscan_line(line=None)
    378
    379
    380 def _clean_varscan_line(line):
    381     """Avoid lines with non-GATC bases, ambiguous output bases make GATK unhappy.
    382     """
--> 383     if not line.startswith("#"):
        line.startswith = undefined
    384         parts = line.split("\t")
    385         alleles = [x.strip() for x in parts[4].split(",")] + [parts[3].strip()]
    386         for a in alleles:
    387             if len(set(a) - set("GATCgatc")) > 0:

AttributeError: 'NoneType' object has no attribute 'startswith'


On Tue, Jul 8, 2014 at 2:24 PM, Son Pham kspham@eng.ucsd.edu wrote:

Sorry --- something else!

On Tue, Jul 8, 2014 at 2:21 PM, Son Pham kspham@eng.ucsd.edu wrote:

Brad: file varscan.py, line 237: if(line.startswith("##")): please replace it with startwith, not startSwith. Thank you! Son.


chapmanb commented 10 years ago

Son; Apologies, it was not handling the case when the line had previously been filtered. The latest push should hopefully consider this as well. Please let us know if you run into any other problems.
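The failure mode here is two filters chained together, where the first can return `None` to mean "drop this line" and the second then calls a string method on that `None`. A minimal sketch of a guard against this follows. `first_pass`, `second_pass` and `clean_lines` are hypothetical stand-ins for illustration, not the functions in varscan.py, and the blank-line filter is only an assumed example of an earlier step.

```python
def first_pass(line):
    """Hypothetical earlier filter: returns None to signal 'drop this line'."""
    return line if line.strip() else None


def second_pass(line):
    """Drop records with non-GATC alleles, tolerating already-filtered input."""
    # Guard: an earlier step may have filtered this line out and passed None.
    if line is None:
        return None
    if not line.startswith("#"):
        parts = line.split("\t")
        alleles = [parts[3].strip()] + [x.strip() for x in parts[4].split(",")]
        if any(set(a) - set("GATCgatc") for a in alleles):
            return None
    return line


def clean_lines(lines):
    """Apply both passes and keep only the lines that survive."""
    kept = []
    for line in lines:
        line = second_pass(first_pass(line))
        if not line:
            continue
        kept.append(line)
    return kept
```

Without the `line is None` check, the blank line in an input like `["##fileformat=VCFv4.1\n", "\n", ...]` would raise the same `AttributeError` shown in the traceback above.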