bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

Vardict: Incorrect input detected in teststrandbias.r #1887

Closed ghost closed 5 years ago

ghost commented 7 years ago

I am trying to perform ensemble variant analysis against GRCh37 with Freebayes, Varscan and VarDict. However, there seems to be a hold up on VarDict's inference:

[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:38Z] Genotyping with VarDict: Inference
[2017-04-05T15:39Z] Error: Incorrect input detected in teststrandbias.R
[2017-04-05T15:39Z] Execution halted
[2017-04-05T15:39Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 22, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 102, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; unset R_HOME && unset JAVA_HOME && export PATH=/usr/local/share/bcbio/anaconda/bin:$PATH && export VAR_DICT_OPTS='-Xms750m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/home/ubuntu/bcbiotx/tmpokNnFO' && vardict-java -G /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -f 0.01 -N TUMOR -b /home/ubuntu/align/TUMOR/TUMOR-sort.bam -c 1 -S 2 -E 3 -g 4 -Q 10 /home/ubuntu/vardict/1/NEO_E2E_run1-1_93068150_108578394-unmerged-regions-nolcr-regionlimit.bed | teststrandbias.R| var2vcf_valid.pl -A -N TUMOR -E -f 0.01  | /usr/local/share/bcbio/anaconda/bin/py -x 'bcbio.variation.vcfutils.add_contig_to_header(x, "/usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa")' | bcftools filter -i 'QUAL >= 0' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $5) } {print}' |  awk -F$'\t' -v OFS='\t' '$1!~/^#/ && $4 == $5 {next} {print}' | /usr/local/share/bcbio/anaconda/bin/vcfstreamsort | bgzip -c > /home/ubuntu/bcbiotx/tmpokNnFO/NEO_E2E_run1-1_93068150_108578394-raw.vcf.gz
Error: Incorrect input detected in teststrandbias.R
Execution halted
' returned non-zero exit status 1
[2017-04-05T15:39Z] Genotyping with VarDict: Inference
[2017-04-05T15:39Z] Error: Incorrect input detected in teststrandbias.R
[2017-04-05T15:39Z] Execution halted
[2017-04-05T15:39Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 22, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/usr/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 102, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)

I tried updating vardict with bcbio_conda install -c bioconda vardict, but that did not seem to solve the issue. This is the yaml file:

upload:
  dir: /home/ubuntu/project/final
details:
    - files: [/home/ubuntu/project/work/sample2_f.fastq.gz, /home/ubuntu/project/work/sample2_r.fastq.gz]
    description: TUMOR
    metadata:
      batch: batch1
      phenotype: tumor
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      coverage_interval: regional
      mark_duplicates: true
      remove_lcr: true
      recalibrate: false
      realign: false
      quality_format: standard
      variantcaller: [freebayes, vardict, varscan]
      variant_regions: /home/ubuntu/project/work/Exome_Annotated.bed
      ensemble:
        numpass: 2
        format-filters: [DP < 4]
        classifier-params:
          type: svm
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
          trusted-pct: 0.65
      clinical_reporting: true
      min_allele_fraction: 1 # 1% allelic fraction
      effects_transcripts: canonical
      # ploidy:
  - files: [/home/ubuntu/project/work/sample1_f.fastq.gz, /home/ubuntu/project/work/sample1_r.fastq.gz]
    description: NORMAL
    metadata:
      batch: batch1
      phenotype: normal
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      coverage_interval: regional
      mark_duplicates: true
      remove_lcr: true
      recalibrate: false
      realign: false
      hlacaller: bwakit

thanks in advance.

chapmanb commented 7 years ago

David; Thanks for the report, it looks like VarDict is producing some problematic output from your analysis. Before we get to the failure, I'm not sure you're configuration is as intended. Do you want to analyze these as a tumor/normal pair? If so, then you need to also supply the same variantcaller and other options to the normal sample so bcbio knows to batch them together for VarDict. As is, bcbio is running this as tumor-only, which I'm not sure you intended.

If you did want that, we'll need to debug further. Could you try running the vardict-java command outside of bcbio and then feeding the output to teststrandbias.R:

unset JAVA_HOME && export PATH=/usr/local/share/bcbio/anaconda/bin:$PATH && export VAR_DICT_OPTS='-Xms750m -Xmx3500m -XX:+UseSerialGC' && vardict-java -G /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -f 0.01 -N TUMOR -b /home/ubuntu/align/TUMOR/TUMOR-sort.bam -c 1 -S 2 -E 3 -g 4 -Q 10 /home/ubuntu/vardict/1/NEO_E2E_run1-1_93068150_108578394-unmerged-regions-nolcr-regionlimit.bed > raw_vardict.out
cat raw_vardict.out | teststrandbias.R

If that fails and you could provide more details about the output of raw_vardict.out happy to help debug further to determine what is wrong. Thanks much.

ghost commented 7 years ago

Hi Brad,

thanks for clarification on the yaml configuration. However, I am still receiving the Error: Incorrect input detected in teststrandbias.R.

new yaml:

---
upload:
  dir: /home/ubuntu/project/final
details:
  - files: [/home/ubuntu/project/work/sample2_f.fastq.gz, /home/ubuntu/project/work/sample2_r.fastq.gz]
    description: TUMOR
    metadata:
      batch: batch1
      phenotype: tumor
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      coverage_interval: regional
      mark_duplicates: true
      remove_lcr: true
      recalibrate: false
      realign: false
      quality_format: standard
      variantcaller:
        somatic: [freebayes, vardict, varscan]
        germline: [freebayes, vardict, varscan]
      variant_regions: /home/ubuntu/project/work/exome.bed
      ensemble:
        numpass: 2
      clinical_reporting: true
      min_allele_fraction: 1 
      effects_transcripts: canonical
  - files: [/home/ubuntu/project/work/sample1_f.fastq.gz, /home/ubuntu/project/work/sample1_r.fastq.gz]    
    description: NORMAL
    metadata:
      batch: batch1
      phenotype: normal
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      coverage_interval: regional
      mark_duplicates: true
      remove_lcr: true
      recalibrate: false
      realign: false
      quality_format: standard
      variantcaller:
        somatic: [freebayes, vardict, varscan]
        germline: [freebayes, vardict, varscan]
      variant_regions: /home/ubuntu/project/work/exome.bed
      ensemble:
        numpass: 2
      clinical_reporting: true
      min_allele_fraction: 1 
      effects_transcripts: canonical
      svcaller: cnvkit
fc_date: '20170406'
fc_name: test_run

I followed your instructions and ran

$ unset JAVA_HOME && export PATH=/usr/local/share/bcbio/anaconda/bin:$PATH && export VAR_DICT_OPTS='-Xms750m -Xmx3500m -XX:+UseSerialGC' && vardict-java -G /usr/local/share/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -f 0.01 -N NORMAL-germline -b /home/ubuntu/project/align/NORMAL/NORMAL-sort.bam -c 1 -S 2 -E 3 -g 4 -Q 10 /home/ubuntu/project/vardict/1/NORMAL-germline-1_93298425_109001698-unmerged-regions-nolcr-regionlimit.bed  > raw_Vardict.out

$ head raw_Vardict.out
NORMAL-germline     1   93306206    93306207    C   CT  53  3   19  31  2   1   C/+1    0.0566  2;2 26.7    1   38.7    1   60.0    6.000   0.0566  0   9   10.000  1   0.0 3   53  AAAAAGAAGAGGTATGTCGT    TTTTTTTTTGTCTTTTCAAG    1:93306087-93306421 Insertion
NORMAL-germline     1   93309641    93309641    C   T   95  3   40  52  2   1   C/T 0.0316  2;2 9.0 1   30.3    1   60.0    2.000   0.0217  0   0   3.000   1   0.7 2   92  TTCATTCAGTTGAAGAAGTG    CCATGCCGACTTTGCTTCTC    1:93308939-93309802 SNV
NORMAL-germline CCDC18-AS1_Exon1    1   93341983    93341983    C   T   108 2   41  65  1   1   C/T 0.0185  2;2 21.5    1   38.0    0   60.0    4.000   0.0187  0   1   1.000   1   1.5 2   107 TCATCCGCACATATGAGAAG    GAGCCTGTGAGTAAAAAAGA    1:93341851-93342083 SNV
NORMAL-germline CCDC18-AS1_Exon1    1   93342028    93342028    C   A   118 2   24  92  0   2   C/A 0.0169  2;0 20.0    1   39.0    1   60.0    4.000   0.0171  0   2   3.000   1   1.0 2   117 AACGATCATTCATTCAGCAC    AAATCAGCATTCAATGTCCT    1:93341851-93342083 SNV
NORMAL-germline CCDC18-AS1_Exon1    1   93342055    93342055    C   T   103 2   16  85  1   1   C/T 0.0194  2;2 13.5    1   32.5    1   60.0    4.000   0.0198  0   0   1.000   1   2.0 2   101 GCATTCAATGTCCTAGAACA    GACTTCTCACCCTTCTTGGA    1:93341851-93342083 SNV
NORMAL-germline FNBP1L_Exon1    1   93545157    93545157    G   A   50  50  0   0   11  39  A/A 1.0000  0;2 37.9    1   36.7    1   60.0    15.667  1.0000  0   2   2.000   1   1.3 47  47  TGGGGGAAGGAGGAAGAAAG    AAGCAAGTATAAAAGGGAAA    1:93545032-93545248 SNV
NORMAL-germline BCAR3_Exon1 1   93575912    93575912    G   A   73  2   20  51  0   2   G/A 0.0274  2;0 34.5    1   32.5    1   60.0    4.000   0.0274  0   1   1.000   1   1.5 2   73  TAAAGCCAAAAAGCCAGCAT    TAAATTTGAAGAGGGTCAGG    1:93575749-93576192 SNV
NORMAL-germline BCAR3_Exon1 1   93580581    93580581    G   C   25  25  0   0   9   16  C/C 1.0000  0;2 36.6    1   38.2    1   60.0    50.000  1.0000  0   1   1.000   1   1.3 25  25  TCACACACCTCATATTGATT    CAGTGTGATTGATTCAGATG    1:93580506-93580682 SNV
NORMAL-germline BCAR3_Exon1 1   93581024    93581024    T   C   57  27  18  12  9   18  T/C 0.4737  2;2 29.8    1   38.4    1   60.0    54.000  0.4737  0   0   1.000   1   1.5 27  57  ACATTTTTTTCCCTCTAATG    AGAGGGGTGGTGCACTTAAG    1:93580999-93581192 SNV
NORMAL-germline     1   93602194    93602194    C   T   25  25  0   0   16  9   T/T 1.0000  0;2 33.6    1   36.8    1   60.0    24.000  1.0000  0   0   1.000   1   1.2 24  24  CATAGGATGCTAAAAAAGCT    AGTAGACAATTCTGGCCTTC    1:93602186-93602554 SNV

Could the issue be the empty value in the second column of the first two rows? cut raw_Vardict.out -f 34 is returning the Insertion/SNV column so the output does have 34 columns as teststrandbias.R requires. loading the raw_Vardict.out in R with read.table has dimensions of 413 rows and 34 columns, but running line 8 of teststrandbias.R returns a value of 33 for the first two rows that have empty values in the second column. Any suggestions?

thanks.

chapmanb commented 7 years ago

David; Thanks for the details on this. On the practical side, you probably want to use germline focused callers rather than VarDict and VarScan for the germline specification. You'd be fine here with a single caller and doing germline: [freebayes] but if you really want 3 callers and have a GATK license (or are an academic lab) I'd use germline: [freebayes, gatk-haplotype, samtools].

On the actual VarDict error, I'll have to page @mjafin for this. Miika, do you know what column 2 is meant to be there and why it's empty in some cases?

roryk commented 5 years ago

Thanks, closing as this is a stale issue and we haven't had other reports of it.