bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

VarDict/bcftools error (wrong number of fields in LSEQ?) #908

Closed mpschr closed 9 years ago

mpschr commented 9 years ago

I am running the variant2 pipeline for exon samples. Everything seemed to be running fine until some error messages popped up:

First I get lots of errors - I think it is during the vardict task.

Traceback (most recent call last):
  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/mpschr/bin/bcbionextgen/tools/bin/bcftools merge -O z -r 1:142539965-173558159 `cat /home/mpschr/Documents/projects/exon/scratch/vardict/1/tx/tmpG81k0p/batch01-1_142539964_173558159-raw-files.txt` > /home/mpschr/Documents/projects/exon/scratch/vardict/1/tx/tmpG81k0p/tx/tmpHX0buS/batch01-1_142539964_173558159-raw.vcf.gz
Error at 1:142539993: wrong number of fields in LSEQ?
' returned non-zero exit status 255
[2015-06-26T23:36Z] Genotyping with VarDict: Inference
[2015-06-26T23:36Z] Picked up _JAVA_OPTIONS: -Dhttp.proxyHost=proxy.charite.de -Dhttp.proxyPort=8080
[2015-06-26T23:36Z] Genotyping with VarDict: Inference
[2015-06-26T23:36Z] Picked up _JAVA_OPTIONS: -Dhttp.proxyHost=proxy.charite.de -Dhttp.proxyPort=8080
[2015-06-26T23:36Z] Resource requests: ; memory: 1.00; cores: 1
[2015-06-26T23:36Z] Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2015-06-26T23:36Z] tabix index batch01-3_62037443_93505141-rawSample_AEXX.temp.vcf.gz
[2015-06-26T23:36Z] tabix index batch01-3_62037443_93505141-rawSample_PEXX.temp.vcf.gz
[2015-06-26T23:36Z] Merge variants
[2015-06-26T23:36Z] Error at 3:62045678: wrong number of fields in LSEQ?
[2015-06-26T23:36Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/mpschr/bin/bcbionextgen/tools/bin/bcftools merge -O z -r 3:62037444-93505141 `cat /home/mpschr/Documents/projects/exon/scratch/vardict/3/tx/tmpnZ1_kE/batch01-3_62037443_93505141-raw-files.txt` > /home/mpschr/Documents/projects/exon/scratch/vardict/3/tx/tmpnZ1_kE/tx/tmpm6g8y0/batch01-3_62037443_93505141-raw.vcf.gz
Error at 3:62045678: wrong number of fields in LSEQ?
' returned non-zero exit status 255

And then bcbio exits with the following error

  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 664, in __call__
    self.retrieve()
  File "/home/mpschr/bin/bcbionextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 547, in retrieve
    raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)

I am not sure how to interpret these errors. Could somebody point me in the right direction - what could be causing this?

chapmanb commented 9 years ago

Michael; Sorry about the issue. It sounds like there might be some invalid output from VarDict with an problematic LSEQ attribute in the INFO field. A couple of thoughts:

If you can provide us with details to reproduce or identify the issue we can work around it or report the underlying problem upstream to the VarDict developers. Thanks much for all the help debugging.

mpschr commented 9 years ago

Hi Brad,

  1. I am not sure if I have the lastest vardict-java, as no version is printed out from the software. I had made a bcbio develop upgrade last week is vardict updated with it?
  2. It is a tumor-only run
  3. The variant at 3:62045678 is actually the first variant of the vcf file - same goes for all other jobs that fail. The header and first variant look like this:

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_AEXX 3 62045678 . A G 85 PASS SAMPLE=Sample_AEXX;TYPE=SNV;DP=5;VD=5;AF=1;BIAS=0:2;REFBIAS=0:0;VARBIAS=3:2;PMEAN=25.8;PSTD=1;QUAL=36.8;QSTD=1;SBF=1;ODDRATIO=0;MQ=60;SN=10;HIAF=1.0000;ADJAF=0;SHIFT3=0;MSI=2;MSILEN=2;NM=1.2;HICNT=5;HICOV=5;LSEQ=TTATAAATTACCAGTCTTGG;RSEQ=TATGTCTTTATTAGCAGCAT GT:DP:VD:AD:AF:RD:ALD 1/1:5:5:0,5:1:0,0:3,2

Thanks for your help!

mpschr commented 9 years ago

Hi Brad, after answering, I made again a bcbio-develop update and the same error occurs. I just did a restart of the same run - should I clean some files in order to test more thoroughly?

mjafin commented 9 years ago

@mpschr Can I just quickly ask you if you're using the same batch name for several samples? It would seem like the error is coming from here maybe https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vardict.py#L154 and we've not really tested VarDict much in this scenario.

VarDict can do paired (tumor/normal) variant calling well but doesn't perform other types of batched/joint variant calling and the code at https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vardict.py#L134 was added merely to try and not break down when there is a run with Gatk/FreeBayes multisample (germline) batch/joint calling.

For your tumor-only samples I would suggest specifying each sample a unique batch name, if the above description fits you.

mpschr commented 9 years ago

hi @mjafin .

Indeed I have specified the same batch name (after the pipeline filed because there was no batchname) for the two samples with which I am doing tests.

Are you saying vardict designed for tumour-normal joint variant calling only? In this case it be even a better choice to just skip VarDict.

If I recall correctly, the phenotype field was also a requirement in the configuration file. Is that correct? If that is the case and VarDict is still a good option for tumor-only variant calling then maybe we can use that metadata in order to check that if there are multiple samples in the same batch and are of the same phenotype, and run seperately at https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vardict.py#L134 .

Also, I am a bit unclear about the batch metadata: is it used for tumor-normal pairs rather than for a set of samples that were run on the same batch in bcbio?

Thanks for your help! Michael

EDIT: I will run a test now with different batch names

mjafin commented 9 years ago

Hi Michael, Apologies for the unclear explanation, but good to hear we've nailed the issue down. VarDict can be used for both paired and single sample variant calling no problems. For single sample mode please assign a unique batch name for each sample. For paired mode, the pair of tumour and normal should share the batch name, as this is the mechanism bcbio uses to recognise pairs, but all pairs should have unique batch names. The problems you see arise when there are multiple tumours in the same batch - we tried to support this but clearly the merging of the files makes bcftools throw a fit.

It is essential to assign both batch names and phenotypes in metadata. The metadata is best provided by preparing a csv file as per these instructions: https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration

As an aside, in single sample variant calling a lot of downstream filtering is in order for VarDict. There is a suite available outside bcbio at https://github.com/AstraZeneca-NGS/VarDict/ to perform the filtering but unfortunately the documentation is extremely sparse at the moment. It involves essentially annotating the VCF files using dbsnp, Cosmic and clinvar, followed by application of vcf2txt.pl and vardict2mut.pl.

mpschr commented 9 years ago

Hi @mjafin

Ok, thanks for the clarifications and the spot-on diagnosis! VarDict seems to be running just fine now. :+1:

Just to clarify: In the cancer context, the batch attribute is solely used for identifying tumor-normal pairs? Say, if I have a cohort of 50 tumor samples, I should specify 50 different batch names, one for each sample, correct?

Regarding the downstream vcf filtering. I've been toying around with the dbSNP and ExAC data from the GEMINI database, which Brad pointed out to me. Thanks for the pointer though!

Best, Michael

mjafin commented 9 years ago

Hi Michael, You're spot on; assign 50 different batch names, one for each sample (could be e.g. just mysample1-batch, mysample2-batch etc.).

Yes, if you use the Gemini approach then Brad has already implemented most of the filters we use in vcf2txt and vardict2mut.

mpschr commented 9 years ago

Great; Kiitos paljon!

mjafin commented 9 years ago

Eipä kestä!

chapmanb commented 9 years ago

Thanks so much for the awesome diagnosis. Should we be checking and raising an error if we try to use VarDictJava for pooled (non-tumor) multisample calling? We could add in this sanity check to at least let folks know this is not a supported approach. Thanks again.

mpschr commented 9 years ago

@chapmanb I'd throw an exception if there are multiple tumor-phenotype (or same-phenotype) in the same batch. If I understood everything correctly, this is a non-supported configuration by bcbio.

mjafin commented 9 years ago

I agree to both. If there are multiple tumours in the same batch there should be a warning or error. Multisample non-tumor calling is also not a feature of VarDict and my go at trying to sneak around it by merging the variant calls into one file is not that successful so maybe just disable the whole hack.

chapmanb commented 9 years ago

Miika and Michael; Thanks again for all this debugging. I added an up front check so bcbio will provide a useful message if running VarDict with pooled non-tumor samples, or running a somatic batch with multiple tumors. Hopefully this'll make these issues easier to debug in the future. Thanks again.