chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure

Error at the validation step #18

Open ssaif opened 9 years ago

ssaif commented 9 years ago

Hello,

I am trying to incorporate the ensemble approach into my bcbio analysis and am getting errors at the bcbio.variation command that validates the calls. Here are some details:

Run log - /gpfs/ngs/oncology/Analysis/external/EXT_001_NA12878/EDGE/NA12878_bcbio_NGv3bed/work/run.log

Yaml file for bcbio.variation (to validate freebayes calls) - /gpfs/ngs/oncology/Analysis/external/EXT_001_NA12878/EDGE/NA12878_bcbio_NGv3bed/work/validate/NA12878_Germline_NGv3bed/freebayes/config/validate.yaml

Please let me know if you need additional information about the analysis.

Thanks, Sakina

chapmanb commented 9 years ago

Sakina; Thanks for the report. Happy to look at this if you could make the log and validation files available at a Gist (https://gist.github.com/). Thanks much.

ssaif commented 9 years ago

Hi,

They are available here. Please let me know if you can access them.

https://gist.github.com/ssaif/fbb164d1f28b3f4133c3 (Error lines pasted with flanks from the run log)
https://gist.github.com/ssaif/40228395b0f50f9585e9 (Yaml file for freebayes validation)

Thanks, Sakina

chapmanb commented 9 years ago

Sakina; Thanks for the additional detail. It appears that something is wrong with one of your input VCF files, specifically that it has truncated lines. The code is failing when it tries to access the reference allele to remove any gaps, and is finding a line with fewer fields than expected:

https://github.com/chapmanb/bcbio.variation/blob/fc5bac476ec9d9efb79dfd07a07590e319d95ba2/src/bcbio/variation/normalize.clj#L571

It would be worth checking the input VCF to see if something is wrong:

bcftools view /gpfs/ngs/oncology/Analysis/external/EXT_001_NA12878/EDGE/NA12878_bcbio_NGv3bed/work/freebayes/NA12878_Germline_NGv3bed-effects-ploidyfix-filter.vcf.gz

This should spit out the file and perhaps give a better error message to help debug. Hope this helps some with identifying the issue.
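Another way to look for truncated lines directly is to count the tab-separated fields in each record. The following is only a minimal Python sketch, not part of bcbio.variation or bcbio: the function name `find_short_records` is hypothetical, and the default threshold of 8 assumes the fixed VCF columns CHROM through INFO. For a single-sample file with genotypes, `min_fields=10` (adding FORMAT and one sample column) would be a stricter check.

```python
import gzip


def find_short_records(vcf_path, min_fields=8):
    """Return (line_number, field_count) pairs for records with too few columns.

    Assumption: the file is a tab-delimited VCF, either plain text or
    gzip/bgzip compressed (detected from a ".gz" suffix).
    """
    opener = gzip.open if vcf_path.endswith(".gz") else open
    short = []
    with opener(vcf_path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith("#"):
                continue  # header and meta-information lines are not records
            # Blank lines split into a single empty field, so they are reported too.
            field_count = len(line.rstrip("\r\n").split("\t"))
            if field_count < min_fields:
                short.append((line_number, field_count))
    return short
```

Calling `find_short_records` on the path of the VCF in question returns an empty list when every record has at least the expected number of columns; otherwise the reported line numbers point at the records to inspect.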

ssaif commented 9 years ago

Hi Brad,

Thanks for the quick response. I did a few checks on the vcf file and it seems to check out OK.

Another thing I want to point out is that in this run of bcbio, where I am also doing the ensemble step, I notice there are vcf files within each caller directory that seem to contain a combined call set (from all chromosomes). This is typically not seen in a run without bcbio.variation. The vcf file where you pointed out the error is one such combined calls file. Are these combined output files part of the bcbio.variation run?

To test this I will run bcbio.variation standalone on the calls generated per chromosome, which will hopefully reproduce this behaviour/error.

Thanks, Sakina

ssaif commented 9 years ago

I forgot to mention that I also found that the freebayes vcf did not have calls on chrM, because the Nimblegen bed file did not have chrM regions. But the GiaB NIST vcf and bed files (with hg19) that I am using to validate my calls do have chrM information (they start with chrM in this ordering). Could this be the cause of the bcbio.variation error I am getting?

Thanks, Sakina

ssaif commented 9 years ago

This was using BCBIO version 0.8.1a (alpha); I am not sure if I mentioned this earlier.

Thanks, Sakina

chapmanb commented 9 years ago

Sakina; Thanks for looking into this more. I added better debugging to a snapshot release of bcbio.variation. If you could download this and replace the existing version, it should hopefully report the exact line in the VCF where it is failing:

wget https://github.com/chapmanb/bcbio.variation/releases/download/v0.1.8-SNAPSHOT-20140906/bcbio.variation-0.1.8-SNAPSHOT-standalone.jar
mv bcbio.variation-0.1.8-SNAPSHOT-standalone.jar /group/ngs/src/bcbio-nextgen/0.8.1a/rhel6-x64/share/java/bcbio_variation/
rm /group/ngs/src/bcbio-nextgen/0.8.1a/rhel6-x64/share/java/bcbio_variation/bcbio.variation-0.1.7-standalone.jar

Regarding your other observations, the comparison handles cases where the regions differ between the input and reference calls. It will only compare in regions present in both, so this shouldn't be an issue. bcbio also prepares combined VCFs independently of the bcbio.variation evaluation. That is done for all calling; the combined file is the final input file, concatenated from the input files.
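To illustrate why a contig such as chrM that appears only in the reference regions drops out of the comparison, here is a small Python sketch of intersecting two sets of regions. It is not how bcbio.variation implements this; the function name `intersect_regions` and the `(chrom, start, end)` tuple layout with half-open, BED-style coordinates are assumptions made for the example, and the coordinates shown are made up.

```python
def intersect_regions(call_regions, ref_regions):
    """Return the regions shared by two lists of (chrom, start, end) intervals.

    Coordinates are assumed half-open (BED style). This is a simple
    all-against-all comparison per chromosome, fine for a demonstration
    but not tuned for large BED files.
    """
    ref_by_chrom = {}
    for chrom, start, end in ref_regions:
        ref_by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, start, end in call_regions:
        # Chromosomes missing from the reference set contribute nothing.
        for ref_start, ref_end in ref_by_chrom.get(chrom, []):
            lo, hi = max(start, ref_start), min(end, ref_end)
            if lo < hi:  # keep only non-empty overlaps
                shared.append((chrom, lo, hi))
    return sorted(shared)  # plain tuple sort: chromosome names order as strings


# Hypothetical regions: the call set has no chrM, the reference set does.
calls = [("chr1", 100, 200), ("chr2", 50, 80)]
reference = [("chrM", 0, 16571), ("chr1", 150, 300)]
print(intersect_regions(calls, reference))  # [('chr1', 150, 200)]
```

Because chrM occurs only on the reference side and chr2 only on the call side, neither contributes to the shared regions, which is the sense in which the differing region sets should not be a problem.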

Hope re-running with the updated code will help identify the problematic VCF line and shed more light on what is happening.