brittneybrinsfield / pysam

Automatically exported from code.google.com/p/pysam
0 stars 0 forks source link

Pysam doesn't support VCF 4.1 yet? #85

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. pysam-0.6

import pysam
import sys

# usage: <all variant positions.vcf> <each individual sample>

all_variants_vcf_fn = sys.argv[1]

vcf = pysam.VCF()
for ln in vcf.parse(open(all_variants_vcf_fn)):
        print vcf

3.

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward 
bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping 
quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples 
being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the 
first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the 
first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype 
frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value 
based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype 
likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable 
unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained 
genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ 
bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is 
an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the 
nonRef allele frequency in group1 samples being larger (,smaller) than in 
group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 
P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a 
smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA 
genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias 
P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled 
genotype likelihoods">

What is the expected output? What do you see instead?

Line 26: '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of 
Phred-scaled genotype likelihoods">'
Error BADLY_FORMATTED_FORMAT_STRING: Formatting error in the format string
Traceback (most recent call last):
  File "scripts/tabulate_multiple_vcf.py", line 10, in <module>
    for ln in vcf.parse(open(all_variants_vcf_fn)):
  File "cvcf.pyx", line 941, in cvcf.VCF.parse (pysam/cvcf.c:18961)
  File "cvcf.pyx", line 856, in cvcf.VCF._parse_header (pysam/cvcf.c:17303)
  File "cvcf.pyx", line 511, in cvcf.VCF.parse_header (pysam/cvcf.c:9401)
  File "cvcf.pyx", line 394, in cvcf.VCF.parse_format (pysam/cvcf.c:6414)
  File "cvcf.pyx", line 334, in cvcf.VCF.error (pysam/cvcf.c:5021)
ValueError: Formatting error in the format string

What version of the product are you using? On what operating system?

0.6, Linux-amd64

Please provide any additional information below.

Original issue reported on code.google.com by nlomi...@googlemail.com on 24 Jan 2012 at 1:28

GoogleCodeExporter commented 9 years ago
I think the main problem with VCF 4.1 is that the "Phred-scaled likelihoods" 
(PL) variable is now a vector with non-fixed number of integers and the length 
is instead the number of possible genotypes (as explained in 
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-cal
l-format-version-41). Something similar is true for the "Allele count" (AC) and 
the "Allele Frequency" (AF) variables. More in general, the Number variable can 
now be equal to "A" or "G" but pysam does not seem to like that at the moment. 
A dirty workaround is to remove triallelic sites from the VCF file with a 
command like this:
sed -e 's/Number=A/Number=1/g' -e 's/Number=G/Number=3/g' input.vcf | awk 
'$0~"^#" || $5!~","' > output.vcf
The new file should be parsable by pysam.

Original comment by giulio.g...@gmail.com on 7 Jun 2012 at 6:25

GoogleCodeExporter commented 9 years ago
We have support for this in HEAD in PyVCF: 
http://pyvcf.readthedocs.org/en/latest/index.html

Original comment by cas...@gmail.com on 18 Jun 2012 at 9:59

GoogleCodeExporter commented 9 years ago
Thanks, I have reworked the vcf module a little. It should now support more vcf 
flavours.

However, it is still experimental.

Best wishes,
Andreas

Original comment by andreas....@gmail.com on 16 Aug 2012 at 8:00