cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

If FORMAT header not present, parse from FORMAT column #283

Closed mbhall88 closed 5 years ago

mbhall88 commented 5 years ago

The FORMAT header is not a requirement of the VCF (v4.2) specifications

It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.

As such, it would be great if scikit-allel was able to read the FORMAT field from the FORMAT column instead (if it is present). As an example of a VCF, I am currently working with

##fileformat=VCFv4.2
##fileDate==16/07/19
##ALT=<ID=SNP,Description="SNP">
##ALT=<ID=PH_SNPs,Description="Phased SNPs">
##ALT=<ID=INDEL,Description="Insertion-deletion">
##ALT=<ID=COMPLEX,Description="Complex variant, collection of SNPs and indels">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant">
##ALT=<ID=SIMPLE,Description="Graph bubble is simple">
##ALT=<ID=NESTED,Description="Variation site was a nested feature in the graph">
##ALT=<ID=TOO_MANY_ALTS,Description="Variation site was a multinested feature with too many alts to include all in the VCF">
##INFO=<ID=GRAPHTYPE,Number=1,Type=String,Description="Type of graph feature">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  063     CFT073  H131800734      ST38
GC00010438      165     .       C       T       .       .       SVTYPE=SNP;GRAPHTYPE=SIMPLE     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:.:.:.:.:.:.:.:.:.
     .:.:.:.:.:.:.:.:.:.     .:.:.:.:.:.:.:.:.:.     0:19,0:15,0:19,0:15,0:39,0:31,0:0,1:-2.69899,-226.576:223.877
alimanfoo commented 5 years ago

Hi @mbhall88, you can force scikit-allel to read FORMAT fields even if they are not declared in the header. E.g.:

In [18]: callset = allel.read_vcf('callset.vcf', fields=['calldata/GT'], types={'calldata/GT': 'i1'})                                                                        
/home/aliman/malariagen/binder/conda/envs/vector-ops-13624de/lib/python3.6/site-packages/allel/io/vcf_read.py:1257: UserWarning: 'GT' FORMAT header not found
  warnings.warn('%r FORMAT header not found' % name)

In [19]: callset                                                                                                                                                             
Out[19]: 
{'calldata/GT': array([[[-1, -1],
         [-1, -1],
         [-1, -1],
         [ 0, -1]]], dtype=int8)}
mbhall88 commented 5 years ago

Ah cool! This worked. Thank you for the quick response.

alimanfoo commented 5 years ago

Great, no worries!

On Tue, 6 Aug 2019, 15:56 Michael Hall, notifications@github.com wrote:

Closed #283 https://github.com/cggh/scikit-allel/issues/283.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/283?email_source=notifications&email_token=AAFLYQVPBWVO42LHIFSF7YLQDGGIRA5CNFSM4IJU7IW2YY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOS44NARI#event-2537082949, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQTUQSFUOBH6QVJ5M33QDGGIRANCNFSM4IJU7IWQ .