cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

'variants/numalt' identical for all samples even though GT different #315

Closed tomwells-oxf closed 4 years ago

tomwells-oxf commented 4 years ago

Hi, I'm trying to extract genotypes from a vcf created with gatk's GenotypeGVCF for ~200 samples. When I choose an individual sample with

callset=allel.read_vcf('file.vcf', samples=['sample'], fields='*', alt_number=6)

the vcf for each sample seems to load fine for all fields except 'variants/numalt', which is identical in every single sample... I've compared calldata/GT for the samples, and variants/ALT, and these differ between the samples, so I can't work out why numalt would be the same.

Is there something I am doing wrong, is this an issue with scikit-allel, or is it a problem with gatk output maybe?

Thanks for your help.

Tom

hardingnj commented 4 years ago

Hi Tom.

I would expect variants/numalt to be the same for all samples- because variants contains annotations that refer to variants (as opposed to calldata which contains sample annotations). variants/numalt should contain the number of alternate alleles observed across the cohort, or 6 if you explicitly set the ALT alleles count to this.

Equally, variants/ALT should be the same for all samples. So I'm suprised you are seeing a difference there?

If you could share a single line from the VCF that would help and we can review expected behaviour.

If you instead want to count the number of alternate alleles in each sample, you could do something like:

callset=allel.read_vcf('file.vcf', fields='*', alt_number=6)
gt = allel.GenotypeArray(callset['calldata/GT'])
gt.to_n_alt()
tomwells-oxf commented 4 years ago

Thanks for getting back to me so quickly. I'd completely misunderstood what numalt did! Sorry for wasting your time - all sorted now.