cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

vcf_to_csv -> not recognizing FORMAT #340

Open ayadlin opened 3 years ago

ayadlin commented 3 years ago

Hi - I am trying to copy three columns from a VCF file to a CSV file: the first is ID, the second is FORMAT DS, and the third is FORMAT GP. Using the command below I get a warning: allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/*'])

UserWarning: '*' FORMAT header not found warnings.warn('%r FORMAT header not found' % name)

Changing the command to

allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/DS'], types={'calldata/DS': 'f4'}) avoids the warning, but I get an empty CSV file. I know there is data (floats) in the FORMAT DS column. Is there anything wrong with what I am doing, or is there an issue with the CSV writing?

allel.read_vcf('my_vcf.vcf', fields=['DS','GP']) works, but it is a long process, and I am not sure how to go from there to the CSV file.

Just in case this is useful: FORMAT is one of the headers of my file, and the description section defines:

FORMAT=

FORMAT=

FORMAT=

A final question: as you see, GP is a tuple (float, float, float). If I need to assign a type to it in the types dictionary, what would the correct syntax be? Also, I have not been able to find exactly what f4 means. I know it is a float, but is it float32 or float64?

What I am trying to build is a CSV file that preserves the sample identifiers, the SNP IDs, the dosage (DS), and the posterior probabilities (GP). I have about 5,000 samples and 10,000,000 SNPs divided across 22 chromosomes. I would appreciate any help on how to extract and consolidate that data.

Thanks,

A

PS - I am sure the issue is with reading the calldata, as these three calls

allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID', 'calldata/DS'], types={'calldata/DS': 'f4'})
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID', 'DS'])
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID'])

all produce the same .csv with the ID only.

hardingnj commented 3 years ago

Also, I have not been able to find exactly what f4 means. I know it is a float, but is it float32 or float64?

The number is the size in bytes, so f4 = 4 bytes x 8 bits = 32-bit float.
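The byte-count reading of the type code can be checked directly in NumPy, which is where these dtype strings come from. This is a minimal sketch; the variable names are only for illustration.

```python
import numpy as np

# 'f4' is NumPy shorthand for a 4-byte float; 'f8' is the 8-byte one.
f4 = np.dtype('f4')
f8 = np.dtype('f8')

# itemsize is the size in bytes, so multiply by 8 to get bits.
print(f4, f4.itemsize * 8)
print(f8, f8.itemsize * 8)
```

So 'f4' is the same dtype as float32, and 'f8' is float64.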

Your code looks ok to me- no obvious problems with how you have specified those fields.

It's difficult to debug without access to the file, but if you could provide a minimal example that fails I'd be happy to look in detail. If you modify the numbers to obscure anything potentially identifiable/privileged that would be good too.

vcflib contains some useful commands to downsample VCF files.

ayadlin commented 3 years ago

Hi, thanks for the quick reply. I will ask permission, as I don’t have authorization to share the files (I’m only a user), and then extract and anonymize a subset with vcflib.

In the meantime, could the size of the file be the problem? Would you recommend dividing them into subfiles or into chunks?

Thanks!

A

hardingnj commented 3 years ago

I think it's unlikely to be the size... there are ways of chunking the file within allel if it's very large. The sorting could possibly be a problem. It might also be worth using the region argument to read a subset of the data.
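As a rough sketch of the region suggestion, the helper below reads only ID, sample names, DS and GP for one region at a time. The helper name `read_dosage_by_region` is hypothetical, and the chromosome naming ('1' to '22' rather than 'chr1' to 'chr22') is an assumption about the file. Giving GP a scalar type plus a `numbers` entry of 3 is one way to describe the three-float tuple asked about earlier.

```python
def read_dosage_by_region(vcf_path, region):
    """Read ID, sample names, DS and GP for one region of a VCF file."""
    # Imported here so the helper can be defined without scikit-allel installed.
    import allel

    # read_vcf returns a dict of arrays, or None if the region has no variants.
    return allel.read_vcf(
        vcf_path,
        region=region,  # e.g. '22' or '22:1-500000'
        fields=['variants/ID', 'samples', 'calldata/DS', 'calldata/GP'],
        types={'calldata/DS': 'f4', 'calldata/GP': 'f4'},
        numbers={'calldata/GP': 3},  # three probabilities per call
    )


# Assumption: chromosomes are named '1'..'22' in the VCF.
regions = [str(chrom) for chrom in range(1, 23)]
```

Calling `read_dosage_by_region('my_vcf.vcf', '22:1-500000')` on a small window is also a quick way to check whether DS and GP are parsed at all. Region queries are fastest when the file is bgzip-compressed and tabix-indexed.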

A small subset of the data that fails to work can tell us much more than the error message.
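On the question of getting from the read_vcf output to a CSV file: if, as the ID-only output above suggests, the CSV export carries only per-variant fields, one option is to write the per-sample values yourself. The sketch below is not part of scikit-allel. `write_long_csv` and the column names are hypothetical, and the made-up `demo` dict only mimics the keys that read_vcf returns ('variants/ID', 'samples', 'calldata/DS', 'calldata/GP'). It writes one row per (variant, sample) pair.

```python
import csv
import io


def write_long_csv(callset, out):
    """Write one row per (variant, sample) with DS and the three GP values."""
    writer = csv.writer(out)
    writer.writerow(['ID', 'sample', 'DS', 'GP0', 'GP1', 'GP2'])
    for i, variant_id in enumerate(callset['variants/ID']):
        for j, sample in enumerate(callset['samples']):
            ds = callset['calldata/DS'][i][j]  # shape: variants x samples
            gp = callset['calldata/GP'][i][j]  # shape: variants x samples x 3
            writer.writerow([variant_id, sample, ds, *gp])


# Tiny made-up callset: 2 variants x 2 samples.
demo = {
    'variants/ID': ['rs1', 'rs2'],
    'samples': ['S1', 'S2'],
    'calldata/DS': [[0.0, 1.0], [2.0, 0.5]],
    'calldata/GP': [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
    ],
}

buf = io.StringIO()
write_long_csv(demo, buf)
rows = buf.getvalue().splitlines()
print(rows[1])
```

The same indexing works on the NumPy arrays that read_vcf returns. The row count grows as variants x samples, so with about 5,000 samples and 10,000,000 variants it is probably only practical to write one region or chromosome per file.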