bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

Download population variants from ENSEMBL #9

Closed husensofteng closed 5 years ago

husensofteng commented 5 years ago

get SpeciesName_incl_consequences.vcf.gz from ENSEMBL:

file path: ftp://ftp.ensembl.org/pub/current_variation/vcf//_incl_consequences*.vcf.gz

Only for human is the file split by chromosome; for other species it is a single file covering all chromosomes.

ypriverol commented 5 years ago

@husensofteng I guess these VCFs correspond to the 1000 Genomes project?

husensofteng commented 5 years ago

Clarification on gnomAD and ENSEMBL: gnomAD provides a great set of variants from all major projects, and they are VEP-annotated. However, the difficulty of finding a consensus MAF threshold and the lack of the variants in hg18 make it complicated to use.

For the time being, I suggest we stick with the ENSEMBL variation database because it provides MAF for variants from 1000G as well as transcript annotations that are in sync with the GTF and CDS files. It also provides variants for some other species.

husensofteng commented 5 years ago

@husensofteng I guess these VCFs correspond to the 1000 Genomes project?

The MAF is calculated based on 1000G, and since we filter on MAF, yes, the variants we will be using are going to be based on 1000G.
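
A minimal sketch of what MAF-based filtering could look like, assuming the MAF is read from the `MAF=` key of the INFO column. The threshold value, the direction of the comparison (keeping variants at or above the cutoff), and the function names are assumptions for illustration; none of them are stated in this thread.

```python
import re

# Hypothetical cutoff: the thread does not state which MAF threshold is used.
MAF_THRESHOLD = 0.01

# Match "MAF=" only at the start of the INFO string or right after a ";",
# so that other keys such as "MA=" are not picked up by mistake.
_MAF_RE = re.compile(r"(?:^|;)MAF=([^;]+)")


def get_maf(info):
    """Return the MAF from a VCF INFO string, or None if absent or not numeric."""
    match = _MAF_RE.search(info)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None


def passes_maf(info, threshold=MAF_THRESHOLD):
    """Keep a variant only if it has a MAF at or above the threshold (assumed rule)."""
    maf = get_maf(info)
    return maf is not None and maf >= threshold
```

Variants without a `MAF=` entry are dropped in this sketch; whether that is the desired behaviour is a design choice the thread leaves open.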

ypriverol commented 5 years ago

Should we download all the files for each species, or only the *.vcf.gz files? I also see a .tbi file; is that needed?

husensofteng commented 5 years ago

No, not all files, just those that exactly match this pattern: _incl_consequences.vcf.gz. The .tbi files are not needed because we will uncompress and parse the VCFs in the script. Also, for now we only consider short variants and ignore structural variants.
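
Since the .tbi index is skipped and the files are read sequentially, the uncompress-and-parse step could be sketched as below. This is only an illustration using Python's standard `gzip` module; `iter_vcf_records` is a hypothetical name, not the function in the repository.

```python
import gzip


def iter_vcf_records(path):
    """Yield the tab-separated fields of each data line in a gzipped VCF.

    The file is streamed line by line, so no .tbi index is required and
    the whole file never has to fit in memory.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip meta-information and header lines ("##..." and "#CHROM...")
            # as well as empty lines.
            if line.startswith("#") or not line.strip():
                continue
            yield line.rstrip("\n").split("\t")
```

Streaming matters mostly for human, where the per-chromosome files are large.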

ypriverol commented 5 years ago

@husensofteng can you let me know what those files contain?

husensofteng commented 5 years ago

No, not all files, just those that exactly match this pattern: _incl_consequences.vcf.gz. The .tbi files are not needed because we will uncompress and parse the VCFs in the script. Also, for now we only consider short variants and ignore structural variants.

Sorry I meant: _incl_consequences*.vcf.gz (note the star)
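
As a rough illustration of that pattern, the sketch below picks matching names out of a directory listing with the standard `fnmatch` module. The leading `*` (to allow for the species-name prefix) and the function name are assumptions added here; the file names in the usage note are made up.

```python
from fnmatch import fnmatchcase

# Pattern from the comment above, with a leading "*" added (an assumption)
# so that the species-name prefix of the file is allowed to vary.
PATTERN = "*_incl_consequences*.vcf.gz"


def select_consequence_vcfs(filenames):
    """Return only the names that match the *_incl_consequences*.vcf.gz pattern.

    The pattern must match the whole name, so ".vcf.gz.tbi" index files and
    files without "_incl_consequences" in their name are left out.
    """
    return [name for name in filenames if fnmatchcase(name, PATTERN)]
```

For a hypothetical listing `["sp_incl_consequences.vcf.gz", "sp_incl_consequences.vcf.gz.tbi", "sp_structural_variations.vcf.gz"]`, only the first name is kept.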

husensofteng commented 5 years ago

@husensofteng can you let me know what those files contain?

Do you mean the vcf.gz files?

ypriverol commented 5 years ago

Yes!!!

husensofteng commented 5 years ago

They are standard VCFs that contain the following tab-separated columns: chr, start, varID, refAllele, altAlleles, ., ., and the INFO column (containing MAF, VEP annotations, etc.).

e.g:

22  15528187    rs549380368 G   A,T .   .   dbSNP_151;TSA=SNV;E_Freq;E_1000G;E_TOPMed;E_gnomAD;MA=T;MAF=0.000199681;MAC=1;VarPep=0|D|ENST00000252835,1|V|ENST00000252835;Polyphen=0|benign|0|ENST00000252835,1|benign|0|ENST00000252835;Sift=0|tolerated|1|ENST00000252835,1|tolerated|0.2|ENST00000252835;AA=G;RefPep=G;VE=missense_variant|0|mRNA|ENST00000252835,missense_variant|1|mRNA|ENST00000252835;CSQ=A|missense_variant|mRNA|ENST00000252835|G/D|tolerated(1),T|missense_variant|mRNA|ENST00000252835|G/V|tolerated(0.2)
22  15528271    rs576526848 T   A,C .   .   dbSNP_151;TSA=SNV;E_Freq;E_1000G;E_ExAC;E_gnomAD;MA=T;MAF=0.000199681;MAC=1;VarPep=0|N|ENST00000643195,1|T|ENST00000643195,0|N|ENST00000252835,1|T|ENST00000252835;Polyphen=0|benign|0.067|ENST00000643195,1|benign|0.017|ENST00000643195,0|benign|0.067|ENST00000252835,1|benign|0.017|ENST00000252835;Sift=0|tolerated|0.08|ENST00000643195,1|tolerated|0.25|ENST00000643195,0|tolerated|0.07|ENST00000252835,1|tolerated|0.25|ENST00000252835;AA=T;RefPep=I;VE=missense_variant|0|mRNA|ENST00000643195,missense_variant|1|mRNA|ENST00000643195,missense_variant|0|mRNA|ENST00000252835,missense_variant|1|mRNA|ENST00000252835;CSQ=A|missense_variant|mRNA|ENST00000252835|I/N|tolerated(0.07),A|missense_variant|mRNA|ENST00000643195|I/N|tolerated(0.08),C|missense_variant|mRNA|ENST00000252835|I/T|tolerated(0.25),C|missense_variant|mRNA|ENST00000643195|I/T|tolerated(0.25)
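
To make the column layout concrete, here is a small parsing sketch for lines like the two above. It assumes the columns are separated by real tabs in the file (they appear as spaces in this comment), that INFO entries are separated by `;`, and that CSQ entries are comma-separated with `|`-separated fields, as in the examples. The function names and dictionary keys are illustrative and are not taken from the repository.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags without '=' (e.g. E_1000G) map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_vcf_line(line):
    """Parse one data line with the eight columns listed above."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, _qual, _filter, info = columns[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alts": alt.split(","),  # one entry per alternative allele
        "info": parse_info(info),
    }


def parse_csq(csq):
    """Split a CSQ value into one list of '|'-separated fields per entry."""
    return [entry.split("|") for entry in csq.split(",")]
```

In the examples above, the first field of each CSQ entry is the alternative allele and the fourth is the transcript ID, which is what links a variant to the transcripts in the GTF and CDS files.
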
husensofteng commented 5 years ago

A function has now been added, and the test case downloads a VCF file from ENSEMBL for the specified species.
Note that for human, one VCF file is downloaded per chromosome, since ENSEMBL splits the variants by chromosome because of the large file sizes. I will close this issue.