Closed husensofteng closed 5 years ago
Clarification on gnomAD and ENSEMBL: gnomAD provides a great set of variants from all major projects, and they are VEP-annotated. However, the difficulty of finding a consensus MAF threshold and the lack of the variants in hg18 make it complicated to use.
For the time being, I suggest we stick with the ENSEMBL variation database, because it provides MAF for variants from 1000G as well as transcript annotations that are in sync with the GTF and CDS files. It also provides variants for some other species.
@husensofteng these VCF's I guess correspond to the 1000Genomes project?
The MAF is calculated based on 1000G, and since we filter on it, yes, the variants that we will be using are going to be based on 1000G.
Should we download all the files for each repository, or only the *.vcf.gz files? I also see a .tbi file; is that needed?
No, not all files, just those that exactly match this pattern: _incl_consequences.vcf.gz. The tbi files are not needed because we will uncompress and parse the VCFs in the script. Also, for now we only consider short variants and ignore the structural variants.
Sorry I meant: _incl_consequences*.vcf.gz (note the star)
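A minimal sketch of how that file selection could look in Python. The leading `*` in the pattern (to allow for the species-name prefix), the `select_vcfs` helper, and the file names in the example listing are all assumptions made for illustration; they are not taken from the actual script.

```python
from fnmatch import fnmatchcase

# Pattern from the comment above, with a leading "*" added (assumption)
# so that a species-name prefix is accepted.
PATTERN = "*_incl_consequences*.vcf.gz"


def select_vcfs(filenames):
    """Return only the names that match the consequences-VCF pattern.

    Index files (.tbi) and any other files are dropped because they do
    not end in .vcf.gz or lack the _incl_consequences part.
    """
    return [name for name in filenames if fnmatchcase(name, PATTERN)]


# Hypothetical directory listing, for illustration only.
listing = [
    "example_species.vcf.gz",
    "example_species_incl_consequences.vcf.gz",
    "example_species_incl_consequences.vcf.gz.tbi",
    "example_species_incl_consequences-chr22.vcf.gz",
    "example_species_structural_variations.vcf.gz",
    "README",
]
print(select_vcfs(listing))
```

With this pattern the .tbi files and the structural-variation file fall out automatically, so no separate exclusion step should be needed.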
@husensofteng can you let me know what those files contain?
Do you mean the vcf.gz files?
Yes!!!
They are standard VCFs that contain the following columns (tab-separated): chr, start, varID, refAllele, altAlleles, ., ., INFO column (containing MAF, VEP annotations, etc.)
e.g.:
22 15528187 rs549380368 G A,T . . dbSNP_151;TSA=SNV;E_Freq;E_1000G;E_TOPMed;E_gnomAD;MA=T;MAF=0.000199681;MAC=1;VarPep=0|D|ENST00000252835,1|V|ENST00000252835;Polyphen=0|benign|0|ENST00000252835,1|benign|0|ENST00000252835;Sift=0|tolerated|1|ENST00000252835,1|tolerated|0.2|ENST00000252835;AA=G;RefPep=G;VE=missense_variant|0|mRNA|ENST00000252835,missense_variant|1|mRNA|ENST00000252835;CSQ=A|missense_variant|mRNA|ENST00000252835|G/D|tolerated(1),T|missense_variant|mRNA|ENST00000252835|G/V|tolerated(0.2)
22 15528271 rs576526848 T A,C . . dbSNP_151;TSA=SNV;E_Freq;E_1000G;E_ExAC;E_gnomAD;MA=T;MAF=0.000199681;MAC=1;VarPep=0|N|ENST00000643195,1|T|ENST00000643195,0|N|ENST00000252835,1|T|ENST00000252835;Polyphen=0|benign|0.067|ENST00000643195,1|benign|0.017|ENST00000643195,0|benign|0.067|ENST00000252835,1|benign|0.017|ENST00000252835;Sift=0|tolerated|0.08|ENST00000643195,1|tolerated|0.25|ENST00000643195,0|tolerated|0.07|ENST00000252835,1|tolerated|0.25|ENST00000252835;AA=T;RefPep=I;VE=missense_variant|0|mRNA|ENST00000643195,missense_variant|1|mRNA|ENST00000643195,missense_variant|0|mRNA|ENST00000252835,missense_variant|1|mRNA|ENST00000252835;CSQ=A|missense_variant|mRNA|ENST00000252835|I/N|tolerated(0.07),A|missense_variant|mRNA|ENST00000643195|I/N|tolerated(0.08),C|missense_variant|mRNA|ENST00000252835|I/T|tolerated(0.25),C|missense_variant|mRNA|ENST00000643195|I/T|tolerated(0.25)
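As a rough illustration of how one of these lines could be parsed, here is a small Python sketch. The helper names (`parse_info`, `parse_record`) and the returned dictionary layout are hypothetical, and the example line is an abridged copy of the first record above with tabs restored; the real script may do this differently.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line):
    """Parse one tab-separated VCF data line into a small dict."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, _qual, _filter, info = columns[:8]
    fields = parse_info(info)
    maf = fields.get("MAF")
    csq = fields.get("CSQ")
    return {
        "chr": chrom,
        "start": int(pos),
        "varID": var_id,
        "refAllele": ref,
        "altAlleles": alt.split(","),
        # MAF is optional; keep None when the record has no MAF entry.
        "MAF": float(maf) if isinstance(maf, str) else None,
        # One CSQ entry per allele/transcript pair, still '|'-delimited.
        "CSQ": csq.split(",") if isinstance(csq, str) else [],
    }


# Abridged version of the first example record (some INFO keys omitted).
EXAMPLE = (
    "22\t15528187\trs549380368\tG\tA,T\t.\t.\t"
    "dbSNP_151;TSA=SNV;E_1000G;MA=T;MAF=0.000199681;MAC=1;"
    "CSQ=A|missense_variant|mRNA|ENST00000252835|G/D|tolerated(1),"
    "T|missense_variant|mRNA|ENST00000252835|G/V|tolerated(0.2)"
)

record = parse_record(EXAMPLE)
print(record["varID"], record["altAlleles"], record["MAF"])
```

The parsed MAF can then be compared against whatever threshold the project settles on; no threshold is assumed here.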
A function has been added, and the test case downloads a VCF file from ENSEMBL for the specified species.
Note that for humans a VCF file is downloaded per chromosome, since ENSEMBL distributes the human variants by chromosome due to the large file sizes.
I will close this issue.
get SpeciesName_incl_consequences.vcf.gz from ENSEMBL:
file path: ftp://ftp.ensembl.org/pub/current_variation/vcf//_incl_consequences*.vcf.gz
Only for humans is the file split by chromosome; for other species it is a single file for all chromosomes!
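Since the download can yield either one file or one file per chromosome, the parsing step could accept a list of paths in both cases. The sketch below assumes the files are already on disk; `iter_vcf_lines` and the paths passed to it are hypothetical names, not part of the existing code. It reads the gzip-compressed files directly, so no .tbi index is involved.

```python
import gzip


def iter_vcf_lines(paths):
    """Yield data lines from one or more gzip-compressed VCF files.

    Header lines (starting with '#') and blank lines are skipped, so the
    caller sees the same stream whether the variants come from a single
    file or from one file per chromosome.
    """
    for path in paths:
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.startswith("#") or not line.strip():
                    continue
                yield line.rstrip("\n")
```

For humans, `paths` would hold the per-chromosome files; for other species it would be a one-element list.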