Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
445 stars 151 forks source link

sqlite3 instead of VEP scripts? #433

Closed thedam closed 5 years ago

thedam commented 5 years ago

Hi, I wonder whether it would be good a idea to store all computed outputs for human data in one big sqlite3 database? It would be much easier to use such data, instead of computing it again and again, then parsing etc... Isn't it good idea? (I know, some data is generated dynamically, but it could be updated once, on your side)

so at the end it would be just one big table with all columns, for example:

Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND FLAGS VARIANT_CLASS SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC REFSEQ_MATCH GIVEN_REF USED_REF BAM_EDIT SOURCE GENE_PHENO SIFT PolyPhen EXON INTRON DOMAINS HGVSc HGVSp HGVS_OFFSET HGVSg AF AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF MAX_AF MAX_AF_POPS CLIN_SIG SOMATIC PHENO PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE 1000Gp3_AC 1000Gp3_AF 1000Gp3_EUR_AC 1000Gp3_EUR_AF 29way_logOdds 29way_logOdds_rankscore 29way_pi ALSPAC_AC ALSPAC_AF APPRIS Aloft_Confidence Aloft_Fraction_transcripts_affected Aloft_pred Aloft_prob_Dominant Aloft_prob_Recessive Aloft_prob_Tolerant AltaiNeandertal Ancestral_allele CADD_phred CADD_raw CADD_raw_rankscore DANN_rankscore DANN_score DEOGEN2_pred DEOGEN2_rankscore DEOGEN2_score Denisova ESP6500_AA_AC ESP6500_AA_AF ESP6500_EA_AC ESP6500_EA_AF Eigen-PC-phred_coding Eigen-PC-raw_coding Eigen-PC-raw_coding_rankscore Eigen-pred_coding Eigen-raw_coding Eigen-raw_coding_rankscore Ensembl_geneid Ensembl_proteinid Ensembl_transcriptid ExAC_AC ExAC_AF ExAC_Adj_AC ExAC_Adj_AF ExAC_NFE_AC ExAC_NFE_AF FATHMM_converted_rankscore FATHMM_pred FATHMM_score GENCODE_basic GERP++_NR GERP++_RS GERP++_RS_rankscore GM12878_confidence_value GM12878_fitCons_rankscoreGM12878_fitCons_score GTEx_V7_gene GTEx_V7_tissue GenoCanyon_rankscore GenoCanyon_score Geuvadis_eQTL_target_geneH1-hESC_confidence_value H1-hESC_fitCons_rankscore H1-hESC_fitCons_score HUVEC_confidence_value HUVEC_fitCons_rankscore HUVEC_fitCons_score Interpro_domain LINSIGHT LINSIGHT_rankscore LRT_Omega LRT_converted_rankscore LRT_pred LRT_score M-CAP_pred M-CAP_rankscore M-CAP_score MPC_rankscore MPC_score MVP_rankscore MVP_score MetaLR_pred MetaLR_rankscore MetaLR_score MetaSVM_pred MetaSVM_rankscore MetaSVM_score MutPred_AAchange MutPred_Top5features MutPred_protID MutPred_rankscore MutPred_score MutationAssessor_pred MutationAssessor_rankscoreMutationAssessor_score MutationTaster_AAE MutationTaster_converted_rankscore MutationTaster_model MutationTaster_pred MutationTaster_score PROVEAN_converted_rankscore PROVEAN_pred PROVEAN_score Polyphen2_HDIV_pred Polyphen2_HDIV_rankscore Polyphen2_HDIV_score Polyphen2_HVAR_pred Polyphen2_HVAR_rankscore Polyphen2_HVAR_score PrimateAI_pred PrimateAI_rankscore PrimateAI_score REVEL_rankscore REVEL_score Reliability_index SIFT4G_converted_rankscoreSIFT4G_pred SIFT4G_score SIFT_converted_rankscore SIFT_pred SIFT_score TSL TWINSUK_AC TWINSUK_AFUK10K_AC UK10K_AF Uniprot_acc Uniprot_entry VEP_canonical VEST4_rankscore VEST4_score VindijiaNeandertalaaalt aapos aaref alt bStatistic bStatistic_rankscore cds_strand clinvar_clnsig clinvar_hgvs clinvar_review clinvar_rs clinvar_trait clinvar_var_source codon_degeneracy codonpos fathmm-MKL_coding_group fathmm-MKL_coding_pred fathmm-MKL_coding_rankscore fathmm-MKL_coding_score fathmm-XF_coding_pred fathmm-XF_coding_rankscorefathmm-XF_coding_score genename gnomAD_exomes_AC gnomAD_exomes_AF gnomAD_exomes_AN gnomAD_exomes_ASJ_AC gnomAD_exomes_ASJ_AF gnomAD_exomes_ASJ_AN gnomAD_exomes_ASJ_nhomalt gnomAD_exomes_NFE_AF gnomAD_exomes_NFE_AN gnomAD_exomes_NFE_nhomalt gnomAD_exomes_POPMAX_AC gnomAD_exomes_POPMAX_AF gnomAD_exomes_POPMAX_AN gnomAD_exomes_POPMAX_nhomalt gnomAD_exomes_flag gnomAD_exomes_nhomalt gnomAD_genomes_AC gnomAD_genomes_AF gnomAD_genomes_ANgnomAD_genomes_ASJ_AC gnomAD_genomes_ASJ_AF gnomAD_genomes_ASJ_AN gnomAD_genomes_ASJ_nhomalt gnomAD_genomes_NFE_AC gnomAD_genomes_NFE_AF gnomAD_genomes_NFE_AN gnomAD_genomes_NFE_nhomalt gnomAD_genomes_POPMAX_AC gnomAD_genomes_POPMAX_AF gnomAD_genomes_POPMAX_AN gnomAD_genomes_POPMAX_nhomalt gnomAD_genomes_flag gnomAD_genomes_nhomalt hg18_chr hg18_pos(1-based) hg19_chr hg19_pos(1-based) integrated_confidence_value integrated_fitCons_rankscore integrated_fitCons_score phastCons100way_vertebrate phastCons100way_vertebrate_rankscore phastCons17way_primate phastCons17way_primate_rankscore phastCons30way_mammalian phastCons30way_mammalian_rankscore phyloP100way_vertebrate phyloP100way_vertebrate_rankscore phyloP17way_primate phyloP17way_primate_rankscore phyloP30way_mammalian phyloP30way_mammalian_rankscore ref refcodon rs_dbSNP151 ada_score rf_score SpliceRegion MaxEntScan_alt MaxEntScan_diff MaxEntScan_ref GeneSplicer ExACpLI PHENOTYPES gnomADg gnomADg_AF_NFE gnomADg_POPMAX gnomADg_AF

not a good idea? Cheers

aparton commented 5 years ago

Hi,

We've had a couple of discussions about doing something like this in the past, however it's not something that we feel we can provide right now. Once you consider all possible insertions, deletions and indels alongside the simpler substitutions, then the size of the database would be prohibitive to most of our users. We do have a publically available mysql database encompassing all variants contained within Ensembl, these may be of interest to you: https://www.ensembl.org/info/data/mysql.html

Thank you for your interest in VEP, and if you have any other suggestions, please let us know.

Kind Regards, Andrew

thedam commented 5 years ago

yeah, sometimes I get such variants in "dirty regions":

20 | 33297811 | . | A | AC
20 | 33297811 | . | A | ACC
20 | 33297811 | . | A | C
20 | 33297811 | . | A | CC
20 | 33297811 | . | A | CCC

indeed it would be problematic...