Selection of impact of variants annotated by VEP, group variants

FerriolCalvet commented 1 year ago

It would be good to have a dictionary or a ranking to resolve cases where a given variant is annotated with more than one consequence (see the examples below).

I guess some projects in the lab have already defined a strategy for this, but it would be good to have a common one. We could rank the consequences listed at http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html, or rely on the order in which they appear in the annotation, which already seems to be meaningful. We can first agree on a solution and then write a script (or some other tool) that applies it consistently.

We could also define a way of grouping consequences into a few categories: nonsense, missense, splice-affecting, and synonymous.

chr10:103590136_A>T     chr10:103590136 T       ENSG00000107954 ENST00000369780 Transcript      missense_variant,splice_region_variant  2172    1489    497
chr10:113850156_C>G     chr10:113850156 G       ENSG00000288933 ENST00000692647 Transcript      intron_variant,non_coding_transcript_variant    -       -
chr10:132808904_T>C     chr10:132808904 C       ENSG00000171811 ENST00000368586 Transcript      splice_region_variant,synonymous_variant        7751    7665
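
To make the "order in which they appear" option concrete, here is a minimal sketch that pulls the consequence field out of rows like the ones above and keeps the first term. It assumes the columns are whitespace-separated and that the consequence is the seventh column, as in these examples; `consequences_of` and `first_listed` are hypothetical helper names, not existing lab code.

```python
# Sketch only: column position and separators are assumptions based on the
# three example rows in this comment.
CONSEQUENCE_COLUMN = 6  # 0-based index of the consequence field


def consequences_of(row):
    """Return the list of consequence terms annotated on one VEP output row."""
    fields = row.split()  # split on any run of whitespace (tabs or spaces)
    return fields[CONSEQUENCE_COLUMN].split(",")


def first_listed(row):
    """Pick the first consequence term as it appears in the annotation."""
    return consequences_of(row)[0]


rows = [
    "chr10:103590136_A>T chr10:103590136 T ENSG00000107954 ENST00000369780 "
    "Transcript missense_variant,splice_region_variant 2172 1489 497",
    "chr10:113850156_C>G chr10:113850156 G ENSG00000288933 ENST00000692647 "
    "Transcript intron_variant,non_coding_transcript_variant - -",
    "chr10:132808904_T>C chr10:132808904 C ENSG00000171811 ENST00000368586 "
    "Transcript splice_region_variant,synonymous_variant 7751 7665",
]

for row in rows:
    variant_id = row.split()[0]
    print(variant_id, first_listed(row))
```

Whether the first-listed term is always the most severe one is exactly what would need checking before relying on this.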

koszulordie commented 1 year ago

So far we have been using the ranking of Sequence Ontology (SO) terms for variant consequence types given at https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html. For grouping variants we have been using ad hoc collections on a case-by-case basis (see e.g. boostDM). I agree that it would be good to establish a consensus grouping of SO consequence types for general variant analyses.

koszulordie commented 1 year ago

From legacy code that has been used in the lab:

# Consequence list taken from: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
# From high to low deleteriousness
CONSEQUENCES_LIST = [
    'transcript_ablation',
    'splice_acceptor_variant',
    'splice_donor_variant',
    'stop_gained',
    'frameshift_variant',
    'stop_lost',
    'start_lost',
    'transcript_amplification',
    'inframe_insertion',
    'inframe_deletion',
    'missense_variant',
    'protein_altering_variant',
    'splice_region_variant',
    'incomplete_terminal_codon_variant',
    'start_retained_variant',
    'stop_retained_variant',
    'synonymous_variant',
    'coding_sequence_variant',
    'mature_miRNA_variant',
    '5_prime_UTR_variant',
    '3_prime_UTR_variant',
    'non_coding_transcript_exon_variant',
    'intron_variant',
    'NMD_transcript_variant',
    'non_coding_transcript_variant',
    'upstream_gene_variant',
    'downstream_gene_variant',
    'TFBS_ablation',
    'TFBS_amplification',
    'TF_binding_site_variant',
    'regulatory_region_ablation',
    'regulatory_region_amplification',
    'feature_elongation',
    'regulatory_region_variant',
    'feature_truncation',
    'intergenic_variant'
]

# Coarse impact groups for a subset of the consequence types above
GROUPING_DICT = {
    'synonymous_variant': 'synonymous',
    'missense_variant': 'missense',
    'stop_gained': 'nonsense',
    'stop_lost': 'nonsense',
    'start_lost': 'nonsense',
    'splice_donor_variant': 'splicing',
    'splice_acceptor_variant': 'splicing',
    'splice_region_variant': 'splicing'
}
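
As a rough illustration of how a ranked list and a grouping dictionary like these could be combined, the sketch below picks the highest-ranked term from a comma-separated consequence field and maps it to a group. The function names, the abbreviated `RANKING_EXCERPT` and `GROUPING_EXCERPT`, and the `"other"` fallback for ungrouped terms are assumptions made for the example, not part of the legacy code.

```python
def most_severe(consequence_field, ranking):
    """Return the term from a comma-separated field that ranks highest.

    `ranking` is ordered from most to least deleterious. Terms missing from
    the ranking are treated as least severe (an assumption of this sketch).
    """
    rank = {term: position for position, term in enumerate(ranking)}
    terms = consequence_field.split(",")
    return min(terms, key=lambda term: rank.get(term, len(ranking)))


def impact_group(consequence_field, ranking, grouping, default="other"):
    """Map a consequence field to a coarse group via its most severe term."""
    return grouping.get(most_severe(consequence_field, ranking), default)


# Abbreviated excerpts for the example; the full list and dictionary would
# be passed in the same way.
RANKING_EXCERPT = [
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
    "intron_variant",
    "non_coding_transcript_variant",
]
GROUPING_EXCERPT = {
    "synonymous_variant": "synonymous",
    "missense_variant": "missense",
    "splice_region_variant": "splicing",
}

for field in (
    "missense_variant,splice_region_variant",
    "intron_variant,non_coding_transcript_variant",
    "splice_region_variant,synonymous_variant",
):
    print(field, "->", impact_group(field, RANKING_EXCERPT, GROUPING_EXCERPT))
```

Passing the ranking and grouping as arguments keeps the selection logic independent of whichever consensus list is eventually agreed on.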

FerriolCalvet commented 1 year ago

Closed in this commit.

Sorry I did not do a pull request...

bbglab / bbgwiki

Selection of impact of variants annotated by VEP, group variants #17