Closed by FerriolCalvet 1 year ago
So far we have been using the following ranking of Sequence Ontology terms associated with the consequence types of variants: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html For grouping variants, we have been using ad hoc collections on a case-by-case basis (see e.g. boostDM). I agree that it would be good to establish, once and for all, a consensus grouping of SO consequence types for general variant analyses.
From legacy code that has been used in the lab:
# Consequence list taken from: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
# From high to low deleteriousness
CONSEQUENCES_LIST = [
    'transcript_ablation',
    'splice_acceptor_variant',
    'splice_donor_variant',
    'stop_gained',
    'frameshift_variant',
    'stop_lost',
    'start_lost',
    'transcript_amplification',
    'inframe_insertion',
    'inframe_deletion',
    'missense_variant',
    'protein_altering_variant',
    'splice_region_variant',
    'incomplete_terminal_codon_variant',
    'start_retained_variant',
    'stop_retained_variant',
    'synonymous_variant',
    'coding_sequence_variant',
    'mature_miRNA_variant',
    '5_prime_UTR_variant',
    '3_prime_UTR_variant',
    'non_coding_transcript_exon_variant',
    'intron_variant',
    'NMD_transcript_variant',
    'non_coding_transcript_variant',
    'upstream_gene_variant',
    'downstream_gene_variant',
    'TFBS_ablation',
    'TFBS_amplification',
    'TF_binding_site_variant',
    'regulatory_region_ablation',
    'regulatory_region_amplification',
    'feature_elongation',
    'regulatory_region_variant',
    'feature_truncation',
    'intergenic_variant'
]
GROUPING_DICT = {
    'synonymous_variant': 'synonymous',
    'missense_variant': 'missense',
    'stop_gained': 'nonsense',
    'stop_lost': 'nonsense',
    'start_lost': 'nonsense',
    'splice_donor_variant': 'splicing',
    'splice_acceptor_variant': 'splicing',
    'splice_region_variant': 'splicing'
}
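As a rough sketch of how these two structures could be combined for variants annotated with more than one consequence: pick the term that comes first in the ranking, then look up its group. The names `CONSEQUENCE_RANK`, `most_severe` and `group_of`, the `'other'` fallback and the `'&'` separator are all assumptions made for illustration, not part of the legacy code, and the list and dictionary are abbreviated copies so that the snippet runs on its own.

```python
# Abbreviated copies of the structures above, so the sketch is self-contained.
# In practice the full CONSEQUENCES_LIST and GROUPING_DICT would be used.
CONSEQUENCES_LIST = [
    'splice_acceptor_variant',
    'splice_donor_variant',
    'stop_gained',
    'frameshift_variant',
    'missense_variant',
    'splice_region_variant',
    'synonymous_variant',
    'intron_variant',
]
GROUPING_DICT = {
    'synonymous_variant': 'synonymous',
    'missense_variant': 'missense',
    'stop_gained': 'nonsense',
    'splice_donor_variant': 'splicing',
    'splice_acceptor_variant': 'splicing',
    'splice_region_variant': 'splicing',
}

# Position in the list doubles as the rank: lower index = more deleterious.
CONSEQUENCE_RANK = {term: i for i, term in enumerate(CONSEQUENCES_LIST)}


def most_severe(consequence_field, sep='&'):
    """Return the highest-ranked term of a multi-consequence annotation.

    `sep` is an assumption about how the terms are joined in the input.
    Terms missing from the ranking are treated as least severe.
    """
    terms = [t.strip() for t in consequence_field.split(sep) if t.strip()]
    if not terms:
        raise ValueError('empty consequence annotation')
    worst_rank = len(CONSEQUENCE_RANK)
    return min(terms, key=lambda t: CONSEQUENCE_RANK.get(t, worst_rank))


def group_of(consequence_field, sep='&', default='other'):
    """Map an annotation to a group via its most severe term."""
    return GROUPING_DICT.get(most_severe(consequence_field, sep), default)
```

With this approach, any term that has no entry in `GROUPING_DICT` (for example `frameshift_variant`) falls into the fallback group, so the choice of fallback label would need to be agreed on as well.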
Closed in this commit.
Sorry I did not do a pull request...
It would be good to have a dictionary or a ranking to resolve the cases where a given variant is annotated with more than one consequence (see examples below).
I guess some projects in the lab have already defined some sort of strategy for this, but it would be good to have a common one. We could either rank the consequences here or use the order in which they appear, which already seems to be meaningful (see http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html). We can first agree on or propose a solution, and then write a script or find some other way of fixing all this.
We could also define a way of grouping consequences into four categories: nonsense, missense, splice-affecting and synonymous.
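One possible way to make such a grouping easy to review is to write it as group -> terms and derive the per-term lookup from it, checking that no term lands in two groups. This is only a sketch: `GROUPS`, `invert_groups` and `ungrouped` are hypothetical names, and the memberships simply mirror the legacy `GROUPING_DICT` rather than any agreed consensus.

```python
# Hypothetical group -> terms layout; memberships mirror the legacy GROUPING_DICT.
GROUPS = {
    'nonsense': ['stop_gained', 'stop_lost', 'start_lost'],
    'missense': ['missense_variant'],
    'splicing': ['splice_donor_variant', 'splice_acceptor_variant',
                 'splice_region_variant'],
    'synonymous': ['synonymous_variant'],
}


def invert_groups(groups):
    """Flatten group -> terms into term -> group, rejecting duplicate terms."""
    term_to_group = {}
    for group, terms in groups.items():
        for term in terms:
            if term in term_to_group:
                raise ValueError(
                    f'{term!r} is in both {term_to_group[term]!r} and {group!r}'
                )
            term_to_group[term] = group
    return term_to_group


def ungrouped(ranked_terms, term_to_group):
    """List the ranked terms that no group claims, keeping their order."""
    return [t for t in ranked_terms if t not in term_to_group]


TERM_TO_GROUP = invert_groups(GROUPS)
```

Running `ungrouped` over the full ranked list would show which consequence types still need a decision before the grouping can be called complete.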