Open EugeneEA opened 2 years ago
What sort of annotation would you hope to see? Names of spanned genes? +?
yes, basically annotation with a genomic feature (gene, exon/intron, UTR, etc.) and the possible consequence on gene expression (e.g. a frameshift in an exon, or an exon deletion/duplication/inversion, whatever)
VEP handles SV VCFs, so its annotation can be taken as an example
https://github.com/Illumina/Nirvana might also be relevant?
Mike Cariaso http://www.cariaso.com
Probably, I have not tried it
@EugeneEA Hi, yes, there is a plan to add support for SV, CNV, etc. in the future. It could be in this repo or a fork.
@rkimoakbioinformatics Thanks for the answer, but as far as I understand this is not planned for the near future, only for later development?
@EugeneEA I would like to start a discussion on it. Can you let me know what kind of output columns you would need? Something like the following?
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chrom | start     | end       | ref | alt   | all_mappings                                                                |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chr10 | 121593023 | 121603287 | N   | <DEL> | {"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]], |
|       |           |           |     |       |  "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}   |
| chr5  | 95849345  | 95853945  | N   | <DUP> | {"GENE3": [["P00003", "", "copy_number_gain", "ENST000009482.1", ""]]}      |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
For imprecise structural variants, would you still want to see predicted protein sequence change?
@rkimoakbioinformatics sorry for the long delay. Yes, that would definitely be sufficient for starters. The tricky part is probably the "transcript_ablation" field etc.; maybe an additional column should be added here, for example listing the deleted exons?
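The "deleted exons" column suggested above could be filled by a plain interval-overlap check. The following is only a minimal sketch, not OpenCRAVAT code: the function name `affected_exons`, the closed-interval convention, and the exon coordinates are assumptions made up for illustration.

```python
from typing import List, Tuple


def affected_exons(sv_start: int, sv_end: int,
                   exons: List[Tuple[int, int]]) -> List[int]:
    """Return 1-based numbers of exons overlapping the closed interval
    [sv_start, sv_end].

    `exons` is a hypothetical list of (start, end) pairs given in
    transcript order; real exon models would come from a gene annotation.
    """
    return [
        number
        for number, (ex_start, ex_end) in enumerate(exons, start=1)
        if ex_start <= sv_end and sv_start <= ex_end
    ]


# Made-up exon coordinates, for illustration only.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
print(affected_exons(250, 650, exons))  # [2, 3]
```

An empty result would mean the variant falls entirely in introns or outside the transcript.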
@EugeneEA Thanks. Below is a sketch. The current format of all_mappings is difficult to parse, sort, and filter, so the idea is to put each transcript on a separate line, something like:
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chrom | start    | end      | strand | ref     | alt      | imprecise | gene  | sequence ontology                           | transcript     | exon |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038273 | 2,3  |
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038284 | 2,3  |
| chr1  | 2784394  | 2984393  | +      | 20000nt | -        | imprecise | GENE3 | deletion,transcript_ablation                | ENST0000061234 |      |
| chr1  | 394823   | 399822   | +      | 5000nt  | 10000nt  | imprecise | GENE5 | duplication,partially_duplicated_transcript | ENST0000047283 | 4    |
| chr1  | 38584598 | 38683853 | +      | 99256nt | 198512nt | imprecise | GENE6 | duplication,transcript_amplification        | ENST0000038482 |      |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
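Going from the nested all_mappings value to one row per transcript can be sketched roughly as follows. This assumes each entry has five positions, with the sequence ontology in the third and the transcript in the fourth, as in the earlier example; the function name `flatten_all_mappings` and the output keys are hypothetical.

```python
import json


def flatten_all_mappings(all_mappings: str):
    """Yield one dict per (gene, transcript) pair from an all_mappings
    JSON string.

    Assumed entry layout, read off the example in this thread:
    [protein, <empty>, sequence_ontology, transcript, <empty>].
    """
    for gene, hits in json.loads(all_mappings).items():
        for protein, _unused1, so, transcript, _unused2 in hits:
            yield {
                "gene": gene,
                "protein": protein,
                "sequence_ontology": so,
                "transcript": transcript,
            }


example = (
    '{"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]],'
    ' "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}'
)
rows = list(flatten_all_mappings(example))
for row in rows:
    print(row["gene"], row["transcript"], row["sequence_ontology"])
```

Rows in this shape can be sorted and filtered with ordinary tabular tools, which is the point of the per-transcript layout.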
A VCF format specification document has a few structural variant examples:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 2827693 . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682 . T <DEL> 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 . C <DEL:ME:ALU> 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 . C <INS:ME:L1> 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100 . A <DUP> 14 PASS IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2
4 18665128 . T <DUP:TANDEM> 11 PASS IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10 GT:GQ:CN:CNQ ./.:0:5:8.3
Turning this into something like:
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chrom | start    | end      | strand | ref                                                                   | alt     | imprecise | gene   | sequence ontology              |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chr1  | 2827693  | 2827693  | +      | CGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA | -       |           | GENE7  | deletion                       |
| chr2  | 321682   | 321887   | +      | 206nt                                                                 | -       | imprecise | GENE8  | deletion                       |
| chr2  | 14477084 | 14477085 | +      | -                                                                     | 297nt   | imprecise | GENE9  | insertion,Alu_insertion        |
| chr3  | 9425916  | 9425917  | +      | -                                                                     | 6027nt  | imprecise | GENE10 | insertion,LINE1_insertion      |
| chr3  | 12665101 | 12686200 | +      | 21100nt                                                               | 42200nt | imprecise | GENE11 | duplication                    |
| chr4  | 18665128 | 18665204 | +      | 77nt                                                                  | 154nt   | imprecise | GENE12 | duplication,tandem_duplication |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
Of course, INFO fields should be parsed and recorded in other columns.
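Parsing the INFO fields of an SV record into separate columns might look like the sketch below. It is standard-library Python only and not the OpenCRAVAT converter; the helper names `parse_info` and `sv_row` are hypothetical, SVLEN is assumed to hold a single value, and POS is passed through unchanged rather than converted to the start coordinates used in the table above.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flags such as IMPRECISE map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def sv_row(vcf_line: str) -> dict:
    """Turn one tab-separated VCF data line into a flat dict of SV columns."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_d = parse_info(info)
    return {
        "chrom": chrom if chrom.startswith("chr") else "chr" + chrom,
        "pos": int(pos),
        "end": int(info_d["END"]) if "END" in info_d else None,
        "ref": ref,
        "alt": alt,
        "svtype": info_d.get("SVTYPE"),
        # Assumes SVLEN carries one integer value.
        "svlen": int(info_d["SVLEN"]) if "SVLEN" in info_d else None,
        "imprecise": "IMPRECISE" in info_d,
    }


# The imprecise deletion record quoted from the VCF specification above.
line = "\t".join([
    "2", "321682", ".", "T", "<DEL>", "6", "PASS",
    "IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62",
    "GT:GQ", "0/1:12",
])
row = sv_row(line)
print(row)
```

Confidence intervals such as CIPOS and CIEND stay available in the dict returned by `parse_info` if further columns are wanted.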
Would something like the above work for your purposes? Any feedback/suggestion would be appreciated.
@rkimoakbioinformatics thanks a lot for the replies! OK, that looks a bit too verbose; can we select the major transcript as we do for SNPs?
Sequence ontology is an extremely useful field, but its aggregation in VEP (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) simplifies filtering quite a bit; maybe it is worth implementing (also for SNPs).
Would these variants be annotated if they are present in some of the annotators (ClinVar, for example)? (I know that these are basically indels, but it still might be useful.)
@EugeneEA Yes, if the variants are in ClinVar or any other OpenCRAVAT annotator, they will be annotated. I am not sure yet how imprecise variants are treated in annotation data sources, but that will be the spirit.
As far as I know, VEP outputs sequence ontologies for each transcript on separate lines in its native output format, or on the same line delimited in the VCF format. I am not aware of aggregation by VEP - does it aggregate? If you mean, by aggregation, something like showing all sequence ontologies from all the variants for a transcript together, that has been planned but we haven't gotten to work on it yet.
By selecting a major transcript, do you mean OpenCRAVAT's current style: showing the mutation consequence on a representative transcript (either a MANE one or a custom choice for a gene), with the consequences on all the other transcripts where the variant falls shown in another column?
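The representative-transcript layout described above could be derived from per-transcript rows along these lines. This is a sketch, not OpenCRAVAT's implementation: the function name `collapse_to_primary`, the `other_transcripts` column, the fallback to the first transcript seen, and the choice of primary transcript in the example are all assumptions.

```python
def collapse_to_primary(rows, primary_by_gene):
    """Keep one row per gene (the representative transcript) and list the
    remaining transcripts of that gene in an extra column.

    `primary_by_gene` is a hypothetical mapping from gene to its MANE or
    custom transcript; if a gene has no entry, or the entry is not among
    the rows, the first transcript seen is used instead.
    """
    by_gene = {}
    for row in rows:
        by_gene.setdefault(row["gene"], []).append(row)

    collapsed = []
    for gene, gene_rows in by_gene.items():
        wanted = primary_by_gene.get(gene)
        primary = next(
            (r for r in gene_rows if r["transcript"] == wanted), gene_rows[0]
        )
        rest = [r["transcript"] for r in gene_rows if r is not primary]
        collapsed.append({**primary, "other_transcripts": ",".join(rest)})
    return collapsed


rows = [
    {"gene": "GENE2", "transcript": "ENST0000038273", "so": "deletion,exon_loss_variant"},
    {"gene": "GENE2", "transcript": "ENST0000038284", "so": "deletion,exon_loss_variant"},
    {"gene": "GENE3", "transcript": "ENST0000061234", "so": "deletion,transcript_ablation"},
]
# Picking ENST0000038284 as the representative is an arbitrary example.
collapsed = collapse_to_primary(rows, {"GENE2": "ENST0000038284"})
for row in collapsed:
    print(row["gene"], row["transcript"], row["other_transcripts"])
```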
@rkimoakbioinformatics By "VEP aggregation" I meant that they summarise sequence ontology from 30+ terms into 4 impact levels (HIGH, MODERATE, LOW, MODIFIER) and provide it as an additional info column. It is a useful feature for medical people; even though it is trivial to implement, it might be worth including in the default output.
Yes, that is exactly what I meant: either a custom choice or MANE, and the rest goes to another column.
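The impact summary mentioned above amounts to a lookup table plus taking the most severe level. The sketch below covers only a handful of sequence ontology terms, with levels following VEP's published consequence table; treating unlisted terms as MODIFIER is an assumption of this sketch and not VEP behaviour, and the function name `impact` is hypothetical.

```python
# Severity order of the four impact levels.
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

# Partial mapping only; a real table would list every consequence term.
SO_TO_IMPACT = {
    "transcript_ablation": "HIGH",
    "transcript_amplification": "HIGH",
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}


def impact(so_terms: str) -> str:
    """Return the most severe impact among comma-separated SO terms.

    Terms missing from SO_TO_IMPACT fall back to MODIFIER (an assumption).
    """
    terms = [t.strip() for t in so_terms.split(",") if t.strip()]
    levels = [SO_TO_IMPACT.get(t, "MODIFIER") for t in terms]
    return max(levels, key=IMPACT_RANK.__getitem__, default="MODIFIER")


print(impact("deletion,transcript_ablation"))
print(impact("missense_variant,intron_variant"))
```

A single column like this lets a filter such as "HIGH or MODERATE only" replace a long list of individual terms.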
Hi EugeneEA,
Just to catch you up, Rick Kim is no longer on the OpenCRAVAT team, but we are actively developing structural variant mapping and annotations. We'd appreciate if you might share other possible features that would interest you in addition to your comments in early 2022.
Hi! Nothing beyond what was mentioned earlier so far, but in general it would be super helpful if your SV support followed the same framework as the usual SNP/indel modules in terms of the possibility of adding custom annotators.
For example, we are analyzing a lot of samples with some SV detection tools, and currently I have to annotate each new sample with the SV frequency from an internal database using VEP + custom scripts. I'd love to switch to oc for both tasks. Therefore, for me a "VEP aggregation" column (or a set of columns which I can use as secondary input for such an annotator) and the possibility to add a custom annotator are a must.
Best, Eugene
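The internal-frequency lookup described above is often done by reciprocal overlap, since SV breakpoints rarely match exactly between samples. The following is a standalone sketch and not an OpenCRAVAT annotator: the record layout, the 0.8 threshold, and the names `reciprocal_overlap` and `sv_frequency` are assumptions for illustration.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for closed intervals, 0.0 if disjoint."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def sv_frequency(query, calls, n_samples, min_overlap=0.8):
    """Fraction of samples carrying an SV that matches `query`.

    query: (chrom, start, end, svtype)
    calls: iterable of (chrom, start, end, svtype, sample_id), one per call
           in a hypothetical internal database.
    """
    chrom, start, end, svtype = query
    carriers = {
        sample
        for c, s, e, t, sample in calls
        if c == chrom and t == svtype
        and reciprocal_overlap(start, end, s, e) >= min_overlap
    }
    return len(carriers) / n_samples


# Made-up calls from four samples.
calls = [
    ("chr1", 1000, 2000, "DEL", "S1"),
    ("chr1", 1050, 2050, "DEL", "S2"),
    ("chr1", 1000, 5000, "DEL", "S3"),
    ("chr1", 1000, 2000, "DUP", "S4"),
]
freq = sv_frequency(("chr1", 1000, 2000, "DEL"), calls, n_samples=4)
print(freq)  # 0.5
```

Counting distinct sample IDs rather than matching calls avoids counting a sample twice when a caller splits one event into several records.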
Hi, I've come across the problem that oc does not annotate SV VCFs. Are there plans to support SVs in the future, or maybe there is a workaround? The common line format:
chr1 964964 20 N <DEL> 137.6 . SVTYPE=DEL;SVLEN=-366;END=965330;STRANDS=+-:10;IMPRECISE;CIPOS=-30 ...etc
Best, Eugene