Structural variant vcf annotation

EugeneEA commented 2 years ago

Hi, I've come across the problem that oc does not annotate SV VCFs. Are there plans to support SVs in the future, or is there maybe a workaround? A typical line looks like: chr1 964964 20 N <DEL> 137.6 . SVTYPE=DEL;SVLEN=-366;END=965330;STRANDS=+-:10;IMPRECISE;CIPOS=-30 ...etc

Best, Eugene

cariaso commented 2 years ago

What sort of annotation would you hope to see? Names of spanned genes? +?

EugeneEA commented 2 years ago

Yes, basically annotation with a genomic feature (gene, exon/intron, UTR, etc.) and the possible consequence on gene expression (e.g. whether a frameshift in an exon happens, or an exon deletion/duplication/inversion, whatever).

VEP handles SV VCFs, so its annotation can be taken as an example.

cariaso commented 2 years ago

https://github.com/Illumina/Nirvana might also be relevant?


EugeneEA commented 2 years ago

Probably, I have not tried it

rkimoakbioinformatics commented 2 years ago

@EugeneEA Hi, yes, there is a plan to add support for SV, CNV, etc. in the future. It could be in this repo or a fork.

EugeneEA commented 2 years ago

@rkimoakbioinformatics Thanks for the answer. As far as I understand, this is not for the near future but a plan for later development?

rkimoakbioinformatics commented 2 years ago

@EugeneEA I would like to start a discussion on it. Can you let me know what kind of output columns you would need? Something like the following?

+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chrom | start     | end       | ref | alt   | all_mappings                                                                |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chr10 | 121593023 | 121603287 | N   | <DEL> | {"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]], |
|       |           |           |     |       |  "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}   |
| chr5  | 95849345  | 95853945  | N   | <DUP> | {"GENE3": [["P00003", "", "copy_number_gain", "ENST000009482.1", ""]]}      |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+

For imprecise structural variants, would you still want to see predicted protein sequence change?

EugeneEA commented 2 years ago

@rkimoakbioinformatics Sorry for the long delay. Yes, that would definitely be sufficient for starters. The tricky part is probably the field "transcript_ablation" etc. Maybe an additional column should be added here, for example listing the deleted exons, etc.?
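
A column listing the affected exons could, as a minimal sketch, be derived from a plain interval-overlap check. The function name `affected_exons`, the tuple layout of `exons`, and the 1-based inclusive coordinates are all assumptions made for illustration; none of this is OpenCRAVAT code.

```python
def affected_exons(sv_start, sv_end, exons):
    """Return (exon_number, "full" | "partial") for exons overlapping an SV.

    Coordinates are assumed 1-based and inclusive. `exons` is a hypothetical
    list of (exon_number, start, end) tuples for a single transcript.
    """
    hits = []
    for number, start, end in exons:
        if end < sv_start or start > sv_end:
            continue  # exon lies entirely outside the SV interval
        fully_covered = sv_start <= start and end <= sv_end
        hits.append((number, "full" if fully_covered else "partial"))
    return hits


# Made-up exon coordinates, only to show the call.
exons = [(1, 100, 200), (2, 300, 400), (3, 500, 600), (4, 700, 800)]
print(affected_exons(250, 550, exons))
```

For imprecise calls, the same check could be run on an interval widened by the confidence intervals, which would separate exons that are certainly hit from those that are only possibly hit.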

rkimoakbioinformatics commented 2 years ago

@EugeneEA Thanks. Below is a sketch. The current format of all_mappings is difficult to parse, sort, and filter. Thus, I suggest putting each transcript on a separate line, something like:

+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chrom | start    | end      | strand | ref     | alt      | imprecise | gene  | sequence ontology                           | transcript     | exon |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038273 | 2,3  |
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038284 | 2,3  |
| chr1  | 2784394  | 2984393  | +      | 20000nt | -        | imprecise | GENE3 | deletion,transcript_ablation                | ENST0000061234 |      |
| chr1  | 394823   | 399822   | +      | 5000nt  | 10000nt  | imprecise | GENE5 | duplication,partially_duplicated_transcript | ENST0000047283 | 4    |
| chr1  | 38584598 | 38683853 | +      | 99256nt | 198512nt | imprecise | GENE6 | duplication,transcript_amplification        | ENST0000038482 |      |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
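
Going from a nested all_mappings value to one row per transcript could look roughly like the sketch below. It assumes all_mappings is a JSON string shaped like the earlier example in this thread, with the sequence ontology at index 2 and the transcript ID at index 3 of each inner list; `flatten_all_mappings` is a hypothetical helper name, not an existing OpenCRAVAT function.

```python
import json


def flatten_all_mappings(variant):
    """Expand one variant row into one row per (gene, transcript).

    `variant` is a dict with an "all_mappings" JSON string; the positions of
    the sequence ontology (2) and transcript (3) are read off the example
    earlier in this thread and are an assumption.
    """
    mappings = json.loads(variant["all_mappings"])
    rows = []
    for gene, transcripts in mappings.items():
        for entry in transcripts:
            # Copy every column except the nested one, then add flat columns.
            row = {k: v for k, v in variant.items() if k != "all_mappings"}
            row["gene"] = gene
            row["sequence_ontology"] = entry[2]
            row["transcript"] = entry[3]
            rows.append(row)
    return rows


variant = {
    "chrom": "chr10",
    "start": 121593023,
    "end": 121603287,
    "all_mappings": json.dumps({
        "GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]],
        "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]],
    }),
}
for row in flatten_all_mappings(variant):
    print(row)
```

Flat rows like these can be sorted and filtered with ordinary table tools, at the cost of repeating the variant columns once per transcript.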

A VCF format specification document has a few structural variant examples:

#CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001
1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682    . T <DEL>   6 PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62  GT:GQ 0/1:12
2 14477084  . C <DEL:ME:ALU>  12  PASS  IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ 0/1:12
3 9425916   . C <INS:ME:L1> 23  PASS  IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100  . A <DUP>   14  PASS  IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  . T <DUP:TANDEM>  11  PASS  IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10  GT:GQ:CN:CNQ  ./.:0:5:8.3

Turning this into something like:

+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chrom | start    | end      | strand | ref                                                                   | alt     | imprecise | gene   | sequence ontology              |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chr1  | 2827693  | 2827693  | +      | CGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA | -       |           | GENE7  | deletion                       |
| chr2  | 321682   | 321887   | +      | 206nt                                                                 | -       | imprecise | GENE8  | deletion                       |
| chr2  | 14477084 | 14477085 | +      | -                                                                     | 297nt   | imprecise | GENE9  | insertion,Alu_insertion        |
| chr3  | 9425916  | 9425917  | +      | -                                                                     | 6027nt  | imprecise | GENE10 | insertion,LINE1_insertion      |
| chr3  | 12665101 | 12686200 | +      | 21100nt                                                               | 42200nt | imprecise | GENE11 | duplication                    |
| chr4  | 18665128 | 18665204 | +      | 77nt                                                                  | 154nt   | imprecise | GENE12 | duplication,tandem_duplication |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+

Of course, INFO fields should be parsed and recorded in other columns.

Would something like the above work for your purposes? Any feedback/suggestion would be appreciated.
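
As a rough illustration of parsing INFO fields into separate columns, the sketch below splits an INFO string into key/value pairs and pulls out a few SV-related keys. The helper names and the chosen output column names are assumptions for illustration only; the INFO keys themselves (SVTYPE, END, SVLEN, IMPRECISE, CIPOS, CIEND) come from the VCF examples above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    parsed = {}
    for field in info.split(";"):
        if not field or field == ".":
            continue
        key, sep, value = field.partition("=")
        parsed[key] = value if sep else True
    return parsed


def _ints(value):
    """Turn a comma-separated string such as '-56,20' into a list of ints."""
    return [int(x) for x in value.split(",")] if value else None


def sv_columns(info):
    """Pick a few SV-related INFO keys as candidate output columns."""
    parsed = parse_info(info)
    return {
        "svtype": parsed.get("SVTYPE"),
        "end": int(parsed["END"]) if "END" in parsed else None,
        "svlen": _ints(parsed.get("SVLEN")),
        "imprecise": "IMPRECISE" in parsed,
        "cipos": _ints(parsed.get("CIPOS")),
        "ciend": _ints(parsed.get("CIEND")),
    }


info = "IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62"
print(sv_columns(info))
```

SVLEN is kept as a list here because the VCF format allows one value per ALT allele; a real converter would also need to handle multi-allelic records and missing values.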

EugeneEA commented 2 years ago

@rkimoakbioinformatics Thanks a lot for the replies! OK, that looks a bit too verbose. Can we select the major transcript, as we do for SNPs?

Sequence ontology is an extremely useful field, but its aggregation in VEP (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) simplifies filtering quite a bit. Maybe it is worth implementing (also for SNPs).

Would these variants be annotated if they are present in some of the annotators (ClinVar, for example)? I know that these are basically indels, but it still might be useful.

rkimoakbioinformatics commented 2 years ago

@EugeneEA Yes, if the variants are in ClinVar or any other OpenCRAVAT annotator's data, they will be annotated. I am not sure yet how imprecise variants are treated in annotation data sources, but that will be the spirit.

As far as I know, VEP outputs sequence ontologies for each transcript on separate lines in its native output format, or on the same line delimited in the VCF format. I am not aware of aggregation by VEP - does it aggregate? If you mean, by aggregation, something like showing all sequence ontologies from all the variants for a transcript together, that has been planned but we haven't gotten to work on it yet.

By selecting a major transcript, do you mean OpenCRAVAT's current style of showing the mutation consequence on a representative transcript (either a MANE one or a custom choice for a gene) in one column, and the consequences on all the other transcripts where the variant falls in another column?

EugeneEA commented 2 years ago

@rkimoakbioinformatics By "VEP aggregation" I meant that they summarise the sequence ontology from 30+ terms into 4 (HIGH, MODERATE, LOW, MODIFIER) and provide it as an additional info column. It is a useful feature for medical people; even though it is trivial to implement, it might be worth including in the default output.

Yes, that is exactly what I meant: either consequence or MANE, and the rest goes to another column.
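
The four-level summary described above could be sketched as a lookup table plus a "most severe wins" rule. The table below is deliberately partial and only illustrative; the Ensembl page linked earlier in the thread is the reference for the full mapping. The names `SO_TO_IMPACT`, `SEVERITY`, and `summarize_impact` are hypothetical, and returning `None` for unmapped terms is a design choice of this sketch.

```python
# Partial, illustrative SO term -> impact table (not the complete VEP list).
SO_TO_IMPACT = {
    "transcript_ablation": "HIGH",
    "transcript_amplification": "HIGH",
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}

# Ordered from least to most severe.
SEVERITY = ["MODIFIER", "LOW", "MODERATE", "HIGH"]


def summarize_impact(sequence_ontology):
    """Return the most severe impact among comma-separated SO terms.

    Terms missing from the table are ignored; None means nothing was mapped.
    """
    terms = [t.strip() for t in sequence_ontology.split(",") if t.strip()]
    impacts = [SO_TO_IMPACT[t] for t in terms if t in SO_TO_IMPACT]
    if not impacts:
        return None
    return max(impacts, key=SEVERITY.index)


print(summarize_impact("intron_variant,frameshift_variant"))
```

Keeping the impact in its own column would let users filter on four values instead of enumerating every sequence ontology term.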

RachelKarchin commented 5 months ago

Hi EugeneEA,

Just to catch you up, Rick Kim is no longer on the OpenCRAVAT team, but we are actively developing structural variant mapping and annotations. We'd appreciate if you might share other possible features that would interest you in addition to your comments in early 2022.

EugeneEA commented 5 months ago

Hi! Nothing beyond what was mentioned earlier so far, but in general it would be super helpful if your SV support followed the same framework as the usual SNP/indel module in terms of the possibility of adding custom annotators.

For example, we are analyzing a lot of samples with several SV detection tools, and currently I have to annotate each new sample with the SV frequency from an internal database using VEP + custom scripts. I'd love to switch to oc for both tasks. Therefore, for me a "VEP aggregation" column (or a set of columns which I can use as secondary input for such an annotator) and the possibility to add a custom annotator are a must.
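
To make the frequency use case concrete, here is a minimal standalone sketch of looking up an SV in an internal table. It does not use the OpenCRAVAT annotator API. Matching by same chromosome, same SV type, and at least 50% reciprocal overlap is an assumption of this sketch, not something stated in this thread, and the record layout and function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions (1-based, inclusive coordinates)."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def lookup_frequency(sv, database, min_overlap=0.5):
    """Return the highest frequency among database SVs that match `sv`.

    `sv` and each database record are dicts with chrom, svtype, start, end;
    records also carry a "freq" value. Returns None when nothing matches.
    """
    matches = [
        rec["freq"]
        for rec in database
        if rec["chrom"] == sv["chrom"]
        and rec["svtype"] == sv["svtype"]
        and reciprocal_overlap(sv["start"], sv["end"], rec["start"], rec["end"]) >= min_overlap
    ]
    return max(matches) if matches else None


# Made-up internal database records, only to show the call.
database = [
    {"chrom": "chr1", "svtype": "DEL", "start": 1000, "end": 2000, "freq": 0.12},
    {"chrom": "chr1", "svtype": "DUP", "start": 1000, "end": 2000, "freq": 0.5},
]
sv = {"chrom": "chr1", "svtype": "DEL", "start": 1100, "end": 2100}
print(lookup_frequency(sv, database))
```

A linear scan like this is only suitable for small tables; a real annotator would likely index the database by chromosome and position.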

Best, Eugene

KarchinLab / open-cravat

Structural variant vcf annotation #97