edg1983 / GREEN-VARAN

Annotate non-coding regulatory vars using our GREEN-DB, prediction scores, conservation and pop AF
MIT License
17 stars 6 forks source link

SV is not annotated with some regions despite overlap #15

Closed jsquaredosquared closed 5 months ago

jsquaredosquared commented 7 months ago

TL;DR

Why is the SV chr14_73,085,269_73,141,832_INV not annotated with the promoter 580535_pro (chr14_73,135,539_73,138,139) despite the promoter being fully within the SV region?


Thanks for this helpful tool!

Problem

I am having trouble understanding why a particular SV is not annotated with a particular region despite the SV region containing the regulatory region.

The SV I am trying to annotate is an inversion (chr14_73,085,269_73,141,832_INV).

chr14 | 73085269 | MantaINV:113901:0:2:0:0:0 | G | <INV> | 389 | PASS | END=73141832;SVTYPE=INV;SVLEN=56563;CIPOS=0,25;CIEND=-25,0;HOMLEN=25;HOMSEQ=GCCTCCCAAAGTGCTGGGATTACAG;INV3;SVCALLER=MANTA;INV5 | GT:FT:GQ:PL:PR:SR

I checked GREEN-DB, and the promoter 580535_pro (chr14_73,135,539_73,138,139) is fully contained within the inverted region.

580535_pro | chr14 | 73135539 | 73138139 | active_promoter,promoter | promoter | BENGI,DECRES,FOCS,ENCODE-HMM | deep_learning,HMM_prediction,roadmap | 0.129 | 0.620575 | PSEN1 | PSEN1 | 0 | PSEN1 | 0 | GM12878,HMEC,HUVEC,HelaS3,HepG2,K562,H1-hESC,HSMM,NHEK,NHLF | Sporadic,Bilateral tonic-clonic   seizure,Primitive reflex,Alexia,Mutism,Abulia,Abnormality of vision,Memory   impairment,Restlessness,Seizure,Language impairment,Cerebral cortical   atrophy,Inappropriate laughter,Dysphasia,Senile plaques,Abnormality of the cerebral   white matter,Visual agnosia,Neurofibrillary tangles,Abnormal lower motor   neuron morphology,Hypertonia,Parkinsonism,Abnormality of neutrophils,Brain   atrophy,Depressivity,Abnormal brain FDG positron emission   tomography,Grammar-specific speech disorder,Aphasia,Anomia,Dyslexia,Anxiety,Motor   aphasia,Hyperorality,Congestive heart failure,Lack of insight,Poor   speech,Elevated serum creatine kinase,Abnormality of extrapyramidal motor   function,Agnosia,Spoken Word Recognition Deficit,EEG with continuous slow   activity,Dysgraphia,Spastic tetraparesis,Alzheimer disease,EMG   abnormality,Frontotemporal dementia,Neuronal loss in central nervous   system,Restrictive behavior,Semantic dementia,Fasciculations,Thickened nuchal   skin fold,Frontotemporal cerebral atrophy,Echolalia,Inappropriate   behavior,Palmoplantar keratoderma,Astrocytosis,Babinski   sign,Hyperreflexia,Abnormal social behavior,Apraxia,Intellectual   disability,Personality changes,Agitation,Dilated   cardiomyopathy,Perifolliculitis,Optic ataxia,Irritability,Acne   inversa,Frontal lobe dementia,Temporal cortical atrophy,Adult   onset,Polyphagia,Gait   disturbance,Dementia,Heterogeneous,Myopathy,Confusion,Finger   agnosia,Lipoatrophy,Perseveration,Psychosis,Loss of speech,Upper motor neuron   dysfunction,Chronic furunculosis,Syncope,Dystonia,Amyotrophic lateral   sclerosis,Dyscalculia,Collectionism,Deposits immunoreactive to beta-amyloid   protein,Dysphagia,Disinhibition,Recurrent cutaneous abscess   formation,Myoclonus,Sensorineural hearing impairment,Hallucinations,Stereotypy,Oculomotor   apraxia,Ataxia,Emotional blunting,Inappropriate sexual behavior,Lower limb   hyperreflexia,Autosomal dominant inheritance,Apathy,Rapidly   progressive,Gliosis,Dysarthria,Aggressive behavior

However, the SV is not annotated with this promoter. There are some other regulatory regions contained in the inverted region (e.g., 145783_pro at chr14:73,137,653-73,137,980), but these are also not added to the annotations.

The SV is only annotated with 417710_enh (chr14:73,084,745-73,092,455).

Command

greenvaran sv \
-i input.vcf \
-o output.vcf \
-d GRCh38_GREEN-DB.bed.gz \
-s greendb_schema_v2.5.json \
-g gene_list.txt \
-p 1
jsquaredosquared commented 5 months ago

@edg1983 do you have any idea what could be going on? Any help would be appreciated 🥲

edg1983 commented 5 months ago

Hello!

I'm sorry for taking so long to fix this. I've investigated using your test case and found a bug causing SV intervals not to be generated properly. Essentially, the intervals were evaluated as just POS-POS+1 instead of POS-SVEND. It is now fixed in v1.3 (commit 0d2ebfc).

I've tested with some variants, including your example, and I can now get all the expected overlapping events. Please update to the new release (v1.3) and confirm it's running correctly.

This is the comparison for your specific event INV chr14:73085269-73141832

#Expected   Observed
145767_pro  145767_pro
145768_pro  145768_pro
145769_pro  145769_pro
145770_pro  145770_pro
145771_pro  145771_pro
145772_pro  145772_pro
145773_pro  145773_pro
145774_pro  145774_pro
145775_pro  145775_pro
145776_pro  145776_pro
145777_pro  145777_pro
145778_pro  145778_pro
145779_pro  145779_pro
145780_pro  145780_pro
145781_pro  145781_pro
145782_pro  145782_pro
145783_pro  145783_pro
1553503_enh 1553503_enh
1553504_enh 1553504_enh
1553505_enh 1553505_enh
1553506_enh 1553506_enh
1553507_enh 1553507_enh
1553508_enh 1553508_enh
1553509_enh 1553509_enh
1800077_enh 1800077_enh
417710_enh  417710_enh
417711_enh  417711_enh
417712_enh  417712_enh
417713_enh  417713_enh
417714_enh  417714_enh
417715_enh  417715_enh
417716_enh  417716_enh
417717_enh  417717_enh
417718_enh  417718_enh
417719_enh  417719_enh
417720_enh  417720_enh
417721_enh  417721_enh
417722_enh  417722_enh
417723_enh  417723_enh
417724_enh  417724_enh
417725_enh  417725_enh
417726_enh  417726_enh
417727_enh  417727_enh
417728_enh  417728_enh
417729_enh  417729_enh
417730_enh  417730_enh
417731_enh  417731_enh
417732_enh  417732_enh
417733_enh  417733_enh
417734_enh  417734_enh
417735_enh  417735_enh
417736_enh  417736_enh
417737_enh  417737_enh
417738_enh  417738_enh
417739_enh  417739_enh
417740_enh  417740_enh
417741_enh  417741_enh
417742_enh  417742_enh
417743_enh  417743_enh
566516_pro  566516_pro
566518_pro  566518_pro
580535_pro  580535_pro
jsquaredosquared commented 5 months ago

Thank you so much for looking into it!

Just one thing: you mentioned the interval is now evaluated from POS and [SV]END. But from my experience, sometimes the END is not guaranteed to be present. SV callers such as Manta and TARDIS do not always report it. The extent of the SV may need to be inferred from POS and SVLEN, which may be present when END is not present. (I think this is how vcfanno does it.)

Perhaps this is something that could be taken into account, so that even if END is not present the interval could still be calculated?


P.S., I couldn't quickly find an example that overlaps a regulatory region, but here are some examples where END is not reported (usually when the SV is PRECISE, and the exact sequence is given in ALT instead of just a symbolic allele):

Example from Manta:

chr1    4129813 MantaDEL:332:0:0:0:0:0  TCCTCTGCCCCACGTGGCGTTGGGGTCCCAGCAGGAGATGTGGGCCGGGAGCCTGTTGATC   T   267 PASS    SVTYPE=DEL;SVLEN=-60;CIGAR=1M60D;CIPOS=0,45;HOMLEN=45;HOMSEQ=CCTCTGCCCCACGTGGCGTTGGGGTCCCAGCAGGAGATGTGGGCC  GT:FT:GQ:PL:PR:SR

Example from TARDIS:

chr1    14841386    vh_del_21   acagagcaagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGAAAT   a   255 PASS    SVLEN=-61;RPSUP=6;SRSUP=12;CIEND=0,0;CIPOS=0,0;PRECISE;SVALG=TARDIS_v1.0.9;SVTYPE=DEL   GT:CNVL:RP:SR
edg1983 commented 5 months ago

Hi,

The logic built into the tool to create SV intervals is more complex and works as follows. I've now made it more robust, incorporating your suggestions, and created a new release (1.3.1). Please use this one.

For DEL, DUP, INV:

  1. When END is present in INFO, this is used as end_position
  2. Otherwise, it looks for SVLEN in INFO, takes the absolute value of this, and sets end_position=POS+abs(SVLEN)
  3. When both are missing, the variant is skipped
  4. When the configured CIPOS field is present in INFO, this is used to correct the start and end positions (if missing, then is set to zero)

Since INS or BND are mostly represented by single positions in VCF, i.e. SVLEN=1 and END=POS+1, you can configure a manual padding value using --padding. This padding value is added/subtracted after all previous steps when SVTYPE equals INS or BND.

Some examples

CHROM POS SVTYPE SVLEN END CIPOS PADDING INTERVAL
chr1 1000 DEL -100 1100 -5,10 5 chr1:995-1110
chr1 1000 DEL -200 NA -5,10 5 chr1:995-1210
chr1 1000 DEL -100 1100 NA 5 chr1:1000-1100
chr1 1000 DEL -200 NA NA 5 chr1:1000-1200
chr1 1000 INS 1 1001 -5,10 5 chr1:990-1016
chr1 1000 INS 1 1001 NA 20 chr1:980-1021
chr1 1000 INS 1 1001 NA NA chr1:1000-1001
jsquaredosquared commented 5 months ago

Awesome! Thanks again :smile: :+1:

edg1983 commented 5 months ago

Solved in version 1.3.1