Scott-Devine / MELT-LRA

MELT-LRA: Mobile Element Insertion Site Classifier

Fix SVA identification #4

Closed jonathancrabtree closed 1 year ago

jonathancrabtree commented 1 year ago

SVA identification isn't working because the SVA reference sequence does not contain the Alu-like domain (by design: it was causing too many spurious matches). SVA = SINE-VNTR-Alu, with 6 subfamilies.

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3794087/: SVAs are composite elements consisting of multiple domains: a CCCTCT repeat, Alu-like domain, a GC-rich variable number of tandem repeat (VNTR), and SINE-R derived from the HERV-K LTR element [7-9, 21, 22]. They are flanked by target site duplications and terminate in a poly(A) tail (Fig. 1). In genomic sequence analysis, SVA elements are present in G + C-rich regions; however, they do not have any preferences for inter- or intragenic regions. SVA families are separated into six subfamilies (SVA-A to SVA-F), based upon point mutation and insertion and deletion events within the SINE-R [22]. Among them, four subfamilies (SVA-A, SVA-B, SVA-C, SVA-D) are present in gibbons and orangutans, while two subfamilies (SVA-E and SVA-F) are restricted to the human lineage.

Figure 1 from the above paper:

[Image: screenshot of Figure 1 from the paper]

jonathancrabtree commented 1 year ago

This seems to be a more nuanced problem and simply adding back the Alu-like domain isn't sufficient. We've tried replacing the MELT SVA reference with SVA_A and SVA_F and actually obtained fewer ME calls:

    1149    1149   16966 SVA_A-loci.txt
    1143    1143   16878 SVA_F-loci.txt
    1219    1219   17989 all-loci.txt

jonathancrabtree commented 1 year ago

This may have been at least partially addressed by the v1.1.0 release, which made proportionally more SVA calls compared to 1.0.0:

1.0.0 HG00514-CCS-PAV-MEs-v1.0.0.txt
 128 SVA
 166 LINE1
 925 ALU
1220 MEIs total

1.1.0 - 90%/90%/100 bp
 316 SVA               (2.47x)
 249 LINE1             (1.5x)
1253 ALU               (1.35x)
1818 MEIs total

1.1.0 - 90%/90%/95 bp
 349 SVA
 254 LINE1
1257 ALU
1860 MEIs total
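
For reference, the fold changes quoted for the 90%/90%/100 bp run can be re-derived from the per-type counts. This is only a minimal sketch: the dictionary and function names are made up for illustration and are not part of MELT-LRA, and the share calculation assumes the three listed ME types are the whole call set.

```python
# Per-type MEI counts copied from the listings above.
# The names below are hypothetical and exist only for this example.
v1_0_0 = {"SVA": 128, "LINE1": 166, "ALU": 925}
v1_1_0 = {"SVA": 316, "LINE1": 249, "ALU": 1253}  # 90%/90%/100 bp run


def fold_changes(old, new):
    """Return the new/old count ratio for each ME type present in both runs."""
    return {me: new[me] / old[me] for me in old if me in new}


def shares(counts):
    """Return each ME type's fraction of the summed per-type counts."""
    total = sum(counts.values())
    return {me: n / total for me, n in counts.items()}


if __name__ == "__main__":
    for me, fc in fold_changes(v1_0_0, v1_1_0).items():
        print(f"{me:6s} {fc:.2f}x")
    # Share of calls that are SVA in each release (over the three listed types).
    print(f"SVA share 1.0.0: {shares(v1_0_0)['SVA']:.1%}")
    print(f"SVA share 1.1.0: {shares(v1_1_0)['SVA']:.1%}")
```

The ratios come out to the 2.47x / 1.5x / 1.35x figures listed above, and the SVA share of calls rises between releases, which is the sense in which 1.1.0 makes "proportionally more" SVA calls.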

jonathancrabtree commented 1 year ago

16 essentially solves this problem. With respect to the SVA calls, the following two sets are roughly coincident:

jonathancrabtree commented 1 year ago

Closing for now; we can revisit if we decide that the polyA and overlapping SVA filters aren't doing a good enough job.