Closed by jonathancrabtree 1 year ago
This seems to be a more nuanced problem: simply adding back the Alu-like domain isn't sufficient. We tried replacing the MELT SVA reference with SVA_A and with SVA_F, and in both cases actually obtained fewer ME calls:
1149 1149 16966 SVA_A-loci.txt
1143 1143 16878 SVA_F-loci.txt
1219 1219 17989 all-loci.txt
This may have been at least partially addressed by the v1.1.0 release, which made proportionally more SVA calls compared to 1.0.0:
1.0.0 HG00514-CCS-PAV-MEs-v1.0.0.txt
128 SVA
166 LINE1
925 ALU
1220 MEIs total
1.1.0 - 90%/90%/100 bp
316 SVA (2.47x)
249 LINE1 (1.5x)
1253 ALU (1.35x)
1818 MEIs total
1.1.0 - 90%/90%/95 bp
349 SVA
254 LINE1
1257 ALU
1860 MEIs total
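For anyone who wants to re-check the fold changes and the "proportionally more SVA calls" claim, here is a minimal Python sketch. The counts are copied from the listings above; the dictionary layout and the helper names `fold_change` and `fractions` are only illustrative and not part of the pipeline. Fractions are computed from the sum of the three classes rather than from the reported "MEIs total" line.

```python
# Per-class ME call counts copied from the listings above.
counts = {
    "1.0.0": {"SVA": 128, "LINE1": 166, "ALU": 925},
    "1.1.0 - 90%/90%/100 bp": {"SVA": 316, "LINE1": 249, "ALU": 1253},
    "1.1.0 - 90%/90%/95 bp": {"SVA": 349, "LINE1": 254, "ALU": 1257},
}


def fold_change(new, old):
    """Ratio of new to old call counts for each ME class (hypothetical helper)."""
    return {me_class: new[me_class] / old[me_class] for me_class in old}


def fractions(class_counts):
    """Share of each ME class among the summed SVA + LINE1 + ALU calls."""
    total = sum(class_counts.values())
    return {me_class: n / total for me_class, n in class_counts.items()}


baseline = counts["1.0.0"]
for label, class_counts in counts.items():
    shares = fractions(class_counts)
    ratios = fold_change(class_counts, baseline)
    for me_class in ("SVA", "LINE1", "ALU"):
        print(
            f"{label}\t{me_class}\t{class_counts[me_class]}\t"
            f"{shares[me_class]:.1%}\t{ratios[me_class]:.2f}x"
        )
```

Comparing the SVA share per release, rather than the raw SVA count, separates a real shift toward SVA calls from the overall increase in MEIs between 1.0.0 and 1.1.0.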
Closing for now, can revisit if we decide that the polyA and overlapping SVA filters aren't doing a good enough job.
SVA identification isn't working because the SVA reference sequence does not contain the Alu-like domain (by design: it was causing too many spurious matches). SVA stands for SINE-VNTR-Alu, and there are 6 subfamilies.
From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3794087/: SVAs are composite elements consisting of multiple domains: a CCCTCT repeat, Alu-like domain, a GC-rich variable number of tandem repeat (VNTR), and SINE-R derived from the HERV-K LTR element [7-9, 21, 22]. They are flanked by target site duplications and terminate in a poly(A) tail (Fig. 1). In genomic sequence analysis, SVA elements are present in G + C-rich regions; however, they do not have any preferences for inter- or intragenic regions. SVA families are separated into six subfamilies (SVA-A to SVA-F), based upon point mutation and insertion and deletion events within the SINE-R [22]. Among them, four subfamilies (SVA-A, SVA-B, SVA-C, SVA-D) are present in gibbons and orangutans, while two subfamilies (SVA-E and SVA-F) are restricted to the human lineage.
Figure 1 from the above paper: