Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Linking LTRs and internal regions for ERV: Further postprocessing necessary? #288

Open osthomas opened 1 month ago

Dear all,

I am looking into ERVs in the mouse genome (GRCm39), and I am unsure which post-processing steps are relevant.

There is a script available to combine ERV LTRs with internal regions, based on the names of the elements (https://mobilednajournal.biomedcentral.com/articles/10.1186/1759-8753-5-13). However, I am not sure whether this is still required, or whether ProcessRepeats already does this.

Here is one example from the .out file in which LTRs and internal regions were linked already via the ID column:

  532    6.9  8.9  4.2  1             3122749   3123277 (192031002) + ERVB4_2-LTR_MM   LTR/ERVK                     1    553     (0)      98      
  689    5.8  9.9  0.8  1             3123278   3124349 (192029930) + ERVB4_2-I_MM     LTR/ERVK                     1   1972  (6402)      98      
  162    3.5  0.0  0.0  1             3124347   3124487 (192029792) + ERVB4_2-I_MM     LTR/ERVK                  5825   5965  (2409)      98 *    
 2116    6.8  5.8  0.8  1             3124478   3126963 (192027316) + RLTR45-int       LTR/ERVK                  1318   3204  (4040)      99      
  626    2.6  2.9  2.2  1             3126958   3127550 (192026729) + RLTR45-int       LTR/ERVK                  3390   3986  (3258)      99 *    
 2455    3.3  0.6  0.2  1             3127544   3129715 (192024564) + RLTR45-int       LTR/ERVK                  5065   7244     (0)      99      
  544    6.3  8.7  4.5  1             3129716   3130247 (192024032) + ERVB4_2-LTR_MM   LTR/ERVK                     1    553     (0)      99      

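In this particular case, joining by name would not even catch it: the RLTR45-int fragments share ID 99 with the downstream ERVB4_2-LTR_MM, although the two names have nothing in common.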
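To make the comparison concrete, here is a minimal Python sketch of how I would group annotations by the ID column instead of by name. It assumes the usual whitespace-separated .out layout, with the query sequence in column 5, begin/end in columns 6 and 7, the repeat name in column 10 and the ID in column 15. The function names (`parse_out_line`, `group_by_id`, `summarize`) are my own and not part of RepeatMasker, and keying groups on (sequence, ID) is a cautious assumption on my side.

```python
from collections import defaultdict

# The seven rows quoted above, copied from the .out file.
EXAMPLE_ROWS = """
  532    6.9  8.9  4.2  1  3122749  3123277 (192031002) + ERVB4_2-LTR_MM  LTR/ERVK     1   553    (0)  98
  689    5.8  9.9  0.8  1  3123278  3124349 (192029930) + ERVB4_2-I_MM    LTR/ERVK     1  1972 (6402)  98
  162    3.5  0.0  0.0  1  3124347  3124487 (192029792) + ERVB4_2-I_MM    LTR/ERVK  5825  5965 (2409)  98 *
 2116    6.8  5.8  0.8  1  3124478  3126963 (192027316) + RLTR45-int      LTR/ERVK  1318  3204 (4040)  99
  626    2.6  2.9  2.2  1  3126958  3127550 (192026729) + RLTR45-int      LTR/ERVK  3390  3986 (3258)  99 *
 2455    3.3  0.6  0.2  1  3127544  3129715 (192024564) + RLTR45-int      LTR/ERVK  5065  7244    (0)  99
  544    6.3  8.7  4.5  1  3129716  3130247 (192024032) + ERVB4_2-LTR_MM  LTR/ERVK     1   553    (0)  99
""".strip("\n").splitlines()


def parse_out_line(line):
    """Parse one .out row into a dict; return None for headers and blank lines."""
    fields = line.split()
    # Data rows start with the numeric SW score and have at least 15 columns.
    if len(fields) < 15 or not fields[0].isdigit():
        return None
    return {
        "seq": fields[4],
        "start": int(fields[5]),
        "end": int(fields[6]),
        "strand": fields[8],
        "name": fields[9],
        "family": fields[10],
        "id": fields[14],
    }


def group_by_id(lines):
    """Collect parsed rows under a (sequence, ID) key."""
    groups = defaultdict(list)
    for line in lines:
        rec = parse_out_line(line)
        if rec is not None:
            groups[(rec["seq"], rec["id"])].append(rec)
    return dict(groups)


def summarize(groups):
    """Report the genomic span and the distinct repeat names of each group."""
    summary = {}
    for key, recs in groups.items():
        summary[key] = {
            "start": min(r["start"] for r in recs),
            "end": max(r["end"] for r in recs),
            "names": sorted({r["name"] for r in recs}),
        }
    return summary


if __name__ == "__main__":
    for key, info in summarize(group_by_id(EXAMPLE_ROWS)).items():
        print(key, info["start"], info["end"], info["names"])
```

On the rows above this yields two groups, one per ID, and the ID 99 group contains both RLTR45-int and ERVB4_2-LTR_MM, which a name-based join would keep apart.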

ProcessRepeats seems to link LTRs and internal regions in some way already. Are there cases that ProcessRepeats might miss, which may benefit from further parsing?