Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

.out file contains repeat coordinates outside of possible range #155

Closed jamesdgalbraith closed 1 year ago

jamesdgalbraith commented 2 years ago

Describe the issue

Many hits in the ".out" file have repeat coordinates that fall outside the length of the query repeat. For example, this hit from the ".out" file is to a repeat whose consensus is 3217 bp long, yet it is reported as a hit to positions 4804-5218:

15803 11.6 1.8 1.5 ctg_1 35113158 35113313 (105938908) + DR0021180 LINE/RTE-BovB 4804 5218 (195) 43774

However, the likely corresponding line from the ".align" file does not have this issue:

15803 7.67 1.17 0.28 ctg_1 35110690 35112821 (105939400) DR0021180#LINE/RTE-BovB 1 2151 (1066) m_b606s001i18 43774

Of note, there are hits from other BovBs which overlap the coordinates (from .align):


2098 17.41 0.51 1.81 ctg_1 35112819 35113210 (105939011) DR0087524#LINE/RTE-BovB 1444 1830 (438) m_b606s001i21 43774

520 31.22 9.94 4.46 ctg_1 35112938 35113299 (105938922) DR0143386#LINE/RTE-BovB 4838 5218 (195) m_b606s001i22 43774

714 17.39 5.07 0.00 ctg_1 35113176 35113313 (105938908) DR0020736#LINE/RTE-BovB 2466 2610 (1) m_b606s001i23 43774
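To make the discrepancy concrete, here is a minimal Python sketch that parses the ".out" line quoted above and compares its repeat coordinates with the 3217 bp consensus length. The column layout is assumed to be the usual whitespace-delimited ".out" layout, with complement-strand hits listing (left), end, start. The names `OutHit`, `parse_out_line`, and `implied_consensus_length` are hypothetical helpers for illustration, not part of RepeatMasker.

```python
from typing import NamedTuple


class OutHit(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    family: str
    rep_start: int
    rep_end: int
    rep_left: int


def _unparen(field: str) -> int:
    """Convert a '(195)'-style field to an int."""
    return int(field.strip("()"))


def parse_out_line(line: str) -> OutHit:
    """Parse one data line of a .out file (assumed standard column order)."""
    f = line.split()
    strand = f[8]
    if strand == "+":
        # Forward strand: start, end, (left)
        rep_start, rep_end, rep_left = int(f[11]), int(f[12]), _unparen(f[13])
    else:
        # Complement strand ("C"), assumed order: (left), end, start
        rep_left, rep_end, rep_start = _unparen(f[11]), int(f[12]), int(f[13])
    return OutHit(f[4], int(f[5]), int(f[6]), strand, f[9],
                  rep_start, rep_end, rep_left)


def implied_consensus_length(hit: OutHit) -> int:
    """Consensus length implied by the hit: end position plus bases remaining."""
    return hit.rep_end + hit.rep_left


# The .out line from this issue and the consensus length reported for DR0021180.
line = ("15803 11.6 1.8 1.5 ctg_1 35113158 35113313 (105938908) + "
        "DR0021180 LINE/RTE-BovB 4804 5218 (195) 43774")
KNOWN_LENGTH = 3217

hit = parse_out_line(line)
out_of_range = hit.rep_end > KNOWN_LENGTH
print(hit.family, implied_consensus_length(hit), out_of_range)
```

For this line the implied consensus length is 5218 + 195 = 5413, which is larger than the 3217 bp consensus. By contrast, the ".align" line for DR0021180 implies 2151 + 1066 = 3217.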

The library used was from

https://www.dfam.org/releases/Dfam_3.5/families/Dfam.h5.gz

Reproduction steps

  1. Steps to reproduce the behavior, including the command lines given to the program

RepeatMasker -species reptilia -pa 32 genome.fasta

The genome is currently under embargo

Log output

Please paste or attach any and all log output, which includes useful information such as data file statistics and version numbers. An easy way to capture this is to redirect the log output to a file, e.g. RepeatMasker myseq.fa >& output.log

Environment (please include as much of the following information as you can find out):

Installed from website (version 4.1.2-p1)

Full Dfam as instructed by website

Linux 5.4.0-94-generic #106-Ubuntu

Additional context

This is on a species of snake. I suspect this may be due to a growing level of redundancy among the reptile Dfam sequences.

rmhubley commented 2 years ago

This is a "feature" of RepeatMasker and expected (albeit problematic) behavior. RepeatMasker has to deal with a wide range of TE library qualities, including incomplete consensi or missing details on closely related subfamilies. In such cases, alignment of the sole sequence model can produce complex overlapping and conflicting alignments to a single locus. RepeatMasker, or more specifically ProcessRepeats, attempts to adjudicate the overlapping alignments and in some cases will merge two or more related annotations to generate an estimated annotation. In this case, the start/end positions within this family/subfamily are estimated based on the combination of alignments.

Obviously this isn't ideal if you need to relate specific portions of this annotation to a TE model (consensus or pHMM); however, that is where the .cat file helps. The .out is more of an expert interpretation of the alignments, whereas the .cat file contains the actual alignments as found by the search engine. I typically use the .cat file to look deeper when an annotation is hard to interpret.

All that said, we are embarking on an effort to redesign the adjudication system. We want to make these "choices" among conflicting evidence more transparent, providing a rigorous confidence value for each call and offering alternative calls when there is low confidence among them. Hopefully this will clear up issues like this and make it easier for users to interpret the results.
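As a rough illustration of going back to the raw alignments when an annotation is hard to interpret, the Python sketch below lists the alignment header lines that overlap a given ".out" annotation on the query. It uses the four ".align" header lines quoted in this issue. The field order is inferred from those lines, and only forward-strand headers are handled. `parse_align_header` and `overlapping` are hypothetical helper names, not RepeatMasker functions.

```python
def parse_align_header(line):
    """Parse a forward-strand alignment header line (field order inferred from this issue)."""
    f = line.split()
    if "#" not in f[8]:
        # Complement-strand headers are assumed to differ and are not handled here.
        raise ValueError("only forward-strand headers are supported in this sketch")
    return {
        "query": f[4],
        "q_start": int(f[5]),
        "q_end": int(f[6]),
        "family": f[8].split("#")[0],
        "rep_start": int(f[9]),
        "rep_end": int(f[10]),
        "rep_left": int(f[11].strip("()")),
    }


def overlapping(headers, query, start, end):
    """Return the alignments on `query` whose interval overlaps [start, end]."""
    return [h for h in headers
            if h["query"] == query and h["q_start"] <= end and h["q_end"] >= start]


align_lines = [
    "15803 7.67 1.17 0.28 ctg_1 35110690 35112821 (105939400) DR0021180#LINE/RTE-BovB 1 2151 (1066) m_b606s001i18 43774",
    "2098 17.41 0.51 1.81 ctg_1 35112819 35113210 (105939011) DR0087524#LINE/RTE-BovB 1444 1830 (438) m_b606s001i21 43774",
    "520 31.22 9.94 4.46 ctg_1 35112938 35113299 (105938922) DR0143386#LINE/RTE-BovB 4838 5218 (195) m_b606s001i22 43774",
    "714 17.39 5.07 0.00 ctg_1 35113176 35113313 (105938908) DR0020736#LINE/RTE-BovB 2466 2610 (1) m_b606s001i23 43774",
]
headers = [parse_align_header(l) for l in align_lines]

# Query interval of the puzzling .out annotation from this issue.
hits = overlapping(headers, "ctg_1", 35113158, 35113313)
families = [h["family"] for h in hits]
for h in hits:
    print(h["family"], h["rep_start"], h["rep_end"], h["rep_left"])
```

With the lines quoted in this issue, the three overlapping alignments are to DR0087524, DR0143386, and DR0020736. The end position and remaining-bases value in the ".out" line (5218 and 195) match the DR0143386 alignment, which is consistent with the annotation's coordinates being estimated from a combination of alignments.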