Closed jamesdgalbraith closed 1 year ago
This is a "feature" of RepeatMasker and expected (albeit problematic) behavior. RepeatMasker has to deal with a wide range of TE library qualities, including incomplete consensi and missing details on closely related subfamilies. In such cases, alignment of the sole sequence model can produce complex, overlapping, and conflicting alignments to a single locus. RepeatMasker, or more specifically ProcessRepeats, attempts to adjudicate the overlapping alignments and in some cases will merge two or more related annotations to generate an estimated annotation. In this case, the start/end positions within the family/subfamily are estimated from the combination of alignments. Obviously this isn't ideal if you need to relate specific portions of the annotation to a TE model (consensus or pHMM); that is where the .cat file helps. The .out file is more of an expert interpretation of the alignments, whereas the .cat file contains the actual alignments as found by the search engine. I typically use the .cat file to look deeper when an annotation is hard to interpret.
All that said, we are embarking on an effort to redesign the adjudication system. We want to make these "choices" among conflicting evidence more transparent, provide a rigorous confidence value for each call, and offer alternative calls when confidence is low. Hopefully this will clear up issues like this one and make it easier for users to interpret the results.
Describe the issue
Many hits in the ".out" file report repeat coordinates that extend beyond the length of the query repeat. For example, this hit from the ".out" file is to a repeat whose consensus is 3217 bp long, yet it is reported as a hit to positions 4804-5218:
15803 11.6 1.8 1.5 ctg_1 35113158 35113313 (105938908) + DR0021180 LINE/RTE-BovB 4804 5218 (195) 43774
However, the likely corresponding line from the ".align" file does not have this issue:
15803 7.67 1.17 0.28 ctg_1 35110690 35112821 (105939400) DR0021180#LINE/RTE-BovB 1 2151 (1066) m_b606s001i18 43774
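As a rough check of the arithmetic in the two lines above, here is a minimal Python sketch that adds the model end position to the bracketed "left" value and compares the sum with the 3217 bp consensus length. The function names are hypothetical, and the column layouts (including the handling of complement-strand "C" records) are assumptions inferred from these two example lines, not a full parser for either format.

```python
# Hypothetical helper names; column layouts are assumptions based on the
# two example lines quoted in this issue.

CONSENSUS_LENGTH = 3217  # DR0021180 consensus length reported in this issue

OUT_LINE = ("15803 11.6 1.8 1.5 ctg_1 35113158 35113313 (105938908) + "
            "DR0021180 LINE/RTE-BovB 4804 5218 (195) 43774")
ALIGN_LINE = ("15803 7.67 1.17 0.28 ctg_1 35110690 35112821 (105939400) "
              "DR0021180#LINE/RTE-BovB 1 2151 (1066) m_b606s001i18 43774")


def _unparen(token):
    """Convert a token such as '(195)' to the integer 195."""
    return int(token.strip("()"))


def model_coords(tokens, strand):
    """Return (begin, end, left) from the three model-coordinate tokens."""
    if strand == "C":  # assumed complement order: (left) end begin
        left, end, begin = _unparen(tokens[0]), int(tokens[1]), int(tokens[2])
    else:              # forward order: begin end (left)
        begin, end, left = int(tokens[0]), int(tokens[1]), _unparen(tokens[2])
    return begin, end, left


def parse_out(line):
    """Parse one annotation line from the .out file."""
    f = line.split()
    begin, end, left = model_coords(f[11:14], f[8])
    return {"family": f[9], "begin": begin, "end": end, "left": left, "id": f[14]}


def parse_align_header(line):
    """Parse one alignment header line from the .align file."""
    f = line.split()
    if f[8] == "C":
        strand, name, coords = "C", f[9], f[10:13]
    else:
        strand, name, coords = "+", f[8], f[9:12]
    begin, end, left = model_coords(coords, strand)
    return {"family": name.split("#")[0], "begin": begin, "end": end,
            "left": left, "id": f[-1]}


def implied_length(rec):
    """Model length implied by a record: end position plus bases left."""
    return rec["end"] + rec["left"]


out_rec = parse_out(OUT_LINE)
align_rec = parse_align_header(ALIGN_LINE)
for label, rec in ((".out", out_rec), (".align", align_rec)):
    n = implied_length(rec)
    status = "consistent" if n == CONSENSUS_LENGTH else "inconsistent"
    print(f"{label}: id={rec['id']} {rec['begin']}-{rec['end']} "
          f"implies length {n} ({status})")
```

On these two lines the ".align" record adds up to 3217, matching the consensus, while the ".out" record adds up to 5413, so only the ".out" coordinates overshoot. Both records carry the same ID (43774).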
Of note, there are hits from other BovBs that overlap these coordinates (from .align):
The library used was from
https://www.dfam.org/releases/Dfam_3.5/families/Dfam.h5.gz
Reproduction steps
RepeatMasker -species reptilia -pa 32 genome.fasta
The genome is currently under embargo
Log output
Please paste or attach any and all log output, which contains useful information such as data file statistics and version numbers. An easy way to capture this is to redirect the log output to a file, e.g.
RepeatMasker myseq.fa >& output.log
Environment (please include as much of the following information as you can find out):
RepeatMasker version (RepeatMasker -v can be used to find this): installed from website (version 4.1.2-p1)
Libraries: full Dfam, as instructed by website
Operating system (uname -a and lsb_release -a can be used to find this): Linux 5.4.0-94-generic #106-Ubuntu
Additional context
This is on a species of snake. I suspect this may be due to growing redundancy among the reptile Dfam sequences.