Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Other
230 stars 50 forks source link

Different hits to repeats between BLASTN and RepeatMasker #203

Closed skagawa2 closed 1 year ago

skagawa2 commented 1 year ago

What do you want to know?

How to reproduce the hits from RepeatMasker (I used -e ncbi specifically) to debug why the output from BLASTN (default parameters) for a particular repeat sequence is different from the results in assembly.fasta.out{,.gff}.

Or if there are any differences between BLASTN (default parameters) and rmblastn in the context of RepeatMasker that might cause there to be significantly fewer hits (which also hit to different regions on the query) in the RepeatMasker output.

I apologize in advance if this was something stated in documentation that I had not found.

Helpful context

We are trying to annotate the draft assembly of Tenebrio molitor, which is known to have a high amount of a 142bp satellite sequence (Petitpierre et al. 1988). Our draft assembly was created from HiFi reads using hifiasm and purge_haplotigs.

We ran RepeatMasker with the following commands:

# RepeatModeler:
BuildDatabase -name <name> <genome.fasta>
RepeatModeler -database <name> -pa 32 -LTRStruct |& tee run.out

# RepeatMasker:
famdb.py families -ad --include-class-in-name --format fasta_name Insecta > Insecta_repeats.fasta
cat <name>-families.fa Insecta_repeats.fasta > custom_repeat_library.fasta
RepeatMasker -xsmall -no_is -html -gff -s -excln -lib custom_repeat_library.fasta -engine ncbi -pa 8 <genome.fasta>

The output shows that the satellite sequences make up 11.52% of the assembly, which is much higher than another species without this property, which makes sense. However, when we manually BLASTN the assembly for the published 142bp repeat, the hits are different than the ones in <genome.fasta>.out.

Note: L19313.1 (the published 142bp repeat) and TMSATE1 (in RepBase RepeatMasker) match for 138 out of 142 bp, so we treated them as essentially the same sequences. This was the only repeat within the repeat library that had a hit to L19313.1 for most of its length.

Here is an excerpt of <genome.fasta>.out:

   SW   perc perc perc  query       position in query              matching           repeat                  position in repeat
score   div. del. ins.  sequence    begin    end          (left)   repeat             class/family      begin    end    (left)       ID
  579    3.0  0.0  0.0  ptg000002l  11826910 11826975  (1368605) C TMSATE1            Satellite              (4)    138       73  19407  
  578    1.6  0.0  0.0  ptg000002l  11935721 11935784  (1259796) C TMSATE1            Satellite              (4)    138       75  19409  
  580    0.0  0.0  0.0  ptg000002l  12598447 12598508   (597072) C TMSATE1            Satellite             (80)     62        1  19451  
  564    3.1  0.0  0.0  ptg000003l  20330833 20330897   (917466) + TMSATE1            Satellite               74    138      (4)  54159  
  534    3.1  0.0  1.5  ptg000004l  23907096 23907161  (2110575) + TMSATE1            Satellite               74    138      (4)  90109  
  534    3.1  0.0  1.5  ptg000004l  23939930 23939995  (2077741) + TMSATE1            Satellite               74    138      (4)  90126  
  569    1.6  0.0  0.0  ptg000004l  24931520 24931582  (1086154) + TMSATE1            Satellite               76    138      (4)  90167  
  534    4.6  0.0  0.0  ptg000005l  25943612 25943676  (2545145) + TMSATE1            Satellite               74    138      (4) 129914  
  549    4.6  0.0  0.0  ptg000005l  26070212 26070276  (2418545) C TMSATE1            Satellite              (4)    138       74 129924  
  574    3.1  0.0  0.0  ptg000005l  26116036 26116100  (2372721) + TMSATE1            Satellite               74    138      (4) 129926  
  588    1.5  0.0  0.0  ptg000005l  26173300 26173364  (2315457) C TMSATE1            Satellite              (4)    138       74 129927  
  580    0.0  0.0  0.0  ptg000005l  26248302 26248363  (2240458) C TMSATE1            Satellite             (80)     62        1 129931  
  588    1.5  0.0  0.0  ptg000005l  26770691 26770755  (1718066) C TMSATE1            Satellite              (4)    138       74 129936  
  588    1.5  0.0  0.0  ptg000005l  26824772 26824836  (1663985) C TMSATE1            Satellite              (4)    138       74 129941  
  588    1.5  0.0  0.0  ptg000005l  26855728 26855792  (1633029) C TMSATE1            Satellite              (4)    138       74 129942  
  574    3.1  0.0  0.0  ptg000005l  26880601 26880665  (1608156) + TMSATE1            Satellite               74    138      (4) 129944
... (144 more lines)

Of note, the hits are to 73-75 to 138 or 1 to 62 on TMSATE1. Compare this to a manual BLASTN (default parameters) of the assembly with L19313.1 (blastn outfmt 6):

qseqid          sseqid          pident  length  mismatch gapopen qstart  qend    sstart          send            evalue          bitscore
L19313.1        ptg000002l      96.610  59      2        0       1       59      10806379        10806321        5.70e-20        99.0
L19313.1        ptg000002l      92.806  139     10       0       1       139     10806521        10806383        4.23e-51        202
L19313.1        ptg000002l      88.732  142     15       1       1       141     10806664        10806523        3.31e-42        172
L19313.1        ptg000002l      91.489  141     12       0       1       141     10806806        10806666        7.07e-49        195
L19313.1        ptg000002l      92.199  141     11       0       1       141     10806948        10806808        1.52e-50        200
L19313.1        ptg000002l      91.489  141     12       0       1       141     10807090        10806950        7.07e-49        195
L19313.1        ptg000002l      90.780  141     13       0       1       141     10807232        10807092        3.29e-47        189
L19313.1        ptg000002l      90.780  141     13       0       1       141     10807374        10807234        3.29e-47        189
L19313.1        ptg000002l      88.732  142     15       1       1       141     10807517        10807376        3.31e-42        172
L19313.1        ptg000002l      90.780  141     13       0       1       141     10807801        10807661        3.29e-47        189
L19313.1        ptg000002l      88.571  140     16       0       2       141     10807942        10807803        1.19e-41        171
L19313.1        ptg000002l      90.647  139     13       0       1       139     10808085        10807947        4.26e-46        185
L19313.1        ptg000002l      90.780  141     13       0       1       141     10808227        10808087        3.29e-47        189
L19313.1        ptg000002l      92.908  141     10       0       1       141     10808369        10808229        3.27e-52        206
L19313.1        ptg000002l      90.071  141     14       0       1       141     10808653        10808513        1.53e-45        183
... (246,774 more lines)

Here, the hits are along almost the entire length of L19313.1 (from base 1 to 141), and at a much higher frequency and density. Most regions have mismatches <10, this was just an output with columns 2 and 9,10 sorted.

We wanted to know if there was a reason why this discrepancy could be the case. Unfortunately, the currently published T. molitor assemblies don't have this same satellite repeat in as high of a frequency, so there isn't a public assembly we can direct you to, but we can send you our assembly if necessary. I'm not sure if this is an important fact, but almost all occurrences of this repeat occur at the end of our contigs.

rmhubley commented 1 year ago

Thanks for all the detailed information. I suspect you filtered the output in ".out:" to only show several examples of "TMSATE1" for me? I wonder if you might show a few of them with the full set of annotations flanking them? I am curious if TRF interfered with the full-length identification of that consensus. Satellites are tricky as they are handled in two different ways in RepeatMasker (currently). TRF is run prior to searches against the library to identify sequences that could lead to false positive matches to TE consensi. If there is a characterized satellite in the TE library it can be eclipsed by the TRF results. Let's see if this is the case.

skagawa2 commented 1 year ago

It seems like you were correct! I hadn't even considered that, my bad. I attached the output of grep -A10 -B10 "TMSATE1" <genome.fasta>.out and the FASTA sequence of some of the repeats that were flanking the TMSATE1 in the output. I wasn't able to figure out the repeat coordinates in the <genome.fasta>.out file because some of them seem to extend beyond the length of the repeat (I hope I'm not extracting the wrong sequences), so I couldn't try to stitch together the sequences to get back the original satellite sequence, but here are the flanking annotations and a FASTA file with the sequences of the repeats that seem to flank the TMSATE1 annotation.

First block of entries in grepA10B10_TMSATE1_TmolPHfasta.txt (two lines with TMSATE1):

 4564    2.8  0.1  0.1  ptg000002l  10887133 10896849  (2298731) C rnd-5_family-5643  Satellite            (317)   9714        1  19398 *
 4589    2.7  0.1  4.0  ptg000002l  10896424 10897125  (2298455) + rnd-3_family-417   Satellite                1    675      (0)  19399  
 4465    4.3  0.4  0.0  ptg000002l  10896985 10903729  (2291851) C rnd-5_family-5643  Satellite            (470)   6770        1  19400 *
 9038    2.6  0.1  0.1  ptg000002l  10903588 11321023  (1874557) + rnd-5_family-5643  Satellite                1 417728    (402)  19401  
 4728    2.5  4.9  0.0  ptg000002l  11321027 11321582  (1873998) + rnd-6_family-933   Unknown                  1    583      (0)  19402  
 9126    3.0  0.6  0.5  ptg000002l  11321587 11595043  (1600537) C rnd-5_family-5643  Satellite             (61) 273593        1  19403  
 1463    3.0  0.0  0.0  ptg000002l  11594902 11595067  (1600513) + rnd-3_family-417   Satellite                1    166    (405)  19404 *
   14   19.9  1.7  7.1  ptg000002l  11597489 11597547  (1598033) + (ATATTTT)n         Simple_repeat            1     56      (0)  19405  
 4981    2.5  0.0  0.9  ptg000002l  11598042 11601129  (1594451) + rnd-3_family-417   Satellite                1   3058    (470)  19404 *
 9120    3.1  0.6  0.5  ptg000002l  11600704 11826909  (1368671) C rnd-5_family-5643  Satellite            (470) 330786   104311  19406  
  579    3.0  0.0  0.0  ptg000002l  11826910 11826975  (1368605) C TMSATE1            Satellite              (4)    138       73  19407  
 9120    3.1  0.6  0.5  ptg000002l  11826976 11931159  (1264421) C rnd-5_family-5643  Satellite            (470) 104310        1  19406  
 4339    3.6  0.5  0.7  ptg000002l  11930735 11931262  (1264318) + rnd-3_family-417   Satellite                1    527     (61)  19408 *
 9009    2.8  0.4  0.4  ptg000002l  11931252 11935720  (1259860) + rnd-5_family-5643  Satellite                1   4472 (129497)  19408  
  578    1.6  0.0  0.0  ptg000002l  11935721 11935784  (1259796) C TMSATE1            Satellite              (4)    138       75  19409  
 9009    2.8  0.4  0.4  ptg000002l  11935785 12064670  (1130910) + rnd-5_family-5643  Satellite             4473 133459    (510)  19408  
 1397    4.3  0.0  0.0  ptg000002l  12064529 12064690  (1130890) + rnd-3_family-417   Satellite                1    162    (409)  19410 *
 4829    2.7  0.0  0.3  ptg000002l  12064682 12066857  (1128723) + rnd-5_family-5643  Satellite                1   2169    (470)  19411  
 1347    5.6  0.0  0.0  ptg000002l  12066716 12066877  (1128703) + rnd-3_family-417   Satellite                1    162    (409)  19412 *
 4810    2.8  0.0  0.0  ptg000002l  12066869 12068473  (1127107) + rnd-5_family-5643  Satellite                1   1605    (470)  19413  
 1420    3.7  0.0  0.0  ptg000002l  12068332 12068493  (1127087) + rnd-3_family-417   Satellite                1    162    (409)  19414 *
 4738    3.0  0.1  0.4  ptg000002l  12068485 12069950  (1125630) + rnd-5_family-5643  Satellite                1   1461    (464)  19415  
 3413    4.2  0.0  0.0  ptg000002l  12069809 12070352  (1125228) + rnd-3_family-417   Satellite                1    544    (409)  19416 *
 4786    3.0  0.0  0.3  ptg000002l  12070344 12072901  (1122679) + rnd-5_family-5643  Satellite                1   2551    (470)  19417  
 1347    5.6  0.0  0.0  ptg000002l  12072760 12072921  (1122659) + rnd-3_family-417   Satellite                1    162    (409)  19418 *
--
...

and here's a list of some of the sequences found near TMSATE1 with added length annotation (in repeats_detected_near_TMSATE1.fasta.txt):

>TMSATE1#Satellite @Tenebrio_molitor  <length: 142>
GAATTCTGTAATTCTTGCGTCGTTTTACTTCGAAATGTACAAGTTCCACGACGAAATTCC
GATTCGCACTTAGTTTTTCGTGATTCTACACAGTTGCGAGCGAAAAAACGTATTTAGAGG
AAAGTTAGCGTCTTGGAAACAG
>rnd-3_family-417#Satellite ( Recon Family Size = 33, Final Multiple Alignment Size = 33 )  <length: 571>
AAATTCCGACTCGCACTTAGTTTTTCGTGATCCTACAGAGTTGCGAGCGAAAAAACGTAT
TTAGAGGAAAGTTAGCGTCTTGGAACCTGGAATTCTGTAGTTCTTGCGTCGTTTTACTTC
GAAATGTACAAGTTCCACGACGAAATTCCGATTCGCACTTAGTTTTTCGTGATCCTACAC
AGTTGCGAGCGAAAAAACGTATTTAGAGGAAAGTTAGCGTCTTGGAACCTGGAATTCTGT
AATTCTTGCGTCGTTTTACTTCGAAATGTACAAGTTCCACGACGAAATTCCGATTCGCAC
TTAGTTTTTCGTGATCCTACACAGTTGCGAGCGAAAAAACGTATTTAGAGGAAAGTTAGC
GTCTTGGAACCTGGAATTCTGTAGTTCTTGCGTCGTTTTACTTCGAAATGTACAAGTTCC
ACGACGAAATTCCGATTCGCACTTAGTTTTTCGTGATCCTACACAGTTGCGAGCGAAAAA
ACCTATTTAGAGGAAAGTTAGCGTTTGAAAAACTAAGTGCGAATCGGAATTTCGTCGTGG
AACTTGTACATTTCGAAGTAAAACGACGCAA
>rnd-5_family-2118#Unknown ( Recon Family Size = 25, Final Multiple Alignment Size = 25 )  <length: 384>
ATATCGCCAAATCTCAAAAATCACACTAAAAACGTTCCGAATCGAAATTCTGTCGTTGAA
CTTGCTCATTTGTATTTAAAATATAGCAACAATTTATAGAGATCGCTGTTTTGAGGCGCC
AACATTTCTCTAAATTCCGAATCGAAAATTTTACCTTTAGACAAGTTTTTTTTCCGTAAA
CAGCTCGTATCAATCGACTGAAAGCCGCAAGTTATTATTTGCATGATTTATTGTACACTT
GAATGCGAAAATTTCAAAAATTCCTGTGTTTTAATGAGTGATATTACATATCAAGACAGG
CGGAATTTTGAAAACACCAACTGAAAGCTGCGCGCTGCCTCCGCGAGGCGGCGCGGCAGT
CGTCAGAAAACGATATCGGGCAAG
>rnd-5_family-5643#Satellite ( Recon Family Size = 17, Final Multiple Alignment Size = 17 )  <length: 1007>
CCTACAGAGTTGCGAGCGAAAAAACGTATTTAGAAGAAAGTTAGCGTCTTGGAACCTGGA
ATTCTGTAGTTCTTGCGTCGTTTTACTTCGAAATGTACAAGTTCCACGACGAAATTCCGA
TTCGCACTTAGTTTTTCGTGATCCTACACAGTTGCGAGCGAAAAAACGTATTTAGAGGAA
AGTTAGCGTCTTGGAACCTGGTATTCTGTAATTCTTGCGTCGTTTTACTTCGAAATGTAC
AAGTTCCACGACGAAATTCCGATTCGCACTTAGTTTTTCGTGATCCTACACAGTTGCGAG
CGAAAAAACGTATTTAGAGGAAAGTTAGCGTCTTGGAACCTGGAATTCTGTAGTTCTTGC
GTCGTTTTACTTCGAAATGTACAAGTTCCACGACGAAATTCCGATTCGCACTTAGTTTTT
CGTGATCCTACACAGTTGCGAGCGAAAAAACGTATTTAGAGGAAAGTTAGCGTCTTGGAA
CCTGGAATTCTGTAGTTCTTGCGTCGTTTTACTTCGAAATGTACAAGTTCCACGACGCTA
ACTTTCTTCTAAATACGTTTTTTCGCTCGCAACTGTGTAGGATCACGAAAAACTAAGTGC
GAATCGGAATTTCGTCGTGGAACTTGTACATTTCGAAGTAAAACGACGCAAGATATACAG
AATTCCAGGTTCCAAGACGCTAACTTTCCTCTAAATACGTTTTTTCGCTCGCAACTNTGT
AGGATCACGAAAAACTAAGTGCGAATCGGAATTTCGTCGTGGAACTTGTACATTTCGAAG
TAAAACGACGCAAGAATTACAGAATTCCAGGTTCCAAGACGCTAACTTTCCTCTAAATAC
GTTTTTTCGCTCGCAACTGTGTAGGATCACGAAAAACTAAGTGCGAATCGGAATTTCGTC
GTGGAACTTGTACATTTCGAAGTAAAACGACGCAAGAACTACAGAATTCCAGGTTNCAAG
ACGCTAACTCNCCTCTAAATACGTTTTTTCGCTCGCAACTGTGTAGG

Thank you for the help so far!

skagawa2 commented 1 year ago

I'm sorry I forgot to include this information in the original post, but the version used there was RepeatModeler v2.0.3 and RepeatMasker v4.1.2-p1. I have since re-run these with RepeatModeler v2.0.4 and RepeatMasker v4.1.4 and got similar results. I can include information from the updated run if requested.

rmhubley commented 1 year ago

Right. So what has happened is that RepeatModeler rediscovered TMSATE1 and due to the imperfect nature of tandem satellite copies, it created two mosaic families from the data:

rnd-3_family-417

   729  3.5  0.0  0.0 rnd-3_family-417#Satellite        1       85    (486) + TMSATE1                              54     138     (4)              
  1244  1.4  0.0  0.0 rnd-3_family-417#Satellite       90      227    (344) + TMSATE1                               1     138     (4) 
  1256  0.7  0.0  0.0 rnd-3_family-417#Satellite      232      369    (202) + TMSATE1                               1     138     (4)              
  1172  2.9  0.7  0.0 rnd-3_family-417#Satellite      374      511     (60) + TMSATE1                               1     139     (3)  
  590  0.0  0.0  0.0 rnd-3_family-417#Satellite      507      571      (0) C TMSATE1                            (63)      79      15                                     

And rnd-5_family-5643

   431  3.8  0.0  0.0 rnd-5_family-5643#Satellite        2       54    (953) + TMSATE1                              86     138     (4)              
  1244  1.4  0.0  0.0 rnd-5_family-5643#Satellite       59      196    (811) + TMSATE1                               1     138     (4)              
  1237  0.7  0.0  0.0 rnd-5_family-5643#Satellite      203      338    (669) + TMSATE1                               3     138     (4)              
  1244  1.4  0.0  0.0 rnd-5_family-5643#Satellite      343      480    (527) + TMSATE1                               1     138     (4)              
   471  1.9  0.0  0.0 rnd-5_family-5643#Satellite      485      537    (470) + TMSATE1                               1      53    (89)              
  1165  3.6  0.0  0.0 rnd-5_family-5643#Satellite      528      665    (342) C TMSATE1                             (4)     138       1              
  1243  1.4  0.0  0.0 rnd-5_family-5643#Satellite      670      807    (200) C TMSATE1                             (4)     138       1              
  1244  1.4  0.0  0.0 rnd-5_family-5643#Satellite      812      949     (58) C TMSATE1                             (4)     138       1              
   449  5.7  0.0  0.0 rnd-5_family-5643#Satellite      954     1006      (1) C TMSATE1                             (4)     138      86             

Tandem repeats (and satellites) are better handled by other types of repeat discovery tools such as Ultra, and TRF. When we curate a de-novo library we typically screen out Satellite sequences unless there is 1) a compelling reason to name the satellite sequence (significant coverage, scientifically interesting etc.) or 2) the standard method of identifying tandem repeats in the genome (TRF/Ultra) are not adequately annotating them in our RepeatMasker runs.

skagawa2 commented 1 year ago

Thank you! That makes a lot of sense. I'll reopen this if I have any more questions.