RepeatModeler2 duplicate sequences within the same alignment

Isaac1293 commented 4 years ago

Hi, I was trying to convert the stk output file into an HMM; however, I got an unexpected result with hmmbuild. It seems that there are duplicated sequences (same coordinates) within the same alignment, which means that HMMER could overestimate the probabilities of certain bases at certain positions. The commands I ran were:

BuildDatabase -name hbakeri nHp.2.0.fasta
RepeatModeler -database hbakeri -pa 2 -LTRStruct >& run.out

The error in hmmbuild is:

Alignment input parse error: duplicate seq name nHp.2.0.scaf00731:62157-62226 while reading Stockholm file hbakeri-families.stk at or near line 11855

Using sed, I found that the first and the last sequences in the output are duplicated:

sed '11836,11855!d' hbakeri-families.stk

The ID is nHp.2.0.scaf00731:62157-62226 and here is the stk file hbakeri-families.stk.tar.gz

rmhubley commented 4 years ago

Thanks for the report and the data. I have verified that the duplicate was generated by the RepeatScout portion of the pipeline, and I am looking into why that happened. Would you be able to make available the nHp.2.0.fasta file and the random number seed printed near the top of your run.out file? The line would look something like this: "Random Number Seed: 1595354491"

Isaac1293 commented 4 years ago

The random number seed line is: Random Number Seed: 1580943197

The fasta file is bigger than 10 MB even as a tar.gz file. Is there another way to share it with you? It can also be downloaded with:

wget ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/current/species/heligmosomoides_polygyrus/PRJEB15396/heligmosomoides_polygyrus.PRJEB15396.WBPS14.genomic.fa.gz

rmhubley commented 4 years ago

Thanks for the files! This was a difficult one to track down. I don't expect this to happen very often. It's the result of the way representative sequences are generated for RepeatScout consensi. Unfortunately, RepeatScout in its current form does not provide the sequence ranges it used to generate a given family consensus, so RepeatModeler has to use the consensus to go back and find examples in the genomic sample used by RepeatScout. There is a chance that you can retrieve two alternative overlapping alignments in the initial stages of refinement, only to end up with two identical subregions remaining at the end of the refinement steps. We are working on a new version of RepeatScout with lots of changes, including the ability to generate the sequence range list, so in the long term this problem will go away. For now, I encoded a check in our Refiner script to detect these duplicates and remove them before generating the Stockholm file. You could also delete this identical sequence yourself from that one file (and adjust the "SQ" sequence count in the header) and be just fine. Once I have done some more testing, I will check in the RepeatModeler changes to GitHub and push out a release.
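The manual workaround described above (delete the repeated sequence, then adjust the "SQ" count) can be sketched in Python. This is only an illustration and not the check added to the Refiner script: the function names are hypothetical, and it assumes each sequence occupies a single line per alignment (non-interleaved), that alignments end with "//", and that the count is stored on a "#=GF SQ <count>" line.

```python
import re


def dedupe_alignment(lines):
    """Drop rows whose sequence name was already seen in this alignment."""
    seen, kept, dropped = set(), [], []
    for line in lines:
        stripped = line.strip()
        # Sequence rows are assumed to be anything that is not blank,
        # not markup ("#...") and not the "//" terminator.
        is_seq = bool(stripped) and not line.startswith("#") and stripped != "//"
        if is_seq:
            name = line.split()[0]
            if name in seen:
                dropped.append(name)  # dropped by name; the row text is not compared
                continue
            seen.add(name)
        kept.append(line)
    if dropped:
        # Only touch the SQ count when something was actually removed.
        count = str(len(seen))
        kept = [
            re.sub(r"^(#=GF\s+SQ\s+)\d+", lambda m: m.group(1) + count, l)
            for l in kept
        ]
    return kept, dropped


def dedupe_stockholm(text):
    """Apply dedupe_alignment to every '//'-terminated alignment in text."""
    out, removed, block = [], [], []
    for line in text.splitlines():
        block.append(line)
        if line.strip() == "//":
            kept, dropped = dedupe_alignment(block)
            out.extend(kept)
            removed.extend(dropped)
            block = []
    if block:  # trailing lines without a terminator are processed as well
        kept, dropped = dedupe_alignment(block)
        out.extend(kept)
        removed.extend(dropped)
    return "\n".join(out) + "\n", removed
```

One way to use it would be to read hbakeri-families.stk, pass its contents to dedupe_stockholm, write the returned text to a new file, and review the list of removed names before running hmmbuild on the result.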

Isaac1293 commented 4 years ago

Thank you for your reply. As you suggested, I removed the duplicate sequences and edited the SQ line. However, so far I can't run RepeatMasker with the hmm file. I have been using:

hmmbuild hbakeri-families.hmm hbakeri-families.stk

then

RepeatMasker -gff -e hmm -library hbakeri-families.hmm nHp.2.0.fasta

But I get:

WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac/Software/hmmer-3.3/src/nhmmscan --cut_ga --cpu 2 /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/ hbakeri-families.hmm /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/nHp.2.0.fasta_batch-1.masked

Can you help me with this? Regards,

jebrosen commented 4 years ago

RepeatMasker -gff -e hmm -library hbakeri-families.hmm nHp.2.0.fasta

What version of RepeatMasker are you using? RepeatMasker 4.1.0 does not recognize either -e hmm (it should be -e hmmer) or -library (it should be -lib).

WARNING: The search engine returned an error (1, status = 1 ) Engine parameters: /home/isaac/Software/hmmer-3.3/src/nhmmscan --cut_ga --cpu 2 /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/ hbakeri-families.hmm /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/nHp.2.0.fasta_batch-1.masked

Is this the exact output? There is a space here that looks out of place and might be related to the problem: RM_645603.WedJul221853152020/ hbakeri-families.hmm

Isaac1293 commented 4 years ago

You are right, I am using RepeatMasker 4.1.0. As you suggested, I edited my command line:

RepeatMasker -gff -lib hbakeri-families.hmm -e hmmer nHp.2.0.fasta

The exact error is:

analyzing file nHp.2.0.fasta
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.
WARNING: Retrying batch ( 1 ) [ 255,, 59625]...
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.
WARNING: Retrying batch ( 1 ) [ 255,, 59625]...
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.

FATAL ERROR: RepeatMasker giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. Please submit the following details to the feedback page at the repeatmasker website:

http://www.repeatmasker.org

RepeatMasker Version: 4.1.0
Library Version: HMM-Dfam_3.2
Search Engine: hmmer [ 3.3 (Nov 2019) ]
Command Line: /home/isaac93/software/RepeatMasker/RepeatMasker-gff -lib hbakeri-families.hmm -e hmmer nHp.2.0.fasta
Batch Number: 1
Disk Space:
Filesystem               1K-blocks  Used       Available  Use%  Mounted on
/dev/mapper/ubuntu-home  528316088  494420376  7035784    99%   /home

System Memory:
MemTotal: 65853192 kB
MemFree: 50249240 kB
MemAvailable: 61169188 kB
Cached: 9179312 kB
SwapCached: 0 kB
SwapTotal: 66703356 kB
SwapFree: 66703356 kB

Further details about this problem may be found in the directory: /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020

As I mentioned above, I had some issues with duplicated sequences in my stk file from RepeatModeler, so I removed the duplicated sequences and edited the SQ count in the header. Here is the header of my hbakeri-families.hmm file:

HMMER3/f [3.3 | Nov 2019]
NAME  rnd-1_family-572
DESC  RepeatModeler Generated - rnd-1_family-572, RepeatScout: [ Index = R=149, RS Size = 159, Refiner Input Size = 100, Final Multiple Alignment Size = 100 ]
LENG  781
MAXL  982
ALPH  DNA
RF    yes
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Mon Aug 10 10:32:45 2020
NSEQ  115
EFFN  6.931305
CKSUM 613907307
STATS LOCAL MSV -11.6939 0.69612
STATS LOCAL VITERBI -13.2212 0.69612
STATS LOCAL FORWARD -6.2943 0.69612
HMM A C G T
    m->m m->i m->d i->m i->i d->m d->d
  COMPO 1.52761 1.31062 1.40923 1.31319
        1.38629 1.38629 1.38629 1.38629
        0.27984 3.90934 1.49594 1.46634 0.26236 0.00000 *
      1 2.35315 0.38325 2.46305 1.97967 1 c x - -
        1.38629 1.38629 1.38629 1.38629
        0.05170 3.68121 3.68121 1.46634 0.26236 1.21158 0.35343
      2 2.75494 0.24505 2.89443 2.31880 2 c x - -
        1.38629 1.38629 1.38629 1.38629
        0.04211 3.88167 3.88167 1.46634 0.26236 1.51194 0.24908
      3 1.38690 2.03167 2.09523 0.70120 3 t x - -

Thanks for your support. Regards

Dfam-consortium / RepeatModeler

RepeatModeler2 duplicate sequences within the same alignment #93