Open Isaac1293 opened 4 years ago
Thanks for the report and the data. I have verified that the duplicate was generated by the RepeatScout portion of the pipeline, and I am looking into why that happened. Would you be able to make available the nHp.2.0.fasta file and the random number seed printed near the top of your run.out file? The line would look something like this: "Random Number Seed: 1595354491"
The random number seed is:
Random Number Seed: 1580943197
The fasta file is bigger than 10 MB even as a tar.gz file. Is there another way to share it with you?
It can also be downloaded with:
wget ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/current/species/heligmosomoides_polygyrus/PRJEB15396/heligmosomoides_polygyrus.PRJEB15396.WBPS14.genomic.fa.gz
Thanks for the files! This was a difficult one to track down, and I don't expect it to happen very often. It's the result of the way representative sequences are generated for RepeatScout consensi. Unfortunately, RepeatScout in its current form does not provide the sequence ranges it used to generate a given family consensus, so RepeatModeler has to use the consensus to go back and find examples in the genomic sample used by RepeatScout. There is a chance that you retrieve two alternative overlapping alignments in the initial stages of refinement, only to end up with two identical subregions remaining at the end of the refinement steps. We are working on a new version of RepeatScout with lots of changes, including the ability to generate the sequence range list, so in the long term this problem will go away. For now I have encoded a check in our Refiner script to detect these duplicates and remove them before generating the Stockholm file. You could also delete this identical sequence yourself from that one file (and adjust the "SQ" sequence count in the header) and be just fine. Once I have done some more testing I will check in the RepeatModeler changes to GitHub and push out a release.
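For anyone who wants to apply the manual workaround described above without editing by hand, here is a minimal sketch in Python. It is not the check that was added to the Refiner script. It assumes that each alignment in the .stk file ends with a "//" line, that every sequence sits on a single line (not interleaved), and that the count lives on a line of the form "#=GF SQ <n>"; the function names are made up for illustration.

```python
import re


def _clean_block(block):
    """Remove repeated sequence names from one alignment and fix its SQ count."""
    seen = set()
    kept = []
    for line in block:
        stripped = line.strip()
        is_seq = bool(stripped) and not stripped.startswith("#") and stripped != "//"
        if is_seq:
            name = stripped.split()[0]
            if name in seen:
                continue  # drop the repeated sequence line
            seen.add(name)
        kept.append(line)
    count = str(len(seen))
    # Assumption: the sequence count is stored as "#=GF SQ <n>".
    return [
        re.sub(r"^(#=GF\s+SQ\s+)\d+", lambda m: m.group(1) + count, line)
        for line in kept
    ]


def dedupe_stockholm(text):
    """Return the Stockholm text with duplicates removed in every alignment."""
    out = []
    block = []
    for line in text.splitlines():
        block.append(line)
        if line.strip() == "//":  # end of one alignment
            out.extend(_clean_block(block))
            block = []
    out.extend(_clean_block(block))  # any trailing lines without a terminator
    return "\n".join(out) + "\n"
```

A possible use is to read hbakeri-families.stk, pass its contents to dedupe_stockholm, and write the result to a new file before running hmmbuild, keeping the original file as a backup.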
Thank you for your reply,
As you suggested, I removed the duplicate sequences and edited the SQ line. However, so far I can't run RepeatMasker with the hmm file. I have been using:
hmmbuild hbakeri-families.hmm hbakeri-families.stk
then
RepeatMasker -gff -e hmm -library hbakeri-families.hmm nHp.2.0.fasta
But I get:
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac/Software/hmmer-3.3/src/nhmmscan --cut_ga --cpu 2 /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/ hbakeri-families.hmm /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/nHp.2.0.fasta_batch-1.masked
Can you help me with this? Regards,
RepeatMasker -gff -e hmm -library hbakeri-families.hmm nHp.2.0.fasta
What version of RepeatMasker are you using? RepeatMasker 4.1.0 does not recognize either -e hmm (it should be -e hmmer) or -library (it should be -lib).
WARNING: The search engine returned an error (1, status = 1 ) Engine parameters: /home/isaac/Software/hmmer-3.3/src/nhmmscan --cut_ga --cpu 2 /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/ hbakeri-families.hmm /home/isaac/Documentos/Annotation_Files/Hb_annotation_V2/Repmask_custom/RepeatModler/RM_645603.WedJul221853152020/nHp.2.0.fasta_batch-1.masked
Is this the exact output? There is a space here that looks out of place and might be related to the problem: RM_645603.WedJul221853152020/ hbakeri-families.hmm
You are right, I am using RepeatMasker 4.1.0.
As you suggest, I edited my command line:
RepeatMasker -gff -lib hbakeri-families.hmm -e hmmer nHp.2.0.fasta
The exact error is:
analyzing file nHp.2.0.fasta
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.
WARNING: Retrying batch ( 1 ) [ 255,, 59625]...
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.
WARNING: Retrying batch ( 1 ) [ 255,, 59625]...
identifying Simple Repeats in batch 1 of 13665
identifying matches to hbakeri-families.hmm sequences in batch 1 of 13665
WARNING: The search engine returned an error (1, status = 1 )
Engine parameters: /home/isaac93/anaconda3/envs/py.3.7/bin/nhmmscan --cut_ga --cpu 2 /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/hbakeri-families.hmm /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020/nHp.2.0.fasta_batch-1.masked
A search phase could not complete on this batch. The batch file will be re-run and if possible the program will resume.
FATAL ERROR: RepeatMasker giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. Please submit the following details to the feedback page at the repeatmasker website:
RepeatMasker Version: 4.1.0
Library Version: HMM-Dfam_3.2
Search Engine: hmmer [ 3.3 (Nov 2019) ]
Command Line: /home/isaac93/software/RepeatMasker/RepeatMasker-gff -lib hbakeri-families.hmm -e hmmer nHp.2.0.fasta
Batch Number: 1
Disk Space:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/ubuntu-home 528316088 494420376 7035784 99% /home
System Memory:
MemTotal: 65853192 kB
MemFree: 50249240 kB
MemAvailable: 61169188 kB
Cached: 9179312 kB
SwapCached: 0 kB
SwapTotal: 66703356 kB
SwapFree: 66703356 kB
Further details about this problem may be found in the directory: /home/isaac93/Repeatmasker-test/RM_193460.MonAug101112252020
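The RepeatMasker warning reports only the exit status of the search engine, so one way to see the underlying nhmmscan message is to run the logged command by hand. The sketch below pulls the command out of a pasted log. It assumes the log layout shown in this thread, where the command follows "Engine parameters:" and is followed by "A search phase could not complete"; the function name is made up for illustration and is not part of RepeatMasker.

```python
import shlex


def engine_commands(log_text):
    """Return each distinct logged engine command as an argument list."""
    marker = "Engine parameters:"
    # Assumption: this sentence follows the command, as in the log above.
    end = "A search phase could not complete"
    seen = []
    pos = 0
    while True:
        start = log_text.find(marker, pos)
        if start == -1:
            break
        start += len(marker)
        stop = log_text.find(end, start)
        if stop == -1:
            stop = len(log_text)
        # Collapse line breaks and repeated spaces into single spaces.
        command = " ".join(log_text[start:stop].split())
        if command and command not in seen:
            seen.append(command)
        pos = stop
    return [shlex.split(command) for command in seen]
```

Each returned list can be passed to subprocess.run(cmd, capture_output=True, text=True), after which the stderr attribute of the result should hold whatever nhmmscan printed, provided the temporary RM_ directory named in the log still exists.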
As I mentioned above, I had some issues with duplicated sequences in my stk file from RepeatModeler, so I removed the duplicated sequences and edited the SQ count in the header. Here is the header of my hbakeri-families.hmm file:
HMMER3/f [3.3 | Nov 2019]
NAME rnd-1_family-572
DESC RepeatModeler Generated - rnd-1_family-572, RepeatScout: [ Index = R=149, RS Size = 159, Refiner Input Size = 100, Final Multiple Alignment Size = 100 ]
LENG 781
MAXL 982
ALPH DNA
RF yes
MM no
CONS yes
CS no
MAP yes
DATE Mon Aug 10 10:32:45 2020
NSEQ 115
EFFN 6.931305
CKSUM 613907307
STATS LOCAL MSV -11.6939 0.69612
STATS LOCAL VITERBI -13.2212 0.69612
STATS LOCAL FORWARD -6.2943 0.69612
HMM A C G T
m->m m->i m->d i->m i->i d->m d->d
COMPO 1.52761 1.31062 1.40923 1.31319
1.38629 1.38629 1.38629 1.38629
0.27984 3.90934 1.49594 1.46634 0.26236 0.00000 *
1 2.35315 0.38325 2.46305 1.97967 1 c x - -
1.38629 1.38629 1.38629 1.38629
0.05170 3.68121 3.68121 1.46634 0.26236 1.21158 0.35343
2 2.75494 0.24505 2.89443 2.31880 2 c x - -
1.38629 1.38629 1.38629 1.38629
0.04211 3.88167 3.88167 1.46634 0.26236 1.51194 0.24908
3 1.38690 2.03167 2.09523 0.70120 3 t x - -
Thanks for your support. Regards,
Hi, I was trying to convert the stk output file into an HMM; however, I got an interesting result with hmmbuild. It seems there are duplicated sequences (same coordinates) within the same alignment, which means that HMMER could overestimate the probabilities of certain bases at certain positions.
BuildDatabase -name hbakeri nHp.2.0.fasta
RepeatModeler -database hbakeri -pa 2 -LTRStruct >& run.out
The error in hmmbuild is:
Alignment input parse error: duplicate seq name nHp.2.0.scaf00731:62157-62226 while reading Stockholm file hbakeri-families.stk at or near line 11855
Using sed, I found that the first and the last sequences in the output are duplicated:
sed '11836,11855!d' hbakeri-families.stk
The ID is nHp.2.0.scaf00731:62157-62226 and here is the stk file hbakeri-families.stk.tar.gz
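To check whether other families in the same file are affected before running hmmbuild again, a scan along these lines could list every repeated name with its line numbers. This is only a sketch: it assumes one line per sequence (not interleaved) and that each alignment ends with "//", and the function name is made up for illustration.

```python
def find_duplicate_names(lines):
    """Yield (name, first_line, duplicate_line) using 1-based line numbers."""
    first_seen = {}
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped == "//":
            first_seen = {}  # names may legitimately recur in the next alignment
            continue
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and Stockholm markup
        name = stripped.split()[0]
        if name in first_seen:
            yield name, first_seen[name], lineno
        else:
            first_seen[name] = lineno
```

Opening hbakeri-families.stk and passing the file object to find_duplicate_names would report pairs of line numbers comparable to the range printed by the sed command above.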