Open Knnnk opened 2 weeks ago
The part I’m still confused about is: when I ran RepeatMasker the first time, it didn’t report any errors, but it also didn’t output any GFF, HTML, or masked FA files. This was the command I used during the first run of RepeatMasker:
RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Penaeus_indicus.fa
。I will attach the output log as a file. Thank you again!
Hi, dear RepeatMasker developers @rmhubley @jebrosen @diekhans ,
I apologize if my previous description was unclear. I am encountering a similar issue once again. I would greatly appreciate any assistance you could provide. I split the P. indicus genome into 12 pieces with similar disk usage, all of which have a bunch of complete sequences of chromosomes and contigs.
$ grep -c ^\> Splited/*.fa
Splited/part_10.fa:333
Splited/part_11.fa:223
Splited/part_12.fa:1860
Splited/part_1.fa:2222
Splited/part_2.fa:2222
Splited/part_3.fa:1778
Splited/part_4.fa:444
Splited/part_5.fa:334
Splited/part_6.fa:444
Splited/part_7.fa:334
Splited/part_8.fa:444
Splited/part_9.fa:333
I ran RepeatMasker for each part separately with this command:
RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Splited/part_1.fa
This time, I found that only the first part (Splited/part_1.fa
) had an issue. The error message in its output was:
"……after many rounds of repeat identifying……
identifying Simple Repeats in batch 2301 of 2304
processing output:
cycle 1 .......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... cycle 2 .......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... cycle 3 .................................................................................................................................................................................................................................................................................................................................................................................................................................................."
The program ends with only a "part_1.fa.cat.gz" file. So I resun the ProcessRepeats Process again:
ProcessRepeats -lib all.lib -maskSource Splited/part_1.fa -gff part_1.fa.cat.gz
and the error is the same as I mentioned 4 days before:
Illegal division by zero at ~/miniconda3/envs/repeat/bin/ProcessRepeats line 7999.
It seems that I have encountered an issue that I am unable to understand or resolve on my own. I would be sincerely grateful for any assistance.
Best regards, Nik
Can you share your part_1.fa.cat.gz file with me?
I have uploaded it to Google Drive. Please check the link: https://drive.google.com/file/d/18r65XMBZFI4ctUEScO3vWNNGgyxlHzgm/view?usp=sharing
Ok...this exercises a bug in ProcessRepeats that is pretty rare. I have fixed the problem and it will make it into the next release due out in a few days. Unfortunately, in looking through your data I think you have a much bigger problem with your library ("all.lib"). You appear to have some non-unique family identifies in this file. For example:
linear;#DNA/CMC-Chapaev
linear;#DNA/MULE-MuDR
linear;#LINE/Penelope
...
These sequence identifiers are parsed as "family_id#class/subclass". The family_ID in these three cases is "linear;" which is probably causing all sorts of chaos in the adjudication algorithms of ProcessRepeats. Unfortunately, the sanity checking portion of RepeatMasker is not checking for unique family_ids, rather it's considering the complete id ("family_id#class/subclass") when looking for duplicates in your -lib file. So it's not warning you up-front about this issue. You may also want to check that you haven't mixed more than one RepeatModeler library together ( 'rnd-#_family-#' familiy ids) as these auto-generated family names will get re-used with each subsequent run. Finally, I am a bit confused as to why you are getting such strong matches to simple repeats. Take this alignment for example:
459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200 m_b1945s001i316
Pin8879 947 TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT 996
i i i i i
C linear;#DNA/C 1278 TAATATATATATATATATATATATATATATATATATATATATATATATAT 1229
Pin8879 997 ATATATATATACATATACATATATATATA 1025
i i
C linear;#DNA/C 1228 ATATATATATATATATATATATATATATA 1200
It looks as if you used the "-nolow" flag, yet I don't see that in your command-line above. Typically RepeatMasker would have identified this region first as a perfect tandem repeat, keeping it from being matched to your "linear;#DNA/Crypton-H" family (in this case). I see in your log that simple repeat searching is being run, yet when I try this on my copy of RepeatMasker I see that TRF finds this and masks it, preventing any matches to anything in my *.lib file. I would like to followup on that. Could you run a simple test for me?
# Create a small sequence file "test.fa" containing the following record
>seq1-Pin8879
ACGTGCGGTAGGACTGATCTAGTCAGTGGCTAGCTGCTGGATGC
TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT
ATATATATATATATATATATATATATATAGACGTGCGACGATGTAGCT
AGACTACGTGCGCGCTATCGTGCTGATCATGCTGCTAGCTGATC
# Now run RepeatMasker with your all.lib library
RepeatMasker -e ncbi -html -gff -dir ./ -lib all.lib test.fa
This is what your test.fa.out file should look like:
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
78 0.0 0.0 0.0 seq1-Pin8879 58 123 (63) + (TA)n Simple_repeat 1 66 (0) 1
A single match to a TRF result and not:
459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200
Thank you for your attention and follow-up! Strangely, I didn't use the -nolow flag in my run. And this is the test output:
RepeatMasker version 4.1.5
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Custom Repeat Library: all.lib
analyzing file test.fa
Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to all.lib sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1
cycle 2
cycle 3
cycle 4
cycle 5
cycle 6
cycle 7
cycle 8
cycle 9
cycle 10
Generating output...
masking
done
The test.fa.out file is exactly the TRF matching result:
$ cat test.fa.out
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
78 0.0 0.0 0.0 seq1-Pin8879 58 123 (63) + (TA)n Simple_repeat 1 66 (0) 1
Well, that looks good. Without access to your sequence and library it's hard to reproduce these alignments. My suggestion is to use the latest version of RepeatMasker ( 4.1.7-p1 ), fix the IDs in your library, and run a test sequence ( say 10MB of your genome ). Check the *.alignments file to see if you are getting a ton of false matching to simple repeats to these families that contain small sections of simple repeats. If it looks good, then rerun the full genome.
Thank you very much for your patient guidance and help. I am a bit occupied at the moment, but I will let you know as soon as there are updates.
Hi, I am using RepeatMasker v4.1.5 to mask repeats from a shrimp genome P. indicus, the *.fa.cat.gz was produced but the fa file cannot be masked. So I rerun the
ProcessRepeats
step:ProcessRepeats -lib all.lib -html -gff -maskSource Penaeus_indicus.fa Penaeus_indicus.fa.cat.gz
But the program finished without any .masked, .gff, .html produced, which is confusing. The output message is as below, with many dots omitted.processing output: cycle 1 ........ cycle 2 ........ cycle 3............Illegal division by zero at /gpfshddpool/home/nikuo/miniconda3/envs/repeat/bin/ProcessRepeats line 7999. ..................................................................
Environment (please include as much of the following information as you can find out):
How did you install RepeatMasker?
bioconda
The output of
RepeatMasker -v
RepeatMasker version 4.1.5
uname -a
andlsb_release -a
can be used to find this.Linux mgt01 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Thanks in advance for your help!
Best Regards, Nik