Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Other
226 stars 49 forks source link

Illegal division by zero at ProcessRepeats line 7999. #275

Open Knnnk opened 2 weeks ago

Knnnk commented 2 weeks ago

Hi, I am using RepeatMasker v4.1.5 to mask repeats from a shrimp genome P. indicus, the *.fa.cat.gz was produced but the fa file cannot be masked. So I rerun the ProcessRepeats step: ProcessRepeats -lib all.lib -html -gff -maskSource Penaeus_indicus.fa Penaeus_indicus.fa.cat.gz But the program finished without any .masked, .gff, .html produced, which is confusing. The output message is as below, with many dots omitted. processing output: cycle 1 ........ cycle 2 ........ cycle 3............Illegal division by zero at /gpfshddpool/home/nikuo/miniconda3/envs/repeat/bin/ProcessRepeats line 7999. ..................................................................

Environment (please include as much of the following information as you can find out):

RepeatMasker version 4.1.5

Linux mgt01 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance for your help!

Best Regards, Nik

Knnnk commented 2 weeks ago

The part I’m still confused about is: when I ran RepeatMasker the first time, it didn’t report any errors, but it also didn’t output any GFF, HTML, or masked FA files. This was the command I used during the first run of RepeatMasker: RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Penaeus_indicus.fa。I will attach the output log as a file. Thank you again!

Knnnk commented 2 weeks ago

out.60584.log

Knnnk commented 2 weeks ago

Hi, dear RepeatMasker developers @rmhubley @jebrosen @diekhans ,

I apologize if my previous description was unclear. I am encountering a similar issue once again. I would greatly appreciate any assistance you could provide. I split the P. indicus genome into 12 pieces with similar disk usage, all of which have a bunch of complete sequences of chromosomes and contigs.

$ grep -c ^\> Splited/*.fa
Splited/part_10.fa:333
Splited/part_11.fa:223
Splited/part_12.fa:1860
Splited/part_1.fa:2222
Splited/part_2.fa:2222
Splited/part_3.fa:1778
Splited/part_4.fa:444
Splited/part_5.fa:334
Splited/part_6.fa:444
Splited/part_7.fa:334
Splited/part_8.fa:444
Splited/part_9.fa:333

I ran RepeatMasker for each part separately with this command:

RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Splited/part_1.fa

This time, I found that only the first part (Splited/part_1.fa) had an issue. The error message in its output was:

"……after many rounds of repeat identifying……

identifying Simple Repeats in batch 2301 of 2304

processing output:

cycle 1 .......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... cycle 2 .......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... cycle 3 .................................................................................................................................................................................................................................................................................................................................................................................................................................................."

The program ends with only a "part_1.fa.cat.gz" file. So I resun the ProcessRepeats Process again:

ProcessRepeats -lib all.lib -maskSource Splited/part_1.fa -gff part_1.fa.cat.gz

and the error is the same as I mentioned 4 days before:

Illegal division by zero at ~/miniconda3/envs/repeat/bin/ProcessRepeats line 7999.

It seems that I have encountered an issue that I am unable to understand or resolve on my own. I would be sincerely grateful for any assistance.

Best regards, Nik

rmhubley commented 1 week ago

Can you share your part_1.fa.cat.gz file with me?

Knnnk commented 1 week ago

I have uploaded it to Google Drive. Please check the link: https://drive.google.com/file/d/18r65XMBZFI4ctUEScO3vWNNGgyxlHzgm/view?usp=sharing

rmhubley commented 1 week ago

Ok...this exercises a bug in ProcessRepeats that is pretty rare. I have fixed the problem and it will make it into the next release due out in a few days. Unfortunately, in looking through your data I think you have a much bigger problem with your library ("all.lib"). You appear to have some non-unique family identifies in this file. For example:

linear;#DNA/CMC-Chapaev
linear;#DNA/MULE-MuDR
linear;#LINE/Penelope
...

These sequence identifiers are parsed as "family_id#class/subclass". The family_ID in these three cases is "linear;" which is probably causing all sorts of chaos in the adjudication algorithms of ProcessRepeats. Unfortunately, the sanity checking portion of RepeatMasker is not checking for unique family_ids, rather it's considering the complete id ("family_id#class/subclass") when looking for duplicates in your -lib file. So it's not warning you up-front about this issue. You may also want to check that you haven't mixed more than one RepeatModeler library together ( 'rnd-#_family-#' familiy ids) as these auto-generated family names will get re-used with each subsequent run. Finally, I am a bit confused as to why you are getting such strong matches to simple repeats. Take this alignment for example:

459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200 m_b1945s001i316

  Pin8879              947 TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT 996
                             i i i i   i                                     
C linear;#DNA/C       1278 TAATATATATATATATATATATATATATATATATATATATATATATATAT 1229

  Pin8879              997 ATATATATATACATATACATATATATATA 1025
                                      i     i           
C linear;#DNA/C       1228 ATATATATATATATATATATATATATATA 1200

It looks as if you used the "-nolow" flag, yet I don't see that in your command-line above. Typically RepeatMasker would have identified this region first as a perfect tandem repeat, keeping it from being matched to your "linear;#DNA/Crypton-H" family (in this case). I see in your log that simple repeat searching is being run, yet when I try this on my copy of RepeatMasker I see that TRF finds this and masks it, preventing any matches to anything in my *.lib file. I would like to followup on that. Could you run a simple test for me?

# Create a small sequence file "test.fa" containing the following record
>seq1-Pin8879
ACGTGCGGTAGGACTGATCTAGTCAGTGGCTAGCTGCTGGATGC
TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT
ATATATATATATATATATATATATATATAGACGTGCGACGATGTAGCT
AGACTACGTGCGCGCTATCGTGCTGATCATGCTGCTAGCTGATC

# Now run RepeatMasker with your all.lib library
RepeatMasker -e ncbi -html -gff -dir ./ -lib all.lib test.fa

This is what your test.fa.out file should look like:

   SW   perc perc perc  query         position in query    matching repeat          position in repeat
score   div. del. ins.  sequence      begin end   (left)   repeat   class/family  begin  end    (left)  ID

   78    0.0  0.0  0.0  seq1-Pin8879     58   123   (63) + (TA)n    Simple_repeat      1     66    (0)   1  

A single match to a TRF result and not:

459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200
Knnnk commented 1 week ago

Thank you for your attention and follow-up! Strangely, I didn't use the -nolow flag in my run. And this is the test output:

RepeatMasker version 4.1.5
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Custom Repeat Library: all.lib

analyzing file test.fa

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to all.lib sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done

The test.fa.out file is exactly the TRF matching result:

$ cat test.fa.out
   SW   perc perc perc  query         position in query    matching repeat          position in repeat
score   div. del. ins.  sequence      begin end   (left)   repeat   class/family  begin  end    (left)  ID

   78    0.0  0.0  0.0  seq1-Pin8879     58   123   (63) + (TA)n    Simple_repeat      1     66    (0)   1  
rmhubley commented 1 week ago

Well, that looks good. Without access to your sequence and library it's hard to reproduce these alignments. My suggestion is to use the latest version of RepeatMasker ( 4.1.7-p1 ), fix the IDs in your library, and run a test sequence ( say 10MB of your genome ). Check the *.alignments file to see if you are getting a ton of false matching to simple repeats to these families that contain small sections of simple repeats. If it looks good, then rerun the full genome.

Knnnk commented 5 days ago

Thank you very much for your patient guidance and help. I am a bit occupied at the moment, but I will let you know as soon as there are updates.