huanananan closed this issue 1 year ago
I have the same bug when masking:
- fungi
- Aureobasidium_melanogenum_cbs_110374_gca_000721775.Aureobasidium_pullulans_var._melanogenum_CBS_110374_v1.0.dna.toplevel.fa
- fungi
- Aureobasidium_pullulans_exf_150_gca_000721785.Aureobasidium_pullulans_var._pullulans_EXF-150_assembly_version_1.0.dna.toplevel.fa
There is another bug with a protist genome: Plasmodium_falciparum_palo_alto_uganda_gca_000521095.Plas_falc_Uganda_Palo-Alto_FUP_H_V1.dna.toplevel.fa
The log file:
main::main::postProcessSearch: FastaDB::substr - Error index out of bounds! (SeqID=, offset=56306, length=54 actualSeqLen=0) at /data_group/software/RepeatMasker/RepeatMasker line 6038.
Attempting to mask from 56306 to 56360 ( len = 54 ) at /data_group/software/RepeatMasker/RepeatMasker line 6039.
main::postProcessSearch('HASH(0x153c948)', 'SearchResultCollection=HASH(0x1866718)', 'HASH(0x18ab8e8)', 0, 1, 'FastaDB=HASH(0x1892ef0)', undef, '/data_group/project/homology_new_m...', 'HASH(0x1455760)', ...) called at /data_group/software/RepeatMasker/RepeatMasker line 2777
main::runTRFStage('HASH(0x153c948)', 'identifying Simple Repeats', 'batch 101 of 440', 'DIVERGED', '', '/data_group/project/homology_new_m...', '/data_group/project/homology_new_m...', 'NCBIBlastSearchEngine=HASH(0x16c2b08)', 101, ...) called at /data_group/software/RepeatMasker/RepeatMasker line 4595
main::runSearchStages('HASH(0x141fdd8)', '/data_group/software/RepeatMasker', 20, '/data_group/project/homology_new_m...', '/data_group/project/homology_new_m...', '/data_group/software/RepeatMasker/...', '/data_group/software/RepeatMasker/...', 440, 'NCBIBlastSearchEngine=HASH(0x16c2b08)', ...) called at /data_group/software/RepeatMasker/RepeatMasker line 1102
main::main::postProcessSearch: FastaDB::substr - Error index out of bounds!
That first issue appears to be a problem with the TRF program. I am trying to reproduce it so that I can see whether there is something we can do to fix TRF or work around it. Could you send me the RepeatMasker command line you used in the second issue, or let me know if it used the same general parameters as the previous issue? Also, I should point out that if you are only interested in simple repeat masking, you do not need to specify a species. Using "Eukaryota" as the species will extract virtually all TE families (save 9) from the famdb database into your cache area, so beware that this is both unnecessary for a -noint search and going to be quite large.
I could not reproduce the issue you are having with the "buffer overflow detected" error from TRF. This appears to be unique to your software/hardware configuration. I am suspicious that it might be related to the second issue you reported. I would be curious if you still see this after shortening the input filename as described below.
I was able to reproduce the second error message you reported "main::main::postProcessSearch: FastaDB::substr - Error index out of bounds!" which came from RepeatMasker, although I traced it back to TRF as well. TRF appears to have a problem with the long filenames you are using. I get the same error as you do for Plasmodium falciparum if I use the filename:
Plasmodium_falciparum_palo_alto_uganda_gca_000521095.Plas_falc_Uganda_Palo-Alto_FUP_H_V1.dna.toplevel.fa
However, it completes just fine if I rename that to 'foo.fa'. When I tracked it down, it appeared to be a problem with TRF: for some unknown reason, it doesn't write a complete output HTML file in some cases with the long names. For instance, I saw many TRF output files like:
<HTML><HEAD><TITLE>Plasmodium_falciparum_palo_alto_uganda_gca_000521095.Plas_falc_Uganda_Palo-Alto_FUP_H_V1.dna.toplevel.fa_batch-115.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY bgcolor="#FBFile 2 of 2
Found at i:45876 original size:2 final size:2
<A NAME="45777--46492,2,366.5,2,811"></A><A HREF="http://tandem.bu.edu/trf/trf.definitions.html#alignment" target ="explanation">Alignment explanation</A><BR><BR>
Indices: 45777--46492 Score: 150
Period size: 2 Copynumber: 366.5 Consensus size: 2
45767 AAAATGAAAC
* ** * * * * * *
45777 AT TT AT GC AT AT A- AT AC AG AT GT AT CT A- AT A- AT TT GT AT
1 AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT AT
This is clearly missing the header normally included by TRF:
<HTML><HEAD><TITLE>Plasmodium_falciparum_palo_alto_uganda_gca_000521095.Plas_falc_Uganda_Palo-Alto_FUP_H_V1.dna.toplevel.fa_batch-20.masked.2.7.7.80.10.50.10.txt.html</TITLE></HEAD><BODY bgcolor="#FBF8BC"><PRE>
Tandem Repeats Finder Program written by:
Gary Benson
Program in Bioinformatics
Boston University
Version 4.09
Sequence: KI927384frag-20 dna:supercontig supercontig:Plas_falc_Uganda_Palo-Alto_FUP_H_V1:KI927384:1:3543875:1 REF
Parameters: 2 7 7 80 10 50 10
Pmatch=0.80,Pindel=0.10
tuple sizes 0,4,5,7
tuple distances 0, 29, 159, 500
Length: 59127
ACGTcount: A:0.38, C:0.07, G:0.08, T:0.33
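A quick way to tell a complete TRF report from a truncated one is to look for the "Sequence:" header line shown above. The following is a minimal sketch in Python, not part of RepeatMasker or TRF; the function name `trf_sequence_id` and the shortened sample strings are hypothetical and only mimic the two outputs quoted in this thread.

```python
import re

# Matches the "Sequence: <id>" header line that appears near the top of a
# complete TRF HTML report (see the header quoted above).
SEQUENCE_LINE = re.compile(r"^[ \t]*Sequence:[ \t]*(\S+)", re.MULTILINE)


def trf_sequence_id(html_text):
    """Return the sequence ID from a TRF report, or None if the line is absent."""
    match = SEQUENCE_LINE.search(html_text)
    return match.group(1) if match else None


# Shortened stand-ins for the complete and truncated reports quoted above.
complete = """<HTML><HEAD><TITLE>x.txt.html</TITLE></HEAD><BODY bgcolor="#FBF8BC"><PRE>
Tandem Repeats Finder Program written by:
Version 4.09
Sequence: KI927384frag-20 dna:supercontig
Parameters: 2 7 7 80 10 50 10
"""
truncated = """<HTML><HEAD><TITLE>x.txt.html</TITLE></HEAD><BODY bgcolor="#FBFile 2 of 2
Found at i:45876 original size:2 final size:2
Indices: 45777--46492 Score: 150
"""

print(trf_sequence_id(complete))
print(trf_sequence_id(truncated))
```

A report for which this returns None has no sequence identifier to attach results to, which matches the empty "SeqID=" in the error message.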
The important part missing from the output file is the "Sequence: " line, which tells RepeatMasker which sequence the results apply to. Without this information, RepeatMasker attempted to mask/cut annotations against an empty sequence identifier and gave an error. Try aliasing or shortening your filenames before processing with RepeatMasker. In future versions of RM I may switch to using internally generated names for temporary files to avoid this problem, but for now this is a quick fix.
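One way to alias a long filename is sketched below in Python. It assumes a Linux filesystem with symlink support, and that RepeatMasker derives its temporary file names from the name it is given (as the batch file names in the logs suggest). The helper names `alias_fasta` and `repeatmasker_command` and the alias `g0001.fa` are hypothetical. The command uses only the flags from the original report and leaves out -species, which is not needed for a -noint search. The sketch builds the command but does not run it.

```python
import tempfile
from pathlib import Path


def alias_fasta(fasta_path, alias_dir, alias_name):
    """Create a short-named symlink to a FASTA file and return the alias path."""
    alias_dir = Path(alias_dir)
    alias_dir.mkdir(parents=True, exist_ok=True)
    alias = alias_dir / alias_name
    # Replace a stale alias left over from an earlier run.
    if alias.is_symlink() or alias.exists():
        alias.unlink()
    alias.symlink_to(Path(fasta_path).resolve())
    return alias


def repeatmasker_command(alias, out_dir):
    """Build a simple-repeat-only command using the flags from the report."""
    return ["RepeatMasker", "-xsmall", "-noint", "-no_is", "-e", "rmblast",
            "-pa", "10", "-dir", str(out_dir), "-gff", str(alias)]


# Demonstration with a throwaway file standing in for a long-named genome.
with tempfile.TemporaryDirectory() as tmp:
    long_name = Path(tmp) / ("Plasmodium_falciparum_palo_alto_uganda_gca_000521095."
                             "Plas_falc_Uganda_Palo-Alto_FUP_H_V1.dna.toplevel.fa")
    long_name.write_text(">demo\nACGT\n")
    short = alias_fasta(long_name, Path(tmp) / "aliases", "g0001.fa")
    print(" ".join(repeatmasker_command(short, Path(tmp) / "out")))
```

Output files will carry the alias name, so keep a record of which alias maps to which genome if many are processed. If symlinks are not an option, copying or renaming the file serves the same purpose.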
It's working now. Perhaps no one else uses as many species and genomes as I do.
Thanks for your reply. For those genomes, I used tantan as a substitute.
All the command lines I used are the same, except for the genome.
Thanks again for your advice. I mistakenly thought "-species" was required for every search.
The first and second issues both have the same solution: just simplify the file name, and then everything works well.
Describe the issue
I am masking the simple repeats in some genomes. I have tried using more memory and fewer threads, but that did not help.
Reproduction steps
RepeatMasker -xsmall -noint -no_is -e rmblast -pa 10 -species Eukaryota -dir mask_the_simple_repeat/output/fungi -gff Aureobasidium_melanogenum_cbs_110374_gca_000721775.Aureobasidium_pullulans_var._melanogenum_CBS_110374_v1.0.dna.toplevel.fa
The genome file was downloaded from Ensembl Fungi, version 54.
Log output
*** buffer overflow detected ***: /data_group/software/trf409.linux64 terminated
identifying Simple Repeats in batch 3 of 477
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fd043e6a697]
/lib64/libc.so.6(+0x116812)[0x7fd043e68812]
/data_group/software/trf409.linux64[0x401417]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd043d74555]
/data_group/software/trf409.linux64[0x401619]
======= Memory map: ========
00400000-00419000 r-xp 00000000 00:c6 3277556016 /data_group/software/trf409.linux64
00618000-00619000 r--p 00018000 00:c6 3277556016 /data_group/software/trf409.linux64
00619000-0061a000 rw-p 00019000 00:c6 3277556016 /data_group/software/trf409.linux64
0061a000-0063c000 rw-p 00000000 00:00 0
02388000-023a9000 rw-p 00000000 00:00 0 [heap]
7fd043b3c000-7fd043b51000 r-xp 00000000 08:02 324127956 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fd043b51000-7fd043d50000 ---p 00015000 08:02 324127956 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fd043d50000-7fd043d51000 r--p 00014000 08:02 324127956 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fd043d51000-7fd043d52000 rw-p 00015000 08:02 324127956 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fd043d52000-7fd043f16000 r-xp 00000000 08:02 324280176 /usr/lib64/libc-2.17.so
7fd043f16000-7fd044115000 ---p 001c4000 08:02 324280176 /usr/lib64/libc-2.17.so
7fd044115000-7fd044119000 r--p 001c3000 08:02 324280176 /usr/lib64/libc-2.17.so
7fd044119000-7fd04411b000 rw-p 001c7000 08:02 324280176 /usr/lib64/libc-2.17.so
7fd04411b000-7fd044120000 rw-p 00000000 00:00 0
7fd044120000-7fd044221000 r-xp 00000000 08:02 324286336 /usr/lib64/libm-2.17.so
7fd044221000-7fd044420000 ---p 00101000 08:02 324286336 /usr/lib64/libm-2.17.so
7fd044420000-7fd044421000 r--p 00100000 08:02 324286336 /usr/lib64/libm-2.17.so
7fd044421000-7fd044422000 rw-p 00101000 08:02 324286336 /usr/lib64/libm-2.17.so
7fd044422000-7fd044444000 r-xp 00000000 08:02 324127969 /usr/lib64/ld-2.17.so
7fd04462e000-7fd044631000 rw-p 00000000 00:00 0
7fd044641000-7fd044643000 rw-p 00000000 00:00 0
7fd044643000-7fd044644000 r--p 00021000 08:02 324127969 /usr/lib64/ld-2.17.so
7fd044644000-7fd044645000 rw-p 00022000 08:02 324127969 /usr/lib64/ld-2.17.so
7fd044645000-7fd044646000 rw-p 00000000 00:00 0
7ffe5332f000-7ffe53354000 rw-p 00000000 00:00 0 [stack]
7ffe533d6000-7ffe533da000 r--p 00000000 00:00 0 [vvar]
7ffe533da000-7ffe533dc000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
sh: line 1: 1038937 Aborted (core dumped) /data_group/software/trf409.linux64 /data_group/project/homology_new_method/mask_the_simple_repeat/cache/RM_1037768.SunDec250134042022/Aureobasidium_melanogenum_cbs_110374_gca_000721775.Aureobasidium_pullulans_var._melanogenum_CBS_110374_v1.0.dna.toplevel.fa_batch-18.masked 2 7 7 80 10 50 10 2> trfResults-1671903273-1038897.err
Environment (please include as much of the following information as you can find out):
- Installation: from repeatmasker.org
- RepeatMasker version (RepeatMasker -v can be used to find this): RepeatMasker version 4.1.4
- Database: the Dfam contained in RepeatMasker
- Operating system (uname -a and lsb_release -a can be used to find this): Linux gatk1 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Additional context
Most genomes are processed normally, but a few produce these errors. The same happens with vertebrates, plants, etc.