Closed FredericBGA closed 2 months ago
Hello
I think that the problem is an issue of TRF with my file...
Indeed there more than 120 repeats but the second file (filename.2.txt.html
) does not hold Sequence:
batch-2.masked.2.3.5.75.20.33.7.txt.html Alignment explanation
Indices: 58510--58569 Score: 50 Period size: 5 Copynumber: 12.0 Consensus size: 5 58500 ATACTGGGTA
A normal file should look like:
batch-2.masked.2.3.5.75.20.33.7.txt.html Tandem Repeats Finder Program written by: Gary Benson Program in Bioinformatics Boston University Version 4.09 Sequence: NODE_9_length_175571_cov_32.628428frag-2 CHUNK number:0 size:175571 offset:0 Parameters: 2 3 5 75 20 33 7
These are both problems, but at a first glance they seem unlikely to be related to each other.
RepeatMasker (4.0.9) ends with FATAL ERROR: RepeatMasker giving up.
Can you post the complete log output?
It looks like that in my particular case (problem with my entry file? see my comment below) the second file does not hold any Sequence: string. So this line is never reached get sequence id in the TRF parser.
I appreciate that you have tried to troubleshoot this further. However, this seems unlikely to me because it does not look like RepeatMasker is aware that a second file is sometimes generated. RepeatMasker only reads the first results file, so I would expect it to simply be missing some of the TRF results in this situation. (cc @rmhubley - I assume this parsing code was written before TRF added nicer output formats e.g. -d
/-h
/-ngs
)
Then the call to getQueryName() will retrieve an empty value. And the program will end with Error index out of bounds! error.
Which particular call to getQueryName()
are you referring to? Did you find this in the log output or through additional debugging you performed?
Hi, thank you for your reply, I really appreciate.
The getQueryName()
involved is located line 5924 (I've added a permalink in my comment).
I will try to see if I can share the file I'm trying to mask tomorrow.
I've add print statements (a lot!). And I'm sure that the second file is parsed... I've the TRF command line, and when I launch it I retrieve two HTML file (xxxx.1.txt.html and xxxx.2.txt.html depending on the input file name)
identifying Simple Repeats in batch 2 of 3 FRED TRF COMMAND LINE IS =/softs/bioinfo/others_bin/trf-4.0.9 /gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021/baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked 2 3 5 75 20 33 7 = FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.1.txt.html= in parseResult TRF.pm qryID=NODE_9_length_175571_cov_32.628428frag-2=
[cut here]
FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.2.txt.html= in parseResult TRF.pm qryID== ERROR qryID is not defined! FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.2.txt.html
You can see that the file name is quite strange, maybe I've a glitch in the file, because aaaa[]
works and baaaa[]
fails... And the second file is a copy of the first one... And diff
command returns nothing.
The getQueryName() involved is located line 5924 (I've added a permalink in my comment).
Line 5924 is in the function getSegments()
, which as far as I can tell has been unused since the first commit on GitHub back in 2017. I don't see how this is possible.
And I'm sure that the second file is parsed...
My apologies! I misread this part of the code, and I see where this happens now. I also think I see the bug, although I haven't looked at a multi-file TRF output to be sure. Each call to parseOutput
resets qryID = ""
, but for files other than the first I assume it should be reusing the last qryID
seen in the previous file.
I expect that the best way to fix this properly is to parse a format other than the HTML output, but a less invasive fix would be to preserve the previous qryID
across calls to parseOutput
.
You can see that the file name is quite strange, maybe I've a glitch in the file, because
aaaa[]
works andbaaaa[]
fails...
Now I'm a bit lost again. Are you really only changing the name of the input file to RepeatMasker
and that consistently makes a difference? Is it the entire RepeatMasker
run that "works" or "fails", or something else? How do the TRF result files compare between the two runs, and in particular does the working run have a Sequence:
line early in the second file that is missing in the failed run?
My apologies! The getQueryName()
involved is line 5974, in postProcessSearch
(here)
I've just created a git repository. https://github.com/FredericBGA/trf-repeatmasker (when I clone the repo in another location, I can reproduce the behavior) I've seen with my colleague, I can share the involved fasta file.
The README holds my command lines.
I launch TRF on the file with the long name. The output is weird as the second file baaaa[]aa.2.txt.html
does not hold the Sequence:
.
I copy the file with the long name.
I launch TRF on this new file. The output seems good as the second file batch-2.masked.2.3.5.75.20.33.7.2.txt.html
holds the Sequence:
I'm really sorry if the issue is located between my chair and my keyboard, but this let me stuck and I would like to understand! Thank you for your help and ideas.
I've just created a git repository.
I think this sample input/output cleared up all of my questions; thanks so much for reporting this and for helping track down the cause.
I'm really sorry if the issue is located between my chair and my keyboard, but this let me stuck and I would like to understand! Thank you for your help and ideas.
I've reproduced this, and I'm confident it's a bug/deficiency in TRF, not in RepeatMasker itself nor user error.
Here is something that jumped out at me from one of the "1 of 2" files:
<HTML><HEAD><TITLE>baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY bgcolor="#FBFile 1 of 2
F8BC"><PRE>
<BODY bgcolor="#FBFile 1 of 2<newline><newline>F8BC">
is clearly wrong. I did some sleuthing, and found a hardcoded line length of 200 in https://github.com/Benson-Genomics-Lab/TRF/blob/master/src/trfclean.h#L569. The long filename causes some lines to be longer than that, and their contents are pushed into other lines and/or skipped when reassembling the complete output later.
The result of this bug is that RepeatMasker will fail on any file whose name is too long (I estimate 96 characters), and where any batch has more than 120 repeats identified by TRF.
For a workaround, I would suggest running RepeatMasker
again with a shorter input filename -- hopefully this is the only problem that affects your input file! If that does work, I propose that RepeatMasker use shorter names for some of these intermediate files. Shorter names are both more convenient to work with when troubleshooting and less likely to expose bugs such as these in other programs now and in the future.
very nice! I'm happy to see that the issue is reproducible. unfortunately I can not easily managed the length of the input filenames (as RepeatMasker is launche in this particular case by another tool: Maker I will open and describe the issue to the TRF repo! many thanks!
Sorry for the absurdly late response. In an attempt to cleanup the open issues this issue has been closed. Please submit a new issue if this is a still a current problem.
Hi,
RepeatMasker (
4.0.9
) ends withFATAL ERROR: RepeatMasker giving up.
I've launched RepeatMasker on a file that holds a larger number of repeats.
I've noticed that the TRF
4.0.9
command line ie:trf-4.0.9 a_batch-2.masked 2 3 5 75 20 33 7
will produced two output files. This is documented:If the number of repeats found is greater than 120, multiple linked repeat tables are produced
. TRF documentation.It looks like that in my particular case (problem with my entry file? see my comment below) the second file does not hold any
Sequence:
string. So this line is never reached get sequence id in the TRF parser.Then the call to getQueryName() will retrieve an empty value. And the program will end with
Error index out of bounds!
error.Can you reproduce this? (I'm afraid that I won't be able to share the input file as it's private data). How can we fix this?