Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Other
231 stars 49 forks source link

TRF results file is malformed when input filename is long and more than 120 repeats are found in a batch #109

Closed FredericBGA closed 2 months ago

FredericBGA commented 3 years ago

Hi,

RepeatMasker (4.0.9) ends with FATAL ERROR: RepeatMasker giving up.

I've launched RepeatMasker on a file that holds a larger number of repeats.

I've noticed that the TRF 4.0.9 command line ie: trf-4.0.9 a_batch-2.masked 2 3 5 75 20 33 7 will produced two output files. This is documented: If the number of repeats found is greater than 120, multiple linked repeat tables are produced. TRF documentation.

It looks like that in my particular case (problem with my entry file? see my comment below) the second file does not hold any Sequence: string. So this line is never reached get sequence id in the TRF parser.

Then the call to getQueryName() will retrieve an empty value. And the program will end with Error index out of bounds! error.

Can you reproduce this? (I'm afraid that I won't be able to share the input file as it's private data). How can we fix this?

FredericBGA commented 3 years ago

Hello

I think that the problem is an issue of TRF with my file... Indeed there more than 120 repeats but the second file (filename.2.txt.html) does not hold Sequence:

batch-2.masked.2.3.5.75.20.33.7.txt.htmlAlignment explanation

Indices: 58510--58569 Score: 50 Period size: 5 Copynumber: 12.0 Consensus size: 5 58500 ATACTGGGTA

A normal file should look like:

batch-2.masked.2.3.5.75.20.33.7.txt.html
Tandem Repeats Finder Program written by:

                 Gary Benson
      Program in Bioinformatics
          Boston University

Version 4.09

Sequence: NODE_9_length_175571_cov_32.628428frag-2 CHUNK number:0 size:175571 offset:0

Parameters: 2 3 5 75 20 33 7
jebrosen commented 3 years ago

These are both problems, but at a first glance they seem unlikely to be related to each other.

RepeatMasker (4.0.9) ends with FATAL ERROR: RepeatMasker giving up.

Can you post the complete log output?


It looks like that in my particular case (problem with my entry file? see my comment below) the second file does not hold any Sequence: string. So this line is never reached get sequence id in the TRF parser.

I appreciate that you have tried to troubleshoot this further. However, this seems unlikely to me because it does not look like RepeatMasker is aware that a second file is sometimes generated. RepeatMasker only reads the first results file, so I would expect it to simply be missing some of the TRF results in this situation. (cc @rmhubley - I assume this parsing code was written before TRF added nicer output formats e.g. -d/-h/-ngs)

Then the call to getQueryName() will retrieve an empty value. And the program will end with Error index out of bounds! error.

Which particular call to getQueryName() are you referring to? Did you find this in the log output or through additional debugging you performed?

FredericBGA commented 3 years ago

Hi, thank you for your reply, I really appreciate.

The getQueryName() involved is located line 5924 (I've added a permalink in my comment).

I will try to see if I can share the file I'm trying to mask tomorrow.

I've add print statements (a lot!). And I'm sure that the second file is parsed... I've the TRF command line, and when I launch it I retrieve two HTML file (xxxx.1.txt.html and xxxx.2.txt.html depending on the input file name)

identifying Simple Repeats in batch 2 of 3 FRED TRF COMMAND LINE IS =/softs/bioinfo/others_bin/trf-4.0.9 /gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021/baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked 2 3 5 75 20 33 7 = FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.1.txt.html= in parseResult TRF.pm qryID=NODE_9_length_175571_cov_32.628428frag-2=

[cut here]

FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.2.txt.html= in parseResult TRF.pm qryID== ERROR qryID is not defined! FRED HERE FILE IS =/gpfs/workspace/tmp/fsapet/testRepeatMasker/RM_6884.TueMay181437362021//baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.2.txt.html

You can see that the file name is quite strange, maybe I've a glitch in the file, because aaaa[] works and baaaa[] fails... And the second file is a copy of the first one... And diff command returns nothing.

jebrosen commented 3 years ago

The getQueryName() involved is located line 5924 (I've added a permalink in my comment).

Line 5924 is in the function getSegments(), which as far as I can tell has been unused since the first commit on GitHub back in 2017. I don't see how this is possible.

And I'm sure that the second file is parsed...

My apologies! I misread this part of the code, and I see where this happens now. I also think I see the bug, although I haven't looked at a multi-file TRF output to be sure. Each call to parseOutput resets qryID = "", but for files other than the first I assume it should be reusing the last qryID seen in the previous file.

I expect that the best way to fix this properly is to parse a format other than the HTML output, but a less invasive fix would be to preserve the previous qryID across calls to parseOutput.


You can see that the file name is quite strange, maybe I've a glitch in the file, because aaaa[] works and baaaa[] fails...

Now I'm a bit lost again. Are you really only changing the name of the input file to RepeatMasker and that consistently makes a difference? Is it the entire RepeatMasker run that "works" or "fails", or something else? How do the TRF result files compare between the two runs, and in particular does the working run have a Sequence: line early in the second file that is missing in the failed run?

FredericBGA commented 3 years ago

My apologies! The getQueryName() involved is line 5974, in postProcessSearch (here)

I've just created a git repository. https://github.com/FredericBGA/trf-repeatmasker (when I clone the repo in another location, I can reproduce the behavior) I've seen with my colleague, I can share the involved fasta file.

The README holds my command lines.

I launch TRF on the file with the long name. The output is weird as the second file baaaa[]aa.2.txt.html does not hold the Sequence:.

I copy the file with the long name. I launch TRF on this new file. The output seems good as the second file batch-2.masked.2.3.5.75.20.33.7.2.txt.html holds the Sequence:

I'm really sorry if the issue is located between my chair and my keyboard, but this let me stuck and I would like to understand! Thank you for your help and ideas.

jebrosen commented 3 years ago

I've just created a git repository.

I think this sample input/output cleared up all of my questions; thanks so much for reporting this and for helping track down the cause.

I'm really sorry if the issue is located between my chair and my keyboard, but this let me stuck and I would like to understand! Thank you for your help and ideas.

I've reproduced this, and I'm confident it's a bug/deficiency in TRF, not in RepeatMasker itself nor user error.

Here is something that jumped out at me from one of the "1 of 2" files:

<HTML><HEAD><TITLE>baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_batch-2.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY bgcolor="#FBFile 1 of 2

F8BC"><PRE>

<BODY bgcolor="#FBFile 1 of 2<newline><newline>F8BC"> is clearly wrong. I did some sleuthing, and found a hardcoded line length of 200 in https://github.com/Benson-Genomics-Lab/TRF/blob/master/src/trfclean.h#L569. The long filename causes some lines to be longer than that, and their contents are pushed into other lines and/or skipped when reassembling the complete output later.

The result of this bug is that RepeatMasker will fail on any file whose name is too long (I estimate 96 characters), and where any batch has more than 120 repeats identified by TRF.


For a workaround, I would suggest running RepeatMasker again with a shorter input filename -- hopefully this is the only problem that affects your input file! If that does work, I propose that RepeatMasker use shorter names for some of these intermediate files. Shorter names are both more convenient to work with when troubleshooting and less likely to expose bugs such as these in other programs now and in the future.

FredericBGA commented 3 years ago

very nice! I'm happy to see that the issue is reproducible. unfortunately I can not easily managed the length of the input filenames (as RepeatMasker is launche in this particular case by another tool: Maker I will open and describe the issue to the TRF repo! many thanks!

rmhubley commented 2 months ago

Sorry for the absurdly late response. In an attempt to cleanup the open issues this issue has been closed. Please submit a new issue if this is a still a current problem.